ConvertHTML
Questions answered by this recipe
- How do I convert an HTML page to PmWiki markup?
- How can I migrate a site to PmWiki or import HTML pages?
Description
PmWiki markup does not support all of the HTML markup so a 100% conversion is not possible. However, PmWiki can make replacements to the text as it is being edited or saved. ConvertHTML implements a relatively comprehensive set of rules for converting HTML tags to wiki markup.
To install this recipe:
- download convert-html.php to your cookbook directory
- add the following line to your configuration file:
if ($action=='edit') include_once("$FarmD/cookbook/convert-html.php");
What it does
ConvertHTML uses the $ROEPatterns array to translate most HTML tags, leaving the rest intact. All replacements are case-insensitive, and attributes may be surrounded by single or double quotes or, in some cases, left unquoted. The XHTML / at the end of a lone tag is always optional.
Any HTML inside [=...=] or [@...@] tags will be left untouched.
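The exclusion can be pictured as a protect, convert, restore cycle. The following is a minimal standalone sketch, not the recipe's actual code: the function name sketch_protect_and_convert and the placeholder scheme are assumptions made for illustration, and the recipe itself may rely on PmWiki's own mechanisms instead.

```php
<?php
// Hypothetical helper: hides [=...=] and [@...@] spans from a converter.
function sketch_protect_and_convert($text, callable $convert) {
  $kept = array();
  // Stash each protected span and leave a numbered placeholder behind.
  $text = preg_replace_callback('/\[([=@]).*?\1\]/s', function ($m) use (&$kept) {
    $kept[] = $m[0];
    return "\x01" . (count($kept) - 1) . "\x01";
  }, $text);
  // Run the (assumed) HTML-to-wiki conversion on the remaining text.
  $text = $convert($text);
  // Put the stashed spans back unchanged.
  return preg_replace_callback('/\x01(\d+)\x01/', function ($m) use ($kept) {
    return $kept[(int)$m[1]];
  }, $text);
}

// Usage with a toy converter that only knows <b>.
$toy = function ($s) { return str_ireplace(array('<b>', '</b>'), "'''", $s); };
echo sketch_protect_and_convert("<b>a</b> [=<b>b</b>=]", $toy), "\n";
```

Only the text outside the protected spans reaches the converter, so the HTML inside [=...=] survives verbatim.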
The following tags will be parsed only if they contain no attributes: B, BIG, BLOCKQUOTE, BODY, CODE, DD, DEL, EM, HEAD, HR, HTML, I, INS, PRE, SMALL, STRONG, SUB, SUP, TITLE, TT.
The following tags will be parsed even if they contain attributes: A, BR, DIV, DL, DT, FORM, H1..6, IMG, INPUT, LI, OL, OPTION, P, SELECT, SPAN, TABLE, TEXTAREA, TD, UL. These attributes will be assigned within an applicable (:...:) or %...% ... %% statement. For the most part, the validity or effectiveness of these attributes as PmWiki markup isn't verified.
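To illustrate how pattern-based translation of attribute-less tags might look, here is a simplified sketch. The $rules array and sketch_convert function are hypothetical and much smaller than the recipe's real rule set; they only show case-insensitive matching, the optional XHTML slash, and how a tag carrying attributes is left intact.

```php
<?php
// Hypothetical, simplified rules; the recipe's actual patterns may differ.
$rules = array(
  '#<(b|strong)>(.*?)</\1>#is' => "'''$2'''",   // bold
  '#<(i|em)>(.*?)</\1>#is'     => "''$2''",     // emphasis
  '#<hr\s*/?>#i'               => "\n----\n",   // horizontal rule, "/" optional
);

// Apply each pattern in turn, similar in spirit to a replace-on-edit pass.
function sketch_convert($html, array $rules) {
  foreach ($rules as $pattern => $replacement) {
    $html = preg_replace($pattern, $replacement, $html);
  }
  return $html;
}

echo sketch_convert('<B>bold</b>, <em>emphasis</em>, <b class="x">untouched</b>', $rules), "\n";
```

Because the patterns require the tag name to be followed directly by ">", a tag such as <b class="x"> does not match and stays in the text as HTML.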
Some additional notes:
- <meta name="description|keywords" content="..." /> is also recognised, as are HTML comments <!-- ... -->.
- Link and image targets that start with a . or a / are prepended with Path:, those that contain neither / nor : but do contain a . are prepended with Attach:.
- As PmWiki doesn't support spaces within named anchors ([[#...]]), these spaces are replaced with the _ character.
- IMG tags with alt or title attributes are correctly handled, and align=left|right on an image results in the markup %lfloat% or %rfloat% at the beginning of the line.
- Ordered and unordered lists are supported to an arbitrary depth.
- Attributes defined for a TR are only applied to the first TD of the TR.
- Only the clear attribute is supported for BR; having it set to all, left or right results in [[<<]] instead of \\ markup.
- The generated markup for form elements may differ from the usual PmWiki markup conventions, which make use of positional arguments instead of named arguments. The markup should still be valid, however.
- TEXTAREA is only supported for single-line default values, as PmWiki markup doesn't support it for multiple lines.
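A few of the rules above (target prefixes, anchor names, and BR handling) can be restated as small functions. This is only a sketch of the described behaviour under stated assumptions: the function names are hypothetical, the recipe implements these rules as regular expressions rather than helper functions, and treating the clear value case-insensitively is an assumption based on the general case-insensitivity noted earlier.

```php
<?php
// Hypothetical helper: choose the prefix for a link or image target.
function sketch_prefix_target($target) {
  // Targets starting with "." or "/" get a Path: prefix.
  if (preg_match('#^[./]#', $target)) {
    return 'Path:' . $target;
  }
  // Targets with neither "/" nor ":" but with a "." are treated as attachments.
  if (strpbrk($target, '/:') === false && strpos($target, '.') !== false) {
    return 'Attach:' . $target;
  }
  // Everything else (for example absolute URLs) is left unchanged.
  return $target;
}

// Hypothetical helper: named anchors cannot contain spaces in PmWiki.
function sketch_anchor($name) {
  return '[[#' . str_replace(' ', '_', $name) . ']]';
}

// Hypothetical helper: BR becomes [[<<]] when clear is all, left or right.
function sketch_br($clear = null) {
  $value = $clear === null ? '' : strtolower($clear);
  return in_array($value, array('all', 'left', 'right'), true) ? '[[<<]]' : '\\\\';
}

echo sketch_prefix_target('/images/logo.png'), "\n";
echo sketch_prefix_target('photo.jpg'), "\n";
echo sketch_anchor('my section'), "\n";
```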
Usage
- Install the recipe
- Paste HTML into a PmWiki edit box
- Press "Preview" or "Save and edit"
- Verify the resulting markup
Notes
The $ROEPatterns array is available in the PmWiki core starting from pmwiki-2.2.0-beta45. For earlier versions, you'll need to implement Cookbook.ROEPatterns or replace the reference in the cookbook to use $ROSPatterns.
Suggestions, fixes and improvements to the regular expressions involved are quite positively encouraged.
I am aware that <p>...</p> tags end up having two empty rows between blocks, but this shouldn't affect the page's rendering and I'm not quite sure how to fix this in a robust manner.
If you use SourceBlock, you may need to add the following to your config file just before including convert-html.php:
$ROEPatterns['#\(:(code|source)(?:\s+.*?)?:\).*?\(:\1e?nd:\)#sei'] = 'Keep(stripslashes("$0"), "H")';
I haven't actually tested the html2wiki program mentioned on the talk page, but as far as I can tell from its source files this recipe handles all of the markup also handled by html2wiki.
Release Notes
- 20210207 : update for PHP 7.3-8.0.
- 20150827 : fix incomplete/missing definitions, reported by Oliver Betz.
- 20150816 : update for PHP 5.5, requires PmWiki 2.2.58 or later.
- 2011-02-16
  - bugfixed quotes in [=...=] and [@...@] exclusion (reported by Maxim)
- 2010-12-23
  - added [=...=] and [@...@] exclusion
- 2010-04-20
  - added U (suggested by overtones99)
  - added HTML, HEAD and BODY removal (suggested by Oliver Betz)
  - bugfix: <title> markup (reported by Oliver Betz)
- 2009-08-25
  - A accesskey, rel, and target attributes are handled, with <a ... target="_blank"> becoming %newwin%[[...]] (suggested by overtones99)
  - bugfix: using stripslashes instead of stripmagic
- 2009-04-20
- 2008-10-07
  - better documentation
  - bugfixes: white space in output, DL lists
  - IMG alt/title and align attributes
  - better A names and targets
- 2008-10-05 : first public release
See Also
- Cookbook.ROEPatterns Replace On Edit
- Cookbook.ROSPatterns How to use $ROSPatterns, $ROEPatterns, and $TROEPatterns?
? - http://search.cpan.org/~diberri/HTML-WikiConverter-0.63/lib/HTML/WikiConverter.pm or the online version at http://toolserver.org/~diberri/cgi-bin/html2wiki/
- Cookbook.WikiMarkupConversion How to convert other wiki markups for PmWiki
Contributors
Comments
See discussion at ConvertHTML-Talk