ConvertHTML-Talk
Comments
Alternative: html2wiki
There is a Perl program, html2wiki, which does a good job. You can use the converter on the web page, or install the program.
It can be installed from CPAN in the usual Perl way, or some Linux distributions may have it as a separate package, such as libhtml-wikiconverter-perl.
One needs to install both the HTML::WikiConverter module and the HTML::WikiConverter::PmWiki module (which is the PmWiki "dialect" module).
The html2wiki script is a standalone program which takes an HTML input file and creates wikified output. You can then cut and paste the output into the wiki (or use your favourite editor, see EmacsPmWikiMode and Pywe).
For example:
html2wiki --dialect=PmWiki input.html >output.wiki
20 Sept, 2022
html2wiki worked for me using:
sudo apt-get update
sudo apt-get -y install libhtml-wikiconverter-dokuwiki-perl
then downloading the PmWiki dialect module from the repository below and following its build and install instructions:
https://github.com/gitpan/HTML-WikiConverter-PmWiki
Then, for example:
html2wiki --dialect=PmWiki input.html >output.wiki
Throws deprecated error with PHP 7.3
Running PmWiki 2.2.134 with PHP 7.3, ConvertHTML produces a "Deprecated: Function create_function() is deprecated in .../pmwiki.php on line 501" error.
No problem for me but maybe worth mentioning. OliverBetz February 07, 2021, at 04:38 PM
Updated for PHP 7.3-8.0 today. --Petko February 07, 2021, at 05:04 PM
Awesome, thanks! OliverBetz February 07, 2021, at 07:42 PM
Fewer conversions in version 20150816
Is it intentional that the 20150816 version doesn't convert <p>, tables and much more that the 2011-02-16 version handled?
No, it is an omission. Thanks for noticing -- should be fixed now. --Petko August 27, 2015, at 02:52 PM
Using ConvertHTML in another recipe --tamouse June 24, 2012, at 12:16 PM
I am looking at creating a recipe that will do the conversion of HTML outside the edit cycle. Would it be possible to use this recipe in that way?
Errors?
Version 2011-02-16 converts code included in [=...=] or [@...@]. Example:
$LinkPageSelfFmt = "<span class='selflink'>\$LinkText</span>";
becomes
$LinkPageSelfFmt = "<span class='selflink'>$LinkText</span>";
OliverBetz 2011-05-14
The latest version is giving syntax errors for me when editing certain pages (that contain no HTML):
Parse error: syntax error, unexpected ':', expecting T_VARIABLE or '$' in /home/smspower/public_html/pmwiki.php(1691) : regexp code on line 18 Fatal error: preg_replace() [<a href='function.preg-replace'>function.preg-replace</a>]: Failed evaluating code: Keep(stripslashes("[@ ... in /home/smspower/public_html/pmwiki.php on line 1691
The markup snippet is part-way through my Site.LocalTemplates. I'm not sure what the problem is, it was fine with 2009-08-25.
preg_replace call with the PREG_REPLACE_EVAL modifier. Fixed now with version 2011-02-16. --Eemeli Aro February 16, 2011, at 05:49 AM
font face not converted
I've just tried to convert some text that came from Google showing me a Word document. It took care of most issues, but I had to clean up a few hundred "<font face="Arial" size="5">DITA </font><font face="Arial" size="6">1</font>" type of things. Any possibility these could be included in the ROE patterns? Also & nbsp ; (ampersand-nbsp-semicolon) is left untranslated. --Peter Bowers May 11, 2010, at 08:19 AM
$ROEPatterns['#</?font([^>]*)>#i'] = '';
&nbsp; is left as it is since it's valid PmWiki markup as well. To replace them with normal spaces, you could try adding the following to your config. --Eemeli Aro May 11, 2010, at 09:18 AM
$ROEPatterns['#&nbsp;#'] = ' ';
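The font-tag and &nbsp; replacements discussed above can be tried outside PmWiki before adding them to a config. This is only a minimal sketch in Python, assuming its `re` module treats these simple expressions the same way PHP's PCRE does; the function name `strip_font_and_nbsp` is made up for illustration and is not part of the recipe.

```python
import re

# Patterns transcribed from the $ROEPatterns lines in this thread
# (the PHP '#...#i' delimiters and modifier become re.IGNORECASE).
FONT_TAG = re.compile(r"</?font([^>]*)>", re.IGNORECASE)
NBSP_ENTITY = re.compile(r"&nbsp;")


def strip_font_and_nbsp(html: str) -> str:
    """Drop opening/closing <font> tags and turn &nbsp; into a plain space."""
    without_font = FONT_TAG.sub("", html)
    return NBSP_ENTITY.sub(" ", without_font)


if __name__ == "__main__":
    sample = ('<font face="Arial" size="5">DITA </font>'
              '<font face="Arial" size="6">1</font>&nbsp;draft')
    print(strip_font_and_nbsp(sample))
```

Only the tags are removed; the text between them is kept as it is.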
redundant links
and one more - i was running into the issue that a link like <a href="http://blah.com">http://blah.com</a> ... is getting turned into [[http://blah.com|http://blah.com]], which is slightly redundant. i successfully added this line to the bottom of my ROEPatterns to reduce it even further:
# convert [[http://blam.com|http://blam.com]] to http://blam.com
,'#\[\[(http[^\|]+)\s*\|\s*\1\]\]#i' => '$1'
thanks again! overtones99 August 26, 2009, at 01:28 AM
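For anyone wanting to check what the redundant-link pattern above does and does not touch, here is a small sketch in Python. It assumes the expression carries over unchanged from PCRE to Python's `re` module (PHP's `$1` becomes `\1` in the replacement); `collapse_redundant_links` is a hypothetical helper name.

```python
import re

# Same expression as the ROEPatterns entry above; \1 inside the pattern
# requires the link text to repeat the captured URL.
REDUNDANT_LINK = re.compile(r"\[\[(http[^\|]+)\s*\|\s*\1\]\]", re.IGNORECASE)


def collapse_redundant_links(markup: str) -> str:
    """Replace [[url|url]] with the bare url; leave other links alone."""
    return REDUNDANT_LINK.sub(r"\1", markup)


if __name__ == "__main__":
    print(collapse_redundant_links("[[http://blah.com|http://blah.com]]"))
    print(collapse_redundant_links("[[http://blah.com|Blah]]"))
```

Links whose text differs from the target are left as they are, because the backreference only matches an exact repeat of the URL.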
converting annoying tabs...
sorry - one more - it may just be a result of my own crappy first-timer html coding efforts from several years ago, but i'm getting TONS of tabs everywhere in my output. i've found that adding the following very simple line is indispensable in my scenario:
'#\t#i' => "", # get rid of weird tabbing
overtones99 August 25, 2009, at 03:31 PM
Archived comments
"title" error, additional tags
In the title pattern seems to be an error: "\*s" should be "\s*".
<HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> should be removed.
What about <FONT> tags? IMO annoying, shouldn't they be removed?
What about converting character entities (e.g. "Umlauts") to searchable characters? Strings containing "&uuml;" etc. are not searchable! Not an easy task because it might have unwanted side effects and it should respect the character set in use. Maybe it should be done only when found inside <HTML></HTML>.
OliverBetz 2010-01-24
$ROEPatterns['#</?font([^>]*)>#i'] = '';
converting underlines
hi eemeli. i just noticed that underlines <u> aren't getting converted. i added the following line to my $ROEPatterns:
'#<u>(.*?)</u>#i' => "{+$1+}",
thanks. overtones99 October 02, 2009, at 09:18 PM
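A quick way to see what the underline pattern above catches is to run it outside PmWiki. This is a sketch in Python under the assumption that `re` matches PCRE for this expression; `convert_underline` is a hypothetical name.

```python
import re

# Same expression as the ROEPatterns entry; there is no 's' modifier,
# so '.' does not match a newline.
UNDERLINE = re.compile(r"<u>(.*?)</u>", re.IGNORECASE)


def convert_underline(html: str) -> str:
    """Turn <u>text</u> into PmWiki's {+text+} markup."""
    return UNDERLINE.sub(r"{+\1+}", html)


if __name__ == "__main__":
    print(convert_underline("<u>hello</u> and <U>bye</U>"))
    print(convert_underline("<u>split\nover lines</u>"))
```

Underlined text that spans a line break is left unconverted by this pattern; adding the `s` modifier (`re.DOTALL` here) would be one way to change that.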
current content of convert-html file
The title field in the current convert-html file is
Ubuntu Edgy on the Apple Macbook
and the file contents have nothing to do with html2pmwiki markup conversion
Jean-Pierre Chrétien 2010-01-08
Thanks for letting us know, now fixed (copied from convert-html-2009-08-25.php). --Petko February 09, 2010, at 07:38 AM
Update of today?
Hi Eemeli, the convert-html.php script was today uploaded again with no author specified and without further information. Spammed or correct version? -- SchreyP January 19, 2010, at 05:05 PM
converting links with '_blank' to %newwin%[[url|text]]
hi. this works great. however, i've found that adding the following line to the top of my ROEPatterns is a must-have for my setup - maybe it is for others too?
# add %newwin% before links with _blank
'#<a\s[^>]*\bhref=([\'"])([^\'"]*?)\1[^>]*_blank[^>]*>(.*?)</a>#is' => "%newwin%[[$2|$3]]",
overtones99 August 25, 2009, at 04:58 AM
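The _blank pattern above can be exercised on its own to see which anchors it picks up. This is a sketch in Python, assuming the expression behaves the same under `re` as under PCRE (`#is` becomes `re.IGNORECASE | re.DOTALL`, and `$2`/`$3` become `\2`/`\3`); `convert_blank_links` is a made-up name.

```python
import re

# Same expression as the ROEPatterns entry: group 1 is the quote character,
# group 2 the href value, group 3 the link text.
BLANK_LINK = re.compile(
    r"""<a\s[^>]*\bhref=(['"])([^'"]*?)\1[^>]*_blank[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def convert_blank_links(html: str) -> str:
    """Rewrite anchors that mention _blank as %newwin%[[url|text]]."""
    return BLANK_LINK.sub(r"%newwin%[[\2|\3]]", html)


if __name__ == "__main__":
    print(convert_blank_links('<a href="http://blah.com" target="_blank">Blah</a>'))
    print(convert_blank_links('<a href="http://blah.com">Blah</a>'))
```

Anchors without _blank are not touched here, so they remain for the recipe's ordinary link conversion, which is presumably why the line belongs at the top of the pattern list.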
Thanks! the function works great! in fact, it also solves another problem i was having, where links without "" (ie. <a href=http://blah.com>, as opposed to <a href="http://blah.com">) weren't getting converted - but now they are! thanks! overtones99 August 25, 2009, at 03:31 PM
I followed the steps mentioned herein. The cookbook doesn't seem to work. I still see plain HTML code as the output.
I found this recipe really useful, and can fully recommend it. It saved me a lot of time.
But there are a couple of minor things to note:
<a href="Two#two">second</a> <a href="#three">third</a>
incorrectly gives
[[Attach:Two#two|second]] [[Attach:#three|third]]
should give
[[Two#two|second]] [[#three|third]]
-- simon
For <a href="#three">third</a>, yes, the result is wrong; for <a href="Two#two">second</a> I'm not so sure. Also note that <a href="Two">second</a> currently gives [[Attach:Two|second]]. For a quick fix, change the last parenthesized part of the regular expression on line 58 of convert-html.php to ([^/:\'"\#]+?), ie. add '\#' to the character class. --Eemeli Aro February 19, 2009, at 05:15 PM
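To illustrate why adding '#' to the character class helps, here is a sketch in Python. The surrounding expression is a HYPOTHETICAL stand-in, not the actual regular expression from line 58 of the recipe; only the final group mirrors the fix described above.

```python
import re

# HYPOTHETICAL patterns for illustration only: an href whose value contains
# no '/', ':' or quote (BEFORE), and the same with '#' also excluded (AFTER).
BEFORE = re.compile(r"""href=(['"])([^/:'"]+?)\1""")
AFTER = re.compile(r"""href=(['"])([^/:'"#]+?)\1""")


def is_attach_candidate(pattern: re.Pattern, tag: str) -> bool:
    """Return True if the pattern would treat the href as a bare file name."""
    return pattern.search(tag) is not None


if __name__ == "__main__":
    for tag in ('<a href="Two#two">', '<a href="#three">', '<a href="Two">'):
        print(tag, is_attach_candidate(BEFORE, tag), is_attach_candidate(AFTER, tag))
```

With '#' excluded, the two anchor links no longer qualify, while a plain href="Two" still does, which matches the remaining doubt expressed in the reply above.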
Attach:, both links are clearly source anchors linking to a destination anchor within a page. -- simon
[[Attach:Two#two|second]] [[Attach:#three|third]]
Note
- does not convert <form to (:input form ...
- does not convert <input to (:input ...
Talk page for the ConvertHTML recipe (users).