00168: utf-8 encoding doesn't work for [[PITS/links]]

Summary: utf-8 encoding doesn't work for links
Created: 2004-11-19 12:08
Status: Closed - fixed for 2.0.beta14
Category:
Assigned:
Priority: 5
Version: 2
OS:

Description (please click "edit page" to see the actual content; I am new to PmWiki and don't know how to enter some characters):

A link that contains UTF-8 encoded characters is unusable; it fails with a 'Page not found' error.

Example:
doesn't work (Romanian characters): Example Link ??âî??
works (no special characters): Example Link saait?

Wikipedia uses a trick: the links look like this in HTML: <a href="/wiki/%C5%9Etiin%C5%A3ele_p%C4%83m%C3%A2ntului" title="Ştiinţele pământului">Ştiinţele pământului</a>

so the browser shows the native language link, while it points to a converted one.
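
To illustrate the Wikipedia-style trick described above, here is a minimal sketch in Python (not PmWiki's own PHP). The function name `make_link` and the `/wiki/` base path are assumptions made for this example only. The visible text stays in the native language, while the href carries the percent-encoded UTF-8 bytes.

```python
from html import escape
from urllib.parse import quote


def make_link(page_title: str, base: str = "/wiki/") -> str:
    """Build an anchor whose href is percent-encoded but whose text is readable.

    `make_link` and `base` are hypothetical names used only for this sketch.
    """
    # Spaces become underscores, then every non-ASCII character is replaced
    # by the percent-encoded form of its UTF-8 bytes.
    href = base + quote(page_title.replace(" ", "_"), safe="")
    # The title and link text are only HTML-escaped, so they stay readable.
    text = escape(page_title)
    return f'<a href="{href}" title="{text}">{text}</a>'


link = make_link("Ştiinţele pământului")
print(link)
```

The browser displays the readable title, and the server receives a plain ASCII URL that it can decode back to the original page name.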


Getting PmWiki to work with utf-8 is non-trivial, as (1) the PHP installation has to have been compiled with unicode support for PCRE, (2) many (but not all) regular expressions in the pmwiki software have to have the '/u' option added to them to properly recognize utf-8 pagenames, and (3) the URL links that are generated have to be adjusted properly. It's on the list of things to do, but until PmWiki 2.0 is a bit more stable this is a suspended item. --Pm
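
Point (2) can be illustrated with a rough analogy in Python; this is not PmWiki code and not PHP's PCRE. A byte-oriented pattern behaves roughly like a PCRE expression without '/u', and a text pattern behaves roughly like one with it. The pattern names and the sample page name are assumptions chosen for this sketch.

```python
import re

# Byte-oriented pattern: only ASCII letters and digits count as name characters,
# so the multi-byte UTF-8 sequences split the name apart.
BYTE_NAME = re.compile(rb"[A-Za-z0-9]+")

# Text pattern: \w is Unicode-aware for str patterns in Python 3,
# so accented letters are treated as part of the name.
TEXT_NAME = re.compile(r"\w+")

name = "pământului"            # sample page name, chosen for illustration
raw = name.encode("utf-8")     # the bytes a byte-oriented matcher would see

byte_parts = BYTE_NAME.findall(raw)
text_parts = TEXT_NAME.findall(name)

print(byte_parts)
print(text_parts)
```

The byte-oriented match breaks the page name into fragments at each non-ASCII letter, which is the kind of failure that makes a UTF-8 page name unrecognizable.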


I did some work today to see if I can get PmWiki 2.0 to work with utf-8 charsets and encoding. The UTF8 group has utf-8 enabled; please use this group to test different characters and page names. If it works sufficiently well I can include this in the next PmWiki release. --Pm


UTF8/UTF8 seems to have a bug with links - $Group is appearing in links :/


Yes, apparently PHP (or at least the version I'm running) isn't able to distinguish UTF-8 letters from other characters using preg_match and preg_replace. So I may have to re-think that whole section of the code. --Pm


Seems to be fixed now with my latest changes. I'm closing this PITS for now -- feel free to reopen it if there are further problems.


Some links that contain UTF-8 characters (page title, ?action=edit, ?action=diff, ?action=print) are not passed through urlencode(). Encoding every link that contains UTF-8 characters would be more compatible.

sample: <a href='http://www.pmwiki.org/wiki/UTF8/urlencode(utf8-char)'>utf8-char</a>

--shouda


Easy to say, not so easy to do -- it's not always possible for PmWiki to detect when something is a link to be displayed (don't urlencode) versus a link to be used by a browser to jump to another page (urlencode). PmWiki has to know the context of the URL in order to know how to encode it, thus there's no central place that I can use to "encode all links".

Best for now is to simply provide examples where PmWiki isn't generating links correctly and I'll fix them. --Pm
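
The distinction between the two contexts can be sketched as follows, in Python rather than PmWiki's PHP. The helper names `url_for_display` and `url_for_href` are hypothetical and exist only in this example; the point is that the caller has to pick one based on where the URL will be used.

```python
from html import escape
from urllib.parse import quote


def url_for_display(path: str) -> str:
    """URL shown to a reader: HTML-escape it, but keep UTF-8 characters readable."""
    return escape(path)


def url_for_href(path: str) -> str:
    """URL followed by a browser: percent-encode non-ASCII bytes, then HTML-escape.

    The characters in `safe` are left alone so the path and query structure survive.
    """
    return escape(quote(path, safe="/?=&"))


page = "/wiki/UTF8/中文?action=edit"
print(url_for_display(page))
print(url_for_href(page))
```

Because the same string needs different treatment in each context, a single central "encode all links" step would produce wrong output for one of them.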


page: UTF8.UTF8
-> line 123, 124, 125: link with "?action=edit"

page: http://www.pmwiki.org/wiki/UTF8/%e4%b8%ad%e6%96%87
-> line 106: pagetitle link
-> line 110: link with "?action=print"
-> line 111: link with "?action=diff"
-> line 112: link with "?action=edit"

page: UTF8.RecentChanges
-> line 116: link with "?action=edit"
--shouda


I think this is fixed for 2.0.beta14. --Pm

--- I tried to enter "Happy New Year" in Chinese (UTF-8): ???? It looks perfect while I edit the page, but after I save it, it shows ??. What did I miss? Sorry. If I copy the test from the above-mentioned UTF8 group, it does not appear correctly here: Chinese test:

UTF-8 isn't enabled in this page, it's only enabled in the UTF8 group. --Pm

???? (Wikipedia was mentioned earlier, but it also won't work with Japanese.)


Another Unicode problem is that I can create good pages in Japanese, but when I try to move the "wiki.d" folder to another server, the text gets corrupted. At first I thought that I might have uploaded the files incorrectly, but on the server the text in the files appears to be intact. (Note that this happened when I had a local copy on a Macintosh and then wanted to put it on a UNIX Apache server.) Downloading from the UNIX Apache server and uploading to another account on the same server caused no problems.


As far as I have experimented with this, it seems that the files are saved with exactly the bytes you send as the filename; the only problem arises when you come back to view or list the file. The filename may be interpreted with the wrong encoding, so you will see a different representation of the bytes you used to save the file. The filename will show some or many weird characters, but PmWiki will continue to work just fine, and the contents of the file will not change. At least that is so in my case, using OpenBSD + KDE + LANG="".

CarlosAB December 10, 2008, at 09:00 AM
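
The behaviour described in the comment above can be reproduced in a few lines of Python. This is a sketch under assumptions: the page name is made up, and `listed_name` is a hypothetical helper standing in for whatever tool lists the directory. The stored bytes never change; only the decoding used to display them differs.

```python
def listed_name(raw: bytes, assumed_encoding: str) -> str:
    """Decode filename bytes the way a listing tool would under a given encoding."""
    return raw.decode(assumed_encoding, errors="replace")


page = "日本語"                      # hypothetical page name
stored = page.encode("utf-8")        # the bytes written as the filename

right = listed_name(stored, "utf-8")    # readable: same as the original name
wrong = listed_name(stored, "latin-1")  # same bytes, shown as unrelated characters

print(right)
print(repr(wrong))
```

Re-encoding the garbled listing with the encoding that was wrongly assumed gives back the original bytes, which is consistent with the wiki continuing to work even though the listing looks wrong.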