UTF8Conv

Summary: A recipe to help with the conversion of different encodings to UTF-8
Version: 2008-12-29
Prerequisites: Tested with PmWiki 2.2.0-beta68, but will probably work with all versions
Status: alpha, abandoned
Maintainer: CarlosAB
Discussion: UTF8Conv-Talk

This recipe was abandoned because it is not PmWiki-specific and because there are better examples out there.

Here are some examples:

https://github.com/splitbrain/dokuwiki/blob/204d9c533d983cce1a75f12ef218a92b01961d46/inc/utf8.php

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/normal/

Description

This recipe converts user input to UTF-8 from any encoding supported by the PHP mbstring extension, and from some encodings via iconv.

Notes

I was having problems with mixed encodings in page names and text, so I decided to have a UTF-8-only PmWiki site. With this recipe it doesn't matter what encoding is used for a page name or for user-submitted text: it will be converted to UTF-8, as long as you pass the input through this recipe.

The recipe runs a loop along the lines of while(!utf8_compliant($str)){utf8conv($str)}, which repeats until $str is valid UTF-8 or the limit $Utf8ConvMaxRuns is reached. It tries several ways of converting $str to UTF-8, primarily the mbstring and iconv functions. If PHP doesn't have the mbstring extension, it falls back to iconv. You can also use another recipe, HEBCI, to detect which encoding the user's browser is using.
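The bounded retry loop described above can be sketched roughly as follows. This is not the code shipped in utf8conv.php: the function names sketch_is_utf8() and sketch_to_utf8(), the candidate encoding list, and the limit of 5 are all assumptions made for illustration. Only the variable name $Utf8ConvMaxRuns comes from the recipe.

```php
<?php
// Illustrative sketch only; this is NOT the code shipped in utf8conv.php.
// sketch_is_utf8() and sketch_to_utf8() are hypothetical names.

// Stand-in for the recipe's utf8_compliant(): with the /u modifier,
// preg_match() fails (returns false) when the subject is not valid UTF-8.
function sketch_is_utf8($str) {
    return preg_match('//u', $str) === 1;
}

// Try each candidate source encoding in turn until the string is valid
// UTF-8, the candidates run out, or $maxRuns attempts have been made.
function sketch_to_utf8($str, array $candidates, $maxRuns) {
    $runs = 0;
    while (!sketch_is_utf8($str) && $runs < $maxRuns && $runs < count($candidates)) {
        $from = $candidates[$runs];
        if (function_exists('mb_convert_encoding')) {
            $converted = mb_convert_encoding($str, 'UTF-8', $from);
        } else {
            // Fallback when mbstring is missing; iconv() returns false on failure.
            $converted = iconv($from, 'UTF-8', $str);
        }
        if (is_string($converted)) {
            $str = $converted;
        }
        $runs++;
    }
    return $str;
}

$Utf8ConvMaxRuns = 5; // variable name from the recipe; the value is an assumption
$latin1 = "caf\xE9";  // "café" encoded as ISO-8859-1, which is invalid UTF-8
$utf8 = sketch_to_utf8($latin1, array('ISO-8859-1', 'Windows-1252'), $Utf8ConvMaxRuns);
echo bin2hex($utf8), "\n"; // prints 636166c3a9
```

The run limit matters because a conversion attempt can fail to produce valid UTF-8, and without it the loop would never end on such input.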

Some functions were taken from the excellent package called php-utf8 from Henri Sivonen.

If a character cannot be converted to UTF-8, it will be stripped from the page name AND the content. You have been warned!
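To give an idea of what stripping means in practice, here is a minimal sketch that drops bytes that are not valid UTF-8. It assumes the mbstring extension is available, and the helper name sketch_strip_invalid_utf8() is hypothetical; the recipe's own stripping function may work quite differently.

```php
<?php
// Hypothetical helper, not taken from utf8conv.php.
// Assumes the mbstring extension is loaded.
function sketch_strip_invalid_utf8($str) {
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return $clean;
}

// "\xFF" can never appear in valid UTF-8, so it is removed from the result.
echo bin2hex(sketch_strip_invalid_utf8("a\xFFb")), "\n";
```

Note that the stripped bytes are lost for good, which is why the warning above applies to both page names and page text.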

Release Notes

In local/config.php, load the recipe as usual from the previously unzipped utf8conv.php in ./cookbook/, with an include_once("$FarmD/cookbook/utf8conv.php");.
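As a fragment of local/config.php, the line would look like this. The path assumes that utf8conv.php was saved in the cookbook/ directory under $FarmD, which PmWiki sets itself.

```php
<?php
// local/config.php (fragment)
// Assumes utf8conv.php was unzipped into the cookbook/ directory.
include_once("$FarmD/cookbook/utf8conv.php");
```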

Download

DOWNLOAD: Attach:utf8conv.php NEW

I hope this helps you keep your PmWiki site UTF-8 clean. If you have any suggestions, bug reports or improvements, please let me know.

Changelog

January 02, 2009, at 06:41 PM

  • changed the utf8bom var (wrong BOM for UTF-8)

January 05, 2009, at 08:58 AM

  • changed the BOM value for the utf8BOM var (it was wrong).
  • if a string is pure 7-bit ASCII, the recipe now does nothing, so you can still see the contents of the PmWiki folder and create pages in ASCII only.
  • removed function utf8_valid and replaced it with utf8_compliant (smaller and faster); this is fairly secure, as non-UTF-8 chars are removed by the CharStrip function.
  • changed the vars involved with MaxRuns a bit (just for readability).
  • included UTF-8 in $MbstringEncodings.
  • included SDVs, so different systems can benefit, for:
    $Utf8BOMInASCII - puts a BOM even in pure 7-bit ASCII
    $Utf8WithoutBOM - creates UTF-8 without a BOM
    $Utf8AlwaysRunConv - runs the conversion to UTF-8 even if a string is pure ASCII
  • implemented a better BrowserEncoding for use in UTF-8 conversion.
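
Because these variables are set with PmWiki's SDV(), which only assigns a default when the variable is not already set, any override has to come before the line in local/config.php that loads the recipe. A minimal sketch, assuming the variables are simple on/off switches (the values below are illustrative and were not checked against the recipe's code):

```php
<?php
// local/config.php (fragment). The values are assumptions for illustration.
$Utf8WithoutBOM = 1;     // create UTF-8 without a BOM
$Utf8AlwaysRunConv = 0;  // leave pure 7-bit ASCII strings untouched
// ...followed by the include_once line that loads utf8conv.php
```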

See Also

Cookbook.UTF-8 and PmWiki.UTF-8.

Contributors

CarlosAB December 29, 2008, at 06:49 AM

Comments

See discussion at UTF8Conv-Talk
