MigrateUTF8

Summary: Convert filenames of PmWiki pages and uploads to UTF-8
Version: 2024-07-14a
Prerequisites:
Status: experimental
Maintainer: Petko
License: GPL
Discussion: MigrateUTF8-Talk

How to migrate a wiki from an older 8-bit encoding to UTF-8?

Description

Convert filenames of PmWiki pages and uploads to UTF-8.

If your international wiki was installed several years ago, it is likely still using one of the older 8-bit encodings (character sets).

PmWiki is able to automatically recode text and metadata from page files in a different encoding, but cannot currently recover page filenames or attachment filenames with international characters.

The recipe will scan your wiki.d and uploads directories and show how individual words would be recoded. It will then list all page files and attachments whose filenames would be recoded.

You can review the recoded strings and filenames, and if something is incorrect, you can select a different source encoding from a drop-down and preview that.
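The effect of choosing a different source encoding can be illustrated with a minimal sketch. It is not taken from the recipe's source: the function name preview_recoding() is hypothetical, and it assumes the PHP mbstring extension is available. It converts the same bytes under several candidate encodings so that the results can be compared by eye.

```php
<?php
// Hypothetical preview helper, not the recipe's actual code.
// Assumes the mbstring extension is loaded.
function preview_recoding(string $bytes, array $encodings): array {
  $out = [];
  foreach ($encodings as $enc) {
    // Interpret the same bytes as $enc and convert them to UTF-8.
    $out[$enc] = mb_convert_encoding($bytes, 'UTF-8', $enc);
  }
  return $out;
}

// The byte 0xE9 is "é" in ISO-8859-1 but "й" in Windows-1251 (Cyrillic).
print_r(preview_recoding("Fert\xe9", ['ISO-8859-1', 'Windows-1251']));
```

Only one of the candidates produces a readable word, which is why the preview should be checked before launching the recoding.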

If you are satisfied, you can then launch the actual recoding.

The semi-automated function will perform these steps:

  1. Temporary directories will be created for wiki page files (wiki.d-temp-utf8) and attachments (uploads-temp-utf8).
  2. $PageIndexFile will be recoded to UTF-8.
  3. All page files and attachments will be moved to the temporary directories, recoding their filenames where necessary. PmWiki cache files are skipped since they would be invalid after the migration.
    • If the processing takes more than 10 seconds, it will stop, and you will need to refresh the page to continue from where it stopped. This prevents the PHP process from hitting system limits and crashing.
  4. The old directories will be renamed to "wiki.d,old" and "uploads,old" and the temporary directories will be moved to their names.
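The 10-second limit in step 3 can be pictured as a time-budgeted loop. The following is only a sketch under assumptions: process_batch() and its parameters are hypothetical names, not the recipe's actual functions. It handles pending items until the budget is used up and returns whatever is left for the next request.

```php
<?php
// Hypothetical sketch of a time-budgeted batch; names are illustrative only.
function process_batch(array $pending, callable $handle, float $budget = 10.0): array {
  $start = microtime(true);
  while ($pending) {
    // Stop cleanly once the time budget is spent.
    if (microtime(true) - $start >= $budget) break;
    // Take the next item off the queue and process it.
    $handle(array_shift($pending));
  }
  // Items still to be processed on the next request.
  return $pending;
}

// Usage: "move" three files within the default budget.
$moved = [];
$left = process_batch(['a', 'b', 'c'], function ($f) use (&$moved) { $moved[] = $f; });
echo count($moved), " processed, ", count($left), " remaining\n";
```

Because each item is removed from the queue only when it is handled, stopping between items leaves a consistent list to resume from.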

Prerequisites and notes

You should really make a backup of your wiki and keep it away from the web server. If you (or I) make a mistake, the recoded page files may be unrecoverable, and the backup may be the only way to save your wiki.

The recipe has not been tested on Mac OS X filesystems, which encode/normalize international characters in special ways. I don't have a way to test this, so I don't know whether or how it will work.

This probably requires a recent PmWiki version.

PHP 7.0 or more recent is strongly recommended, because it supports UTF-8 better.

Only standard or mostly standard page store directories, and upload directories, are supported. The default $WorkDir, or one with PerGroupSubDirectories, or variants of these, or classes that store pages in individual files, should work. Other page store classes like SQLite are not supported at this time. In fact, the recipe will just recursively recode all filenames from wiki.d and uploads that contain international characters, without actually reading the page content (except for $PageIndexFile).

It will recode the filenames of all files in $WorkDir, including .htaccess, but will not recode the content of the files (except $PageIndexFile). If you have other recipes that store data files in $WorkDir or $UploadDir, their content will still be in the old character encoding, and when the other recipe retrieves it, it may appear broken. One example is Worse, which caches the page titles and attachment filenames.
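To spot data files whose content is still in the old encoding, something like the following sketch may help. It is not part of the recipe: find_non_utf8_files() is a hypothetical helper, and it assumes the mbstring extension. It will also report binary attachments and page files that have not been rewritten yet (PmWiki recodes those itself), so it is best pointed at the data directory of a specific recipe.

```php
<?php
// Hypothetical helper, not part of the recipe: list files under $dir
// whose content is not valid UTF-8. Assumes the mbstring extension.
function find_non_utf8_files(string $dir): array {
  $found = [];
  $it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS));
  foreach ($it as $file) {
    if (!$file->isFile()) continue;
    $data = file_get_contents($file->getPathname());
    // Pure ASCII content is valid UTF-8 and is therefore not reported.
    if ($data !== false && !mb_check_encoding($data, 'UTF-8')) {
      $found[] = $file->getPathname();
    }
  }
  sort($found);
  return $found;
}
```

Files reported this way would need to be converted or regenerated by the recipe that owns them.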

Installation and usage

1. Place migrateutf8.php in your pmwiki/cookbook directory.

2. Add near the top of local/config.php:

$EnableReadOnly = 1;
include_once("$FarmD/scripts/xlpage-utf-8.php");

if($action == 'migr8') {
  include_once("$FarmD/cookbook/migrateutf8.php");
}

3. Make a backup snapshot of your entire wiki and move it off the server, say to your home computer or a USB drive.

4. On your wiki, open Page?action=migr8

5. Follow the instructions and review the individual words, and the lists of filenames to be recoded.

Sample preview output:

This function will migrate a wiki content from an old 8-bit encoding to UTF-8.

The semi-automated function will perform these steps:

  1. Temporary directories will be created for wiki page files (wiki.d-temp-utf8) and attachments (uploads-temp-utf8).
  2. wiki.d/.pageindex will be recoded to UTF-8.
  3. All page files and attachments will be moved to the temporary directories, recoding their filenames where necessary. PmWiki cache files are skipped.
    • If the processing takes more than 10 seconds, it will stop, and you will need to repost the form below to continue from where it stopped.
  4. The old directories will be renamed to "wiki.d,old" and "uploads,old" and the temporary directories will be moved to their names.

Note, this only recodes the directory names, and filenames of page files and attachments. Normally PmWiki should recode page content automatically (text, metadata, and history).

Preview recoding of filenames

Please preview unique words to be replaced:

Old             New
A\xeen\xe9e     Aînée
\xc0Bient\xf4t  ÀBientôt
\xc2ge          Âge
Ch\xe2taigne    Châtaigne
Cr\xeaperie     Crêperie
D\xe9j\xe0Vu    DéjàVu
\xc9l\xe9phant  Éléphant
Fert\xe9        Ferté
G\xe2teau       Gâteau
P\xe2t\xe9      Pâté
Tr\xe8sBien     TrèsBien
voil\xe0        voilà

6 new directories to be created:

  1. uploads-temp-utf8/Ferté
  2. wiki.d-temp-utf8
  3. wiki.d-temp-utf8/Ferté
  4. wiki.d-temp-utf8/Main
  5. wiki.d-temp-utf8/Site
  6. wiki.d-temp-utf8/SiteAdmin

14 files with international filenames to be renamed (recoded):

Old                                      New
uploads/Fert\xe9/voil\xe0.jpg            uploads-temp-utf8/Ferté/voilà.jpg
wiki.d/Fert\xe9/.htaccess                wiki.d-temp-utf8/Ferté/.htaccess
wiki.d/Fert\xe9/Fert\xe9.\xc0Bient\xf4t  wiki.d-temp-utf8/Ferté/Ferté.ÀBientôt
wiki.d/Fert\xe9/Fert\xe9.\xc2ge          wiki.d-temp-utf8/Ferté/Ferté.Âge
wiki.d/Fert\xe9/Fert\xe9.\xc9l\xe9phant  wiki.d-temp-utf8/Ferté/Ferté.Éléphant
wiki.d/Fert\xe9/Fert\xe9.A\xeen\xe9e     wiki.d-temp-utf8/Ferté/Ferté.Aînée
wiki.d/Fert\xe9/Fert\xe9.Ch\xe2taigne    wiki.d-temp-utf8/Ferté/Ferté.Châtaigne
wiki.d/Fert\xe9/Fert\xe9.Cr\xeaperie     wiki.d-temp-utf8/Ferté/Ferté.Crêperie
wiki.d/Fert\xe9/Fert\xe9.D\xe9j\xe0Vu    wiki.d-temp-utf8/Ferté/Ferté.DéjàVu
wiki.d/Fert\xe9/Fert\xe9.Fert\xe9        wiki.d-temp-utf8/Ferté/Ferté.Ferté
wiki.d/Fert\xe9/Fert\xe9.G\xe2teau       wiki.d-temp-utf8/Ferté/Ferté.Gâteau
wiki.d/Fert\xe9/Fert\xe9.P\xe2t\xe9      wiki.d-temp-utf8/Ferté/Ferté.Pâté
wiki.d/Fert\xe9/Fert\xe9.RecentChanges   wiki.d-temp-utf8/Ferté/Ferté.RecentChanges
wiki.d/Fert\xe9/Fert\xe9.Tr\xe8sBien     wiki.d-temp-utf8/Ferté/Ferté.TrèsBien

11 files with ASCII filenames to be moved (no recode):

Old                                New
wiki.d/.flock                      wiki.d-temp-utf8/.flock
wiki.d/.htaccess                   wiki.d-temp-utf8/.htaccess
wiki.d/.lastmod                    wiki.d-temp-utf8/.lastmod
wiki.d/Main/.htaccess              wiki.d-temp-utf8/Main/.htaccess
wiki.d/Main/Main.HomePage          wiki.d-temp-utf8/Main/Main.HomePage
wiki.d/Main/Main.RecentChanges     wiki.d-temp-utf8/Main/Main.RecentChanges
wiki.d/Site/.htaccess              wiki.d-temp-utf8/Site/.htaccess
wiki.d/Site/Site.AllRecentChanges  wiki.d-temp-utf8/Site/Site.AllRecentChanges
wiki.d/SiteAdmin.Status            wiki.d-temp-utf8/SiteAdmin.Status
wiki.d/SiteAdmin/.htaccess         wiki.d-temp-utf8/SiteAdmin/.htaccess
wiki.d/SiteAdmin/SiteAdmin.Status  wiki.d-temp-utf8/SiteAdmin/SiteAdmin.Status

Please preview the above replacements. If you see any errors, you should try different source encodings in the form below, and press "Preview", until you find the correct one.

6. If there are errors, select a different source encoding from the drop-down and press "Preview", then return to step 5.

Sample mockup form:

Control panel
Select source encoding:
 
If some or all of the above recoded samples are wrong, please try different encodings.

Please confirm these statements to continue:

Cookbook:MigrateUTF8

7. If there are no errors, check the 3 checkboxes, then press the button "Recode (cannot be undone)".

8. You will see all directories and filenames that were recoded and moved.

9. If your wiki has hundreds or thousands of pages, the processing will stop after 10 seconds and list the remaining files. You can then launch the recode process again, and it will continue from where it stopped.

10. At the end, it will rename the original wiki.d and uploads directories by adding the suffix ",old", and move the temporary directories into their place.

11. Browse the wiki, try downloading attachments, check if links to pages with international characters in the names work.

12. If you had uploaded attachments with international characters in the filenames like "voilà.jpg", you may need to enable the Unicode international character patterns. Something like this in config.php:

  $UploadNameChars = "-\\p{L}\\p{N}_. ";
  $MakeUploadNamePatterns = array(
    "/[^$UploadNameChars]/u" => '',
    '/(\\.[^.]*)$/' => 'cb_tolower',
    '/^[^_\\p{L}\\p{N}]+/u' => '',
    '/[^_\\p{L}\\p{N}]+$/u' => ''
  );

Here \\p{L}\\p{N} means any letter-like and number-like character in any language (available since PHP 7.0).

13. If everything is fine, remove or comment out the line $EnableReadOnly = 1;

Otherwise you may wish to restore the wiki from your backups and try again with a different encoding.

You can also contact me at MigrateUTF8-Talk or via email at 5ko [snail] 5ko [period] fr.
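To see what the upload name patterns from step 12 do to a filename, here is a standalone illustration. It is a sketch under assumptions, not PmWiki's own code: sketch_upload_name() is a hypothetical function, plain preg_replace() stands in for PmWiki's handling of $MakeUploadNamePatterns, and strtolower() stands in for the 'cb_tolower' callback. The input is assumed to be valid UTF-8.

```php
<?php
// Illustration only: mimics the four patterns with plain PCRE calls.
// Assumes $name is valid UTF-8 (preg_replace with /u fails otherwise).
function sketch_upload_name(string $name): string {
  $chars = "-\\p{L}\\p{N}_. ";
  // Drop every character that is not a letter, digit, "-", "_", "." or space.
  $name = preg_replace("/[^$chars]/u", '', $name);
  // Lowercase the extension (stand-in for the 'cb_tolower' callback).
  $name = preg_replace_callback('/(\\.[^.]*)$/',
    function ($m) { return strtolower($m[1]); }, $name);
  // Trim leading and trailing characters that are not letters, digits or "_".
  $name = preg_replace('/^[^_\\p{L}\\p{N}]+/u', '', $name);
  $name = preg_replace('/[^_\\p{L}\\p{N}]+$/u', '', $name);
  return $name;
}

echo sketch_upload_name("voilà.JPG"), "\n";    // voilà.jpg
echo sketch_upload_name("my file?.TXT"), "\n"; // my file.txt
```

The /u modifier is what makes \p{L} and \p{N} match whole UTF-8 characters rather than single bytes, so accented letters survive the cleanup.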

Configuration

See the source code for some configuration options. It is recommended to define these variables in config.php and not edit the script itself.

To do / some day / maybe

Change log / Release notes

  • 2024-07-14a : sort the preview lists.
  • 2024-07-14 : Server configuration files .htaccess are now copied instead of moved (so that they still stay in the discarded directories). Add links to PmWiki:UTF-8 and MigrateUTF8-Talk.
  • 2024-07-13 : First public release, ready to be tested.

See also

Cookbook /
MigrateUTF8-Talk  Talk page for MigrateUTF8.
UTF-8  A collection of UTF-8 related tips and fixes
PmWiki /
UTF-8  Enabling UTF-8 Unicode language encoding in your wiki.

Contributors

Recipe written and maintained by Petko. Feel free to support my work on PmWiki core and recipe development and maintenance via this page.

Comments

See discussion at MigrateUTF8-Talk
