Unicode for the PHP programmers

Posted by Web Guru on July 2nd, 2008, 5:07 am

"Hello World" and nearly all the other examples found in popular PHP tutorials and references assume a restricted form of English for their "natural language" communications. But PHP is capable of more. With the right techniques, PHP effectively handles not just the occasional accented character found in English names and loanwords but the characters of the world's most common languages: German, Russian, Chinese, Japanese, and many more.

Run this small PHP program:
Coding Urdu output

    $q = "&#1662;&#1575;&#1705;&#1587;&#1578;&#1575;&#1606;";
    print html_entity_decode($q, ENT_NOQUOTES, 'UTF-8')."\n";

With any luck, the output you see will be پاکستان, which is Urdu for "Pakistan".
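To see what the decoded string actually holds, the following minimal sketch compares its byte count with its character count. It assumes the mbstring extension is loaded; the variable names are illustrative only and are not part of the original program.

```php
<?php
// The same numeric entities as in the program above.
$q = "&#1662;&#1575;&#1705;&#1587;&#1578;&#1575;&#1606;";
$s = html_entity_decode($q, ENT_NOQUOTES, 'UTF-8');

// strlen() counts bytes; mb_strlen() counts characters in the named encoding.
$bytes = strlen($s);
$chars = mb_strlen($s, 'UTF-8');

// Each of these seven letters occupies two bytes in UTF-8.
print "$bytes bytes, $chars characters\n";   // 14 bytes, 7 characters
print bin2hex($s)."\n";                       // the raw UTF-8 bytes as hex
```

The gap between the two counts is the reason the byte-oriented string functions need multibyte replacements later in the article.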

Too often, dealing in PHP with characters other than those of the standard English alphabet has been a matter of luck and even mystery. Although a great deal has been written on such subjects as character encoding and internationalization, much of it has been wrong, or at least outdated, and most of the rest is tied to a particular configuration of PHP. The aim of this article is to present only the basics of Unicode handling in PHP, but to do so with enough care and completeness to provide a firm foundation for any "international programming" you need to do.

This apparently simple two-line program involves a great deal of context. First, I assume PHP V5. While it's possible to manage non-English characters with PHP V4, it generally involves nonstandard extensions, and is almost certainly a misplaced effort in 2007. PHP V6, on the other hand, is scheduled to solve so many character encoding problems as to supersede most of the techniques shown here. With PHP V6, it's hoped that Unicode strings will just work.

Code format
The PHP source code in this article is designed to work for the great majority of developers. As much as possible, it applies to any standard PHP V5 installation.

To maintain a focus on the essentials, the source code is presented without enclosing <?php and ?> boilerplate tags. Output is most often targeted for text/plain.

Coding Urdu output, with more complete tagging

<?php
    // The next two lines are necessary only for
    // unusual configurations, but can only help.
    mb_language('uni');
    mb_internal_encoding('UTF-8');

    $q = "&#1662;&#1575;&#1705;&#1587;&#1578;&#1575;&#1606;";
    print "<html>".html_entity_decode($q, ENT_NOQUOTES, 'UTF-8')."</html>";
?>


Programming considerations
As mentioned, it's possible to make PHP work with Unicode in several ways, including extensions to PHP, different encodings, etc. Unless you're an expert, though, I recommend against trying to decide among these many possibilities. You're almost certain to achieve the best results if you focus on this single, consistent target:

* Explicit use of UTF-8, marked with
    o "mb_language('uni'); mb_internal_encoding('UTF-8');" at the top of your scripts
    o Content-type: text/html; charset=utf-8 in the HTTP header, by way of .htaccess, header() or Web server configuration
    o <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> and <form accept-charset="utf-8"> in HTML markup
    o CREATE DATABASE ... DEFAULT CHARACTER SET utf8 COLLATE utf8 ... ENGINE ... CHARSET=utf8 COLLATE=utf8_unicode_ci is a typical sequence for a MySQL instance, with comparable expressions for other databases
    o SET NAMES 'utf8' COLLATE 'utf8_unicode_ci' is a valuable directive for PHP to send MySQL immediately after connecting
    o In php.ini, assign default_charset = UTF-8
* Replacement of string functions, such as strlen and strtolower, with mb_strlen and mb_convert_case
* Replacement of mail and colleagues with mb_send_mail, etc.; while Unicode-aware e-mail is an advanced topic beyond the scope of this introduction, the use of mb_send_mail is a good starting point
* Use of multibyte regular expression functions
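A few of the items above can be sketched in one short script. This is only an illustration, assuming the mbstring extension is loaded; the guard around header() and the sample Russian word are my own additions, not part of the original recommendations.

```php
<?php
// Declare UTF-8 as the working encoding for the mb_* functions.
mb_language('uni');
mb_internal_encoding('UTF-8');

// The HTTP header only matters under a Web server, so skip it on the command line.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// An upper-case Russian word, built from numeric entities so the
// source file itself stays plain ASCII.
$upper = html_entity_decode("&#1055;&#1056;&#1048;&#1042;&#1045;&#1058;",
                            ENT_NOQUOTES, 'UTF-8');

// mb_convert_case() stands in for strtolower(), mb_strlen() for strlen().
$lower = mb_convert_case($upper, MB_CASE_LOWER, 'UTF-8');
$byteCount = strlen($upper);            // counts bytes
$charCount = mb_strlen($upper, 'UTF-8'); // counts characters

print "$upper -> $lower ($charCount characters, $byteCount bytes)\n";
```

Passing 'UTF-8' explicitly to each mb_* call is redundant once mb_internal_encoding() has been set, but it keeps the snippet correct even if it is pasted into a script that omits that setup.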

An ellipsis function I often use provides a small example of how to work with multibyte string functions. The original version of this function was close to:

Conventional truncation
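
    // Byte-oriented: strlen() and substr_replace() count bytes, not characters.
    function ell_truncate($string, $permitted_length) {
        if (strlen($string) <= $permitted_length)
            return $string;
        $ellipsis = "...";
        // Replace everything from the cut point to the end with the ellipsis.
        return substr_replace($string, $ellipsis,
                        $permitted_length - strlen($ellipsis));
    }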


Applied to the string "a long explanation" with a permitted length of 10, the result is "a long ...", while increasing the length to 30 returns the original string. This is handy for quick abbreviation of titles, for example.

Following is an illustration of a more Unicode-savvy solution.

Better ellipses

    function mb_ell_truncate($string, $permitted_length) {
        // Count characters, not bytes, and name the encoding explicitly
        // so the result does not depend on the PHP configuration.
        if (mb_strlen($string, 'UTF-8') <= $permitted_length)
            return $string;
        // U+2026 is the single-character horizontal ellipsis.
        $ellipsis = html_entity_decode("&#x2026;",
                             ENT_NOQUOTES, 'UTF-8');
        return mb_substr($string, 0,
                        $permitted_length -
                                 mb_strlen($ellipsis, 'UTF-8'),
                        'UTF-8').
               $ellipsis;
    }

    $q = "&#1047;&#1076;&#1088;&#1072;&#1074;".
         "&#1089;&#1090;&#1074;&#1091;&#1081;".
         "&#1090;&#1077;";
    $q = html_entity_decode($q, ENT_NOQUOTES,
                 'UTF-8');
    print mb_ell_truncate($q, 8)."\n";


This uses a standard typography for the ellipsis and correctly counts the characters of the string to abbreviate in all combinations of PHP configuration.
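To make the difference concrete, here is a small side-by-side sketch that runs a byte-based and a character-based truncation on the same Russian greeting. The function names byte_truncate and char_truncate are hypothetical, chosen so they do not clash with the functions above; mbstring is assumed to be loaded.

```php
<?php
// Hypothetical byte-based variant: cuts at a byte offset.
function byte_truncate($string, $permitted_length) {
    if (strlen($string) <= $permitted_length)
        return $string;
    $ellipsis = "...";
    return substr_replace($string, $ellipsis,
                    $permitted_length - strlen($ellipsis));
}

// Hypothetical character-based variant: cuts at a character offset.
function char_truncate($string, $permitted_length) {
    if (mb_strlen($string, 'UTF-8') <= $permitted_length)
        return $string;
    $ellipsis = html_entity_decode("&#x2026;", ENT_NOQUOTES, 'UTF-8');
    return mb_substr($string, 0,
                    $permitted_length - mb_strlen($ellipsis, 'UTF-8'),
                    'UTF-8').$ellipsis;
}

// Twelve Cyrillic letters, each two bytes long in UTF-8.
$q = html_entity_decode("&#1047;&#1076;&#1088;&#1072;&#1074;".
                        "&#1089;&#1090;&#1074;&#1091;&#1081;".
                        "&#1090;&#1077;", ENT_NOQUOTES, 'UTF-8');

$byteResult = byte_truncate($q, 8);
$charResult = char_truncate($q, 8);

// The byte-based cut lands in the middle of a two-byte character,
// so the result is no longer valid UTF-8.
var_dump(mb_check_encoding($byteResult, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($charResult, 'UTF-8'));  // bool(true)
var_dump(mb_strlen($charResult, 'UTF-8'));          // int(8)
```

The byte-based version still returns eight bytes, but a browser would show a replacement glyph where the split character used to be.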

All these items constitute only a starting point for Unicode programming. Plenty of larger challenges remain, including:

* Not all languages make the upper-case/lower-case distinction
* In many, "alphabetization" isn't meaningful, so sorting means something different from what it means in English
* The same two characters might sort in different orders depending on the language they're written in
* Security multiplies in complexity; what you see as "abc" might be completely different values from the usual English letters, which happen to be printed the same way
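The last point can be illustrated with a short sketch. The look-alike string below is my own example, not taken from the article: it replaces the Latin "a" and "c" with the Cyrillic letters U+0430 and U+0441, which most fonts draw identically. The mbstring extension is assumed.

```php
<?php
// Plain Latin letters.
$latin = "abc";

// Cyrillic U+0430, Latin "b", Cyrillic U+0441: prints like "abc" in most fonts.
$mixed = html_entity_decode("&#1072;b&#1089;", ENT_NOQUOTES, 'UTF-8');

// Same number of characters, but different bytes underneath.
var_dump($latin === $mixed);            // bool(false)
var_dump(mb_strlen($latin, 'UTF-8'));   // int(3)
var_dump(mb_strlen($mixed, 'UTF-8'));   // int(3)
var_dump(bin2hex($latin));              // string(6) "616263"
var_dump(bin2hex($mixed));              // string(10) "d0b062d181"
```

Comparing the hex dumps, rather than the printed text, is one simple way to notice such a substitution when debugging user names or URLs.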

These issues are shared by most Unicode-capable computing languages. The point of this article is to ensure that you understand the fundamentals in sufficient depth to have confidence to attack more advanced topics. Remember: If you're having to work hard or do tricky coding in handling Unicode, you're probably doing something wrong. PHP V5 and the tips above are designed to make your Unicode programming simple.