MultiLingual Computing, Inc., Magazine
Featured Article
Friday, July 30, 2010


Solving Compatibility Issues With Posix

This older locale model remains a viable alternative for internationalizing Win32 applications

BILL HALL


Several years ago when I was reviewing the Win32 API set to learn how locale is handled in Windows NT, I completely overlooked the venerable locale model provided by the Portable Operating System Interface Standard (Posix). In truth, I thought the Windows model was enough for any situation involving Microsoft platforms. But, more recently, I’ve been confronted with interoperability issues that have been most easily solved by using the functionality Posix can provide.

Along the way, I found out my ignorance was widely shared even by experienced UNIX developers. This was a bit of a surprise since on UNIX systems the Posix locale model provides the bulk of internationalization functionality. I suppose it is the same old problem of developers not being educated in school on any aspects of software internationalization. My guess is that their instructors didn’t know anything about the subject either and so just never taught it. That is a pity because the Posix locale is relatively easy to understand, and any ANSI C-compliant compiler supports it. Posix has also been adapted to some scripting languages such as Perl and Python, and you can see vestiges of its influence in Java and in the Windows locale implementation.


Chart of ASCII characters

So, how does the Posix locale model come into play in Windows? After all, Win32 platforms provide very extensive locale support with a rich set of functions to manage the typical locale issues such as formatting dates, times, currency, numbers and calendars, sorting text and supplying locale “bits”: what the list separator is in France, whether A4 paper is used in Germany, how year, month and day are ordered in Japanese dates and so on.

One reason that you find Posix in Windows is for interoperability and compatibility with other systems. It was probably only reluctantly added. When Windows NT first appeared on the scene, some degree of Posix compatibility was required for US government certification. But another good reason is to provide a C compiler that meets the ANSI standards for C and C++. Fortunately, although Microsoft’s implementation is not as rich as what can be found on many UNIX systems, it meets basic requirements and can be extremely useful in unexpected ways. Harley Rosnow, a Program Lead in the Microsoft FrontPage group, pointed out to me that since Posix also provides wide character support and because Unicode is the wide character set of Win32, Unicode(ly) challenged systems such as Windows 95/98/Me gain additional functionality over the limited number of Unicode type Win32 APIs in these systems. For example, the Unicode API CharUpperW, which maps a Unicode string to upper case, does not work in Windows 9X. It leaves the string unmodified. However, wcsupr, which is affected by the Posix locale setting, does work and provides a way to map a Unicode string to upper case within certain limits determined by the code page setting (this will make more sense further on when we define the Posix locale more precisely).
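The dependence of case mapping on the Posix locale can be tried with the standard wide-character routines. What follows is only a minimal sketch and is not Microsoft's wcsupr: the helper name upcase_wide is made up for illustration, and it relies on the standard towupper function from wctype.h, which consults the LC_CTYPE category of the current locale. In the default C locale, only the ASCII letters are certain to be mapped.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of any library): upper-case a wide
   string in place. towupper consults the LC_CTYPE category of the
   current Posix locale, so the result depends on the setlocale call
   made earlier in the program. */
wchar_t *upcase_wide(wchar_t *s)
{
    wchar_t *p;
    for (p = s; *p != L'\0'; p++)
        *p = (wchar_t)towupper((wint_t)*p);
    return s;
}
```

Characters that have no upper-case form in the current locale, such as digits and braces, pass through unchanged.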


Characters from Code Page 932

In this article, we will discuss the Posix model, mostly as it is realized in Win32 systems, using an actual example from operational code. Since the subject can be extensive, we will limit the discussion to practical matters so that you can grasp the basic ideas and, after some study on your own, make use of them fairly quickly. It is also important to understand the Microsoft implementation, which differs somewhat from what you find on Solaris and other UNIX systems. If you would like to see more about how Posix looks on UNIX systems, you should consult Creating Worldwide Software by Bill Tuthill and David Smallberg. This is an excellent general reference for things international as well. For details and examples on Win32 systems, see David A. Schmitt’s International Programming for Microsoft Windows. Microsoft Developer Network (MSDN) references can also be helpful, since they are the only way you will be able to understand which C-Library APIs are affected (but do see my comments further on). Finally, you can find lots of information on Posix, including man pages and vendor documentation, on the Web using simple search methods.

The Basics

The Posix locale mechanism allows programmers to deal with certain cultural issues in an application without requiring the programmer to know all the specifics about each country where the software is executed. The idea is that you continue to use your favorite set of programming APIs (within a limited subset), but by some kind of magic, they produce results that are correct for the locale where the code is running. The magic is brought about by a call to setlocale, which according to the parameters passed to it, changes the behavior of an extensive group of C-library functions. The functions themselves have no locale parameters, and there is no indication that they might produce results that differ from their original US English/ASCII character set based behavior, which is enshrined in the default Posix locale called “C.” The effect of using setlocale is global, and the mechanism is usually not thread safe. So, it can slow down operations if it is called often to switch from one locale to another, and Posix is best used in single locale operations. But this is a common situation and is often not a limitation.
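Because the setting is global, code that needs a particular locale only briefly usually saves the current setting, switches, does its work and switches back. Here is a minimal sketch of that pattern. The function name format_in_locale is invented for this illustration, and the locale names it accepts depend entirely on the platform; only "C" is certain to exist everywhere.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format value with two decimals under the named
   locale, then restore the previous LC_NUMERIC setting.
   Returns 1 on success, 0 if the locale could not be set. */
int format_in_locale(const char *locale_name, double value,
                     char *buf, size_t size)
{
    const char *current = setlocale(LC_NUMERIC, NULL);  /* query only */
    char *saved;
    int ok = 0;

    if (current == NULL)
        return 0;
    /* Copy the name: the returned string may be overwritten
       by the next call to setlocale. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) != NULL) {
        snprintf(buf, size, "%.2f", value);  /* printf family honors LC_NUMERIC */
        ok = 1;
    }
    setlocale(LC_NUMERIC, saved);            /* put the global state back */
    free(saved);
    return ok;
}
```

The save and restore steps are exactly the overhead mentioned above, and they do nothing to protect other threads that call locale-sensitive functions in the meantime.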

Contrast this model to the one seen in Win32 and Java where the programmer has access to APIs that explicitly require a locale identifier or instance of a Locale class. For example, GetDateFormat in Windows takes a Win32 Locale ID (LCID) as well as some other parameters to return a date string formatted according to the specified LCID. Similarly, the Java DateFormat class requires a Locale class when instantiated, and that governs its subsequent behavior. But neither of these operations has a global effect no matter what locale is being used. Nevertheless, Posix continues to be useful mainly because of portability and uniformity of approach. Maybe you would like to see an example from a problem found in some actual production code.

An Example

The issue was to replace ASCII curly braces { } by ASCII parentheses ( ) in a string before passing it to another API. The intention was to allow the user some flexibility in forming the string. However, since the receiving API did not understand braces, it was necessary to scan for and replace them by parentheses. The same routine was called from several different platforms and in several places. As soon as it was tried on a Japanese system, some strings worked and others failed mysteriously. When I was first presented with the problem, I asked what Japanese character was causing the problem, looked up its code point in a chart and knew right away what the problem was after examining its trail byte. It is not because I’m so clever; it is just that the same thing happens depressingly often. The original code, somewhat abstracted, looked like this:


char* ptr = string;   // point to the string
while (*ptr) {        // repeat until the end of the string
    if (*ptr == '{')        // look for a left brace
        *ptr = '(';         // found one, replace it with a left parenthesis
    else if (*ptr == '}')   // look for a right brace
        *ptr = ')';         // replace it with a right parenthesis
    ptr++;                  // move to the next 'character'
}

This is fine if the string is from a single-byte code page but can fail on multibyte strings. The usual problem is that the character pointer in the code is incremented only by one byte as the string is traversed in the loop and therefore can fall in the middle of a character that requires multiple bytes to represent it.

Pictures may help. First, in the chart of the ASCII characters (note Microsoft’s designation as 20127), note that the curly braces have code points 0x7B for { and 0x7D for }. Now, look at the set of characters with lead byte 0x89 from Code Page 932 (Japanese Shift-JIS).


Proper handling of multibyte characters

Run your eye across the line at 8970 until you reach columns 0B and 0D. If these values are added to the base 0x8970, the results are 0x897B and 0x897D; each represents two sequential bytes of which the second (the trail byte) just happens to be the ASCII code for { and }, respectively. As a result, these two kanji characters will be trashed by the code snippet. Here is why: first, the pointer hits the lead byte of 0x89 and skips over it since it is not equal to { (0x7B) or } (0x7D). On the next pass through the loop, the pointer is staring at the trail byte. If it has the same value as { or }, that value will get replaced by ( (0x28) or ) (0x29), respectively.


The wrong way to parse a multibyte string

A new character will thus have been created; in this case an illegal one because no trail byte in this code page lies below 0x40. In other instances of similarly bad code, a Japanese character can be transformed into another. No wonder the original code failed so miserably, not to mention mysteriously. After all, most of the characters go through the code unscathed. Of course, for each lead byte portion (which runs from 0x81 to 0x9F and 0xE0 to 0xFC), you will find two characters that will fail the same way when replacing braces with parentheses.
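To make the byte-level mechanics concrete, here is a small sketch that walks a Code Page 932 string with the lead-byte ranges just quoted hard-coded into it. It is an illustration only: the function names are invented, and a table wired to a single code page is not a portable solution.

```c
/* Lead-byte ranges for Code Page 932 as quoted in the text:
   0x81-0x9F and 0xE0-0xFC. */
static int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: replace ASCII braces with parentheses in a
   Code Page 932 string without touching trail bytes.
   Returns the number of braces replaced. */
int replace_braces_cp932(char *s)
{
    unsigned char *p = (unsigned char *)s;
    int replaced = 0;

    while (*p) {
        if (is_cp932_lead_byte(*p) && p[1] != 0) {
            p += 2;     /* skip the lead byte and its trail byte together */
            continue;
        }
        if (*p == '{') {
            *p = '(';
            replaced++;
        } else if (*p == '}') {
            *p = ')';
            replaced++;
        }
        p++;
    }
    return replaced;
}
```

Given the bytes 0x89 0x7B followed by a real left brace, only the real brace is changed. The trail byte 0x7B is never examined because the pointer steps over it together with its lead byte.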

I have included a vivid example of proper handling of multibyte characters. Compare the test string to the output string. If the parsing is done properly, the characters will be converted correctly. Note also that the Posix locale has been set to work with Code Page 932. You will see how this helps further on.

But if character boundaries are ignored, the code converts the original characters to ones that don’t exist in the code page, and the display puts up the font’s default symbol (•). Note that in this case the Posix locale is the default C locale.

It is easy to fix up the bad code in Windows. Just replace the last line, ptr++, which increments the pointer by one storage unit (in this case, a byte), with the line:

ptr = CharNext(ptr);

If the proper system locale has been set on the platform (in this case Japanese), then CharNext knows how to skip to the next character boundary. You can also use CharNextExA for direct control over the code page being analyzed since a code page can be specified explicitly (ptr = CharNextExA(932, ptr, 0)), and this will work even on a non-Japanese system. But neither API is portable to other platforms, and the client had to implement the solution in a uniform way. Here is what worked:


#include <stdlib.h>    // mblen, MB_CUR_MAX
#include <string.h>    // memmove

void replacebrk(const char *in, char *out)
{
    int inc;       // how much to increment the pointers
    while (*in) {
        inc = mblen(in, MB_CUR_MAX);    // let mblen calculate increment
        if (inc < 1)    // error, just bump up to next byte
            inc = 1;
        if (inc == 1 && *in == '{')         // look for a left brace
            *out = '(';    // replace with left parenthesis
        else if (inc == 1 && *in == '}')    // do the same for right brace
            *out = ')';
        else
            memmove(out, in, inc);    // copy any other character unchanged (safe if in == out)
        in += inc;  // increment pointers to next character boundary
        out += inc;
    }
    *out = 0;
}

The key API is the function mblen, which knows how many bytes are required to be added to the pointer to get to the next character boundary. But, as you can see, mblen has no explicit locale or code page setting; it only wants to know what the first character is in the string and the limit value (usually 2 on Windows systems but the value is system dependent). Rather it is the runtime Posix locale that governs the behavior of mblen. In the incorrect-parsing example it was the default Posix locale (called C), which does not provide the needed multibyte support. In the correct example, the locale was set to Japanese with Code Page 932 (Japanese_Japan.932). Since mblen is a part of the standard C-Library (it is defined in stdlib.h), the routine will be found on any ANSI-conforming compiler.
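The same mblen loop serves whenever characters rather than bytes have to be visited. As a further sketch (the function name count_chars is invented here), this routine counts the characters in a multibyte string according to whatever Posix locale is in force when it is called:

```c
#include <stdlib.h>

/* Hypothetical helper: count characters (not bytes) in a multibyte
   string, as defined by the LC_CTYPE category of the current Posix
   locale. Returns -1 if mblen reports an invalid sequence. */
long count_chars(const char *s)
{
    long count = 0;

    (void)mblen(NULL, 0);               /* reset any shift state */
    while (*s) {
        int n = mblen(s, MB_CUR_MAX);   /* bytes in the next character */
        if (n < 1)
            return -1;
        s += n;
        count++;
    }
    return count;
}
```

For a pure ASCII string the answer equals the byte count in every locale. For a Shift-JIS string the answer is right only if a matching locale was set beforehand, and how the default C locale treats bytes above 0x7F differs from one C library to another.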

Defining and Setting the Posix Locale in Windows

If you are a UNIX head, the rules for Posix in Windows will seem strange. The basics are the same: a string must be created in the proper format and passed to the basic locale function setlocale. The string has the form:

Locale_string = language_territory.code_page@variant_tag

But how the language, territory, code page and variant are defined is entirely up to the compiler manufacturer. Most UNIX systems use ISO-639 and ISO-3166 values for the language and territory, respectively. The variant tag is seldom seen. Typical examples are en_US.USASCII for English, United States, ASCII character set, and de_CH.iso8859-1 for German, Switzerland and the Latin 1 character set.

However, Windows uses an older but once-standard style of spelled-out languages and countries. You can see the string for Japanese, Japan, Shift-JIS encoding in the correct parsing example. A couple of other examples are German_Switzerland.1252 for the German speaking locale in Switzerland using the Windows Western European Code Page 1252, and Chinese_Hong Kong.950 for Hong Kong using traditional Chinese encoding. Later, I’ll show you a reliable way to create these strings in Windows. You don’t have to memorize them or look them up but can construct them from a Windows LCID.

After you have the string, you can set the locale explicitly in a command shell in UNIX by just using the LANG environment variable. In Windows, this does not work in a command shell. Programmatically on all systems, the call is

new_locale = setlocale(LC_XXX, locale_string);

You can also inherit the current locale (and this is what is most frequently done) by setting locale_string = "", the empty string. (Don’t use NULL. That is not the same as ""; that just returns the current setting.) In Windows, it is important to note that you get a Posix locale based on the current Windows User locale. It is not inherited from the Windows System locale. This is useful to know when testing since you can just change the user locale from the Regional Settings dialog without restarting on Windows NT/2000. A system locale setting requires a reboot, but you may have to do this to see the results displayed with the right characters.
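The difference between NULL and the empty string is easy to check. The two helpers below are a sketch with invented names: query_locale only reads the current setting, while inherit_locale asks the C library to adopt whatever the environment supplies (the user locale on Windows, variables such as LANG on UNIX).

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: copy the name of the current locale for a
   category into buf. Passing NULL to setlocale only queries the
   setting; nothing is changed. Returns 1 on success, 0 on failure. */
int query_locale(int category, char *buf, size_t size)
{
    const char *name = setlocale(category, NULL);
    if (name == NULL)
        return 0;
    snprintf(buf, size, "%s", name);
    return 1;
}

/* Hypothetical helper: adopt the locale supplied by the environment.
   The empty string means "inherit". Returns 1 if the request was
   honored, 0 if the C library rejected it. */
int inherit_locale(int category)
{
    return setlocale(category, "") != NULL;
}
```

Calling query_locale(LC_CTYPE, ...) at program startup reports C, because every C program begins in the C locale until setlocale is called with something else.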

You must be asking what LC_XXX is. Posix allows you to affect specific subsets of the APIs that are Posix aware or to set them all. The categories are explained briefly in the accompanying table “Functions Affected by setlocale” along with a few of the C-Library functions that are affected.

UNIX systems have an additional one called LC_MESSAGES. A program or utility having the appropriate message catalogues available can respond with messages in the appropriate language according to the Posix locale setting. You can experiment with this using the env and setenv commands if you have Linux, Solaris or another UNIX system at hand.

By the way, these same categories live on in a new guise in the C++ <locale> template classes as well; they are known as facets in this presumably richer but certainly much more confusing locale model. It is really something only geeks could love, and that will ensure a lack of interest in its use.


Functions Affected by setlocale
Category       Functions affected
LC_ALL         All categories
LC_COLLATE     Certain collation functions such as strcoll, wcscoll, strxfrm and so on
LC_CTYPE       The character-handling and mapping functions such as isalpha, toupper, isupper, wcstombs, mblen and so on
LC_MONETARY    Monetary-formatting information returned by the localeconv function
LC_NUMERIC     Decimal-point character for the formatted output routines such as printf, for data-conversion routines, and for the nonmonetary-formatting information returned by localeconv
LC_TIME        The strftime and wcsftime functions

The table of functions affected by setlocale can only give a hint as to the range of possible functions affected by the setlocale function. You must do lots of research and experimentation. Documentation can be contradictory, missing or wrong. You also have to watch out for platform differences. Posix only requires a certain minimal level of implementation, but many platforms have made up for deficiencies with additional but possibly nonportable support. On Windows, Microsoft has added a large number of C-Library APIs that are useful but nonstandard to support their bimodal character set and API model. To help you distinguish them, they are usually named with a leading underscore (_). For example, _mbclen (a Microsoft extension) is documented together with mblen, but the first is affected by the Windows System locale and the second by Posix.
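As one example of a category at work, LC_COLLATE governs strcoll, so a sort routine built on it changes its ordering when the locale changes, with no change to the code. The following is a sketch with invented function names; in the default C locale, strcoll orders strings exactly as strcmp does.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator that defers to strcoll, so the ordering follows
   the LC_COLLATE category of the current Posix locale. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = (const char *const *)a;
    const char *const *sb = (const char *const *)b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of strings in locale order. */
void sort_collated(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_collated);
}
```

In the C locale the strings "b", "a" and "C" sort as C, a, b because comparison is by code value. A locale with linguistic collation rules would normally place "C" after "b" instead.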

Posix Possibilities

Here is an example of another program to show you some of the possibilities when using Posix. The program was written for Windows, and the source code is available with this article. The user can select from the available Posix locales on the system, although we have restricted the code page to the default ANSI (Windows) code page. However, Windows supports the default OEM (Command shell) character set as well in its Posix implementation.

For each locale, the dialog shows a date and time formatted by the strftime function and the lconv structure returned by the localeconv API. On the upper right, you can find an edit box where you can type in a number using the decimal separator for that locale. As you type, the string you create is passed to sscanf to be parsed as a float. Normally, this kind of code would fail unless the decimal separator is a dot (.), but because the Posix locale has been set according to the selected value, sscanf knows to interpret the separator as the radix marker.
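The parsing step in the edit box amounts to little more than the following sketch. The function name parse_float is invented for this illustration, and the demo program's actual code may differ.

```c
#include <stdio.h>

/* Hypothetical helper: parse a float from text. sscanf takes the
   decimal separator from the LC_NUMERIC category of the current
   Posix locale. Returns 1 if a number was read, 0 otherwise. */
int parse_float(const char *text, float *out)
{
    return sscanf(text, "%f", out) == 1;
}
```

In the default C locale, "3.5" yields 3.5 while "3,5" stops at the comma and yields 3. Under a locale whose decimal separator is the comma, the second string is the one that parses completely.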


Posix locale dialog in Windows

One of the shortcomings of the Microsoft implementation of Posix is that there is not much support for numeric and currency formats. However, such support is not a part of the standard; all that is required is to provide the informational structure lconv, which can be retrieved by using the API localeconv.

You can see the result for the Czech locale. Based on the information in lconv, the programmer can write his or her own formatter. In practice, this is not so easy to do. Most UNIX systems provide additional routines, although they may not be portable. David Schmitt has augmented Microsoft’s Posix offerings in International Programming for Microsoft Windows, and you may find his offerings useful.
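To give an idea of what writing your own formatter involves, here is a deliberately simplified sketch that inserts the thousands separator from lconv into an unsigned integer. The function names are invented, and the code assumes groups of three digits rather than interpreting the grouping field of lconv, which a complete formatter would have to do.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write value into buf with sep inserted between
   groups of three digits. Returns 1 on success, 0 if buf is too small. */
int group_digits(unsigned long value, const char *sep, char *buf, size_t size)
{
    char digits[32];
    size_t ndigits, seplen = strlen(sep), i, pos = 0;

    ndigits = (size_t)sprintf(digits, "%lu", value);
    for (i = 0; i < ndigits; i++) {
        /* A separator precedes digit i when a whole number of
           three-digit groups remains and this is not the first digit. */
        if (i > 0 && (ndigits - i) % 3 == 0) {
            if (pos + seplen >= size)
                return 0;
            memcpy(buf + pos, sep, seplen);
            pos += seplen;
        }
        if (pos + 1 >= size)
            return 0;
        buf[pos++] = digits[i];
    }
    buf[pos] = '\0';
    return 1;
}

/* Hypothetical helper: group using the thousands separator of the
   current LC_NUMERIC locale. Simplification: the grouping field of
   lconv is ignored and groups of three are assumed. */
int format_with_lconv(unsigned long value, char *buf, size_t size)
{
    const struct lconv *lc = localeconv();
    return group_digits(value, lc->thousands_sep, buf, size);
}
```

In the C locale thousands_sep is the empty string, so format_with_lconv leaves 1234567 ungrouped. Called directly with a space as the separator, group_digits produces 1 234 567.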

Sometimes, Posix does not supply the most accurate results. In the Czech example, the date format should read 3. února 2001, that is, with no leading zero on the 3 and with the month name, February in this case, in the genitive case. A Czech speaker would read the date as the third of February, 2001. The “a” ending added to únor provides the sense of “of.”
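The date in the dialog comes from a call along the following lines. This is a sketch only: the function name format_long_date and the format string are chosen for illustration and are not taken from the demo program.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper: format a calendar date with strftime under the
   current LC_TIME setting. Returns the number of bytes written,
   or 0 if buf is too small. */
size_t format_long_date(int year, int month, int day, char *buf, size_t size)
{
    struct tm when;

    memset(&when, 0, sizeof when);
    when.tm_year = year - 1900;     /* struct tm counts years from 1900 */
    when.tm_mon  = month - 1;       /* and months from zero */
    when.tm_mday = day;
    return strftime(buf, size, "%d. %B %Y", &when);
}
```

In the C locale, the third of February 2001 comes out as 03. February 2001. The %d conversion always pads to two digits and %B supplies whatever single month name the locale data holds, which is why strftime alone cannot produce the form a Czech reader expects.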


Creating a Posix Locale String

  Here is a code fragment that will help you create a Posix locale string from a Windows LCID. The fragment is taken from a callback function used to fill out the drop down list in the Posix application mentioned above. The input parameter is a Windows LCID in string format. Only the relevant parts for building the string are shown.

BOOL CALLBACK CPLocList::LocaleCallBack(LPTSTR lpszLocale)
{

    LCID lcid = _tcstoul(lpszLocale, NULL, 16);
    CString lstr = lcinfo.GetLocaleInfoString(LOCALE_SABBREVLANGNAME);
    CString cstr = lcinfo.GetLocaleInfoString(LOCALE_SABBREVCTRYNAME);
    CString ansicp = lcinfo.GetLocaleInfoString(LOCALE_IDEFAULTANSICODEPAGE);
    CString oemcp = lcinfo.GetLocaleInfoString(LOCALE_IDEFAULTCODEPAGE);
    CString output;
    output.Format(_T("%s_%s.%s"), lstr, cstr, ansicp);
    /* optional if you want to use locales with OEM code pages
    output.Format(_T("%s_%s.%s"), lstr, cstr, oemcp); */
    ... // other code
    return TRUE;
}
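Stripped of the MFC string class, the assembly step is a single formatted write. Here is a plain C sketch with an invented function name; the three components are assumed to have been obtained already, for example from the locale information calls shown above.

```c
#include <stdio.h>

/* Hypothetical helper: assemble language_territory.code_page into buf.
   Returns 1 on success, 0 if buf is too small or formatting fails. */
int build_locale_string(const char *language, const char *territory,
                        const char *code_page, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s_%s.%s", language, territory, code_page);
    return n >= 0 && (size_t)n < size;
}
```

Passing "Japanese", "Japan" and "932" yields Japanese_Japan.932, the string used in the correct parsing example.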


Interestingly, the Win32 API GetDateFormat knows how to format the Czech date correctly, genitive month name included. Windows is one of the few operating systems that manage this grammatical nicety, not only for Czech but also for the other languages (mostly Slavic) that require it. Java, despite its rich locale support, still does not implement this simple addition to cultural correctness.

The lconv structure is itself based on the char data structure and has no wide character implementation. But the program has both Unicode and non-Unicode versions; the presentation you see here is the Unicode version. So, to insert the values from lconv into the dialog (where the controls are expecting a Unicode string), a conversion to a Unicode string or character was required. For me, this was an opportunity to verify the operation of the mbstowcs (multibyte string to wide character string) and its character version mbtowc (multibyte character to wide character). Win32 documentation says Posix affects neither, but this is incorrect.
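The conversion of one lconv field can be sketched as follows. The function name is invented, and the demo program's own code is not reproduced here.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the decimal_point field of lconv to a
   wide string. mbstowcs interprets the bytes according to the
   LC_CTYPE category of the current Posix locale.
   Returns 1 on success, 0 on a conversion error or a full buffer. */
int decimal_point_wide(wchar_t *wbuf, size_t count)
{
    const struct lconv *lc = localeconv();
    size_t n = mbstowcs(wbuf, lc->decimal_point, count);

    if (n == (size_t)-1 || n >= count)
        return 0;   /* invalid sequence, or no room for the terminator */
    return 1;
}
```

In the C locale the result is the one-character wide string containing a period. For the conversion to be meaningful under other locales, the LC_CTYPE setting has to agree with the encoding of the strings that localeconv returns.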

Another related pair, wcstombs and wctomb, performs conversions in the opposite direction, from a wide character string to the corresponding multibyte one. I have found that many Windows programmers believe that the code page (implicit, since there is no parameter) is determined by the Windows System locale, but this is quite wrong. If the programmer does nothing, the default locale is C and the conversion is to ISO-8859-1, which does not correspond to any Windows code page (although its printable characters are a subset of Windows 1252). Accompanying this article is a command-line demo program called wcscmd that you can use to explore the behavior of wcstombs with various Posix locale settings. Be sure to read the readme.1st file that goes with the program. It will convince you that Windows 1252 and ISO-8859-1 are not the same and that the differences are more significant than is commonly believed.


Readme for wcscmd.cpp

  The program wcscmd converts an array of Unicode characters to a multibyte string by using the ANSI C Library function wcstombs. The reason for writing this program is that many Windows developers seem to be unaware that the behavior of wcstombs is entirely determined by the Posix locale setting and is not affected by the current Windows system or user locale. Indeed, the Posix locale must be explicitly set as required (usually at program startup). Otherwise, the default Posix locale is “C” and the behavior in Windows is for wcstombs to simply attempt conversion to ISO-8859-1. It does not default to Windows Code Page 1252.

  To use the program, type wcscmd posix_locale_string + series of Unicode characters UC1 UC2 ... The locale string can be "C" (or C), the empty string "", or an acceptable Microsoft-style Posix locale string such as English_US.1252, Japanese, FRA_CHE.850 and so on. More information is available in MSDN documentation; start by looking up information on setlocale. The Unicode entries should be 4 digit hex entries such as 0041, 0300, 4E00, etc.

Example: wcscmd C 0080 0083 generates the output:
Startup Posix locale = C
New LCTYPE Posix locale = C
Input string is: 0080 0083
Output string is: 80 83

Explanation: U+0080 is the first C1 control character in Unicode and U+0083 is the fourth. Both exist in ISO-8859-1 and are located at 0x80 and 0x83, respectively.

Example: wcscmd English 0080 0083 generates the output:
Startup Posix locale = C
New LCTYPE Posix locale = English_United States.1252
Input string is: 0080 0083
Conversion error

Explanation: the conversion error occurs because neither U+0080 nor U+0083 has a target in Windows Code Page 1252.

Example: wcscmd English 20AC 0192 generates the output:
Startup Posix locale = C
New LCTYPE Posix locale = English_United States.1252
Input string is: 20AC 0192
Output string is: 80 83

Explanation: U+20AC is the Euro symbol. This character lives at 0x80 in Code Page 1252. U+0192 is the Florin symbol. This character exists in Code Page 1252 at 0x83.

Example: wcscmd Japanese 3000 FF80 generates the output:
Startup Posix locale = C
New LCTYPE Posix locale = Japanese_Japan.932
Input string is: 3000 FF80
Output string is: 81 40 C0

Explanation: U+3000 is the Far East space character and it maps to the wide character space at 0x8140 in the double-byte portion of Code Page 932. U+FF80 is the location in Unicode of the Compatibility katakana syllabic character ta. The target is 0xC0, which is the location in the single-byte portion of Code Page 932 of the (so-called) half-width ta.

Example: wcscmd FRS_CHE.850 00c0 generates the output:
Startup Posix locale = C
New LCTYPE Posix locale = French_Switzerland.850
Input string is: 00C0
Output string is: B7

Explanation: FRS_CHE.850 sets the Posix locale to French Switzerland with the DOS Code Page 850 on Windows systems. The Unicode character U+00c0 is the French A with grave accent. It maps to 0xB7 in Code Page 850.


Code for wcscmd.cpp

// wcscmd.cpp : Defines the entry point for the console application.
//

// Standard headers used in place of the Visual C++ precompiled header stdafx.h
#include <locale.h>   // setlocale
#include <stdio.h>    // printf, fprintf
#include <stdlib.h>   // strtol, wcstombs, exit, MB_CUR_MAX

int main(int argc, char* argv[])
{

  if (argc == 1) {
    fprintf(stderr, "wcscmd converts a string of Unicode characters to a multibyte string\n");
    fprintf(stderr, "according to a specified posix locale\n\n");
    fprintf(stderr, "Usage: wcscmd posix_locale_string + series of Unicode characters UC1 UC2 ...\n");
    fprintf(stderr, "posix_locale_string must be C, the empty string \"\"\n");
    fprintf(stderr, "or an acceptable Posix locale string such as English_US.1252,\n");
    fprintf(stderr, "Japanese, FRA_CHE.850, etc.\n\n");
    fprintf(stderr, "Unicode entries should be 4 digit hex such as 0041, 0300, 4E00, etc.\n");
    fprintf(stderr, "Example: wcscmd Japanese 3000 00A1\n");
    fprintf(stderr, "\n");
    exit(1);
  }

  printf("Startup Posix locale = %s\n", setlocale(LC_ALL, NULL));
  setlocale(LC_CTYPE, argv[1]);   // if this fails, the previous locale stays in effect
  printf("New LCTYPE Posix locale = %s\n", setlocale(LC_CTYPE, NULL));

  if (argc == 2) {
    fprintf(stderr, "wcscmd: No Unicode list entered.\n\n");
    exit(2);
  }

  wchar_t *in = new wchar_t[argc];   // argc - 2 characters plus the terminator

  printf("Input string is: ");
  int j = 0;   // declared outside the loop because it is needed afterward
  for (int i = 2; i < argc; i++) {
    in[j++] = (wchar_t)strtol(argv[i], NULL, 16);
    printf("%s ", argv[i]);
  }
  in[j] = 0;
  printf("\n");

  size_t sizeout = j * MB_CUR_MAX + 1;   // worst case for this locale plus the terminator
  char *out = new char[sizeout];
  size_t len = wcstombs(out, in, sizeout);

  if (len != (size_t)-1) {   // wcstombs returns (size_t)-1 on a conversion error
    printf("Output string is: ");
    for (size_t k = 0; k < len; k++) {
      unsigned char c = out[k];
      printf("%02X ", c);
    }
    printf("\n");
  } else {
    printf("Conversion error\n");
  }

  delete[] out;
  delete[] in;
  return 0;
}


Posix Locale Is Still Useful

Even though the Posix locale is now quite an old idea and may eventually be replaced by the C++ <locale> template classes, it remains very useful even on systems such as Windows, which has augmented and extended its functionality with a proprietary API set. For me, the Posix locale has been the only viable solution to some sticky internationalization problems. Especially for Win32 developers, it is worth the trouble to learn how it works and to keep it in mind when dealing with cross-platform functionality.



Bill Hall is an internationalization consultant and a member of the MultiLingual Computing & Technology editorial board. He can be reached at billhall@mlmassoc.com


This article reprinted from #39 Volume 12 Issue 3 of
MultiLingual Computing & Technology published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.

April/May, 2001


 
     

 


webmaster@multilingual.com ©1998-2010, Copyright MultiLingual Computing, Inc. No duplication or reproduction without expressed written permission.