
Some rant on Cocoa's so-called 'default string encoding'

Summary
Detailed explanation

One of the most frequently used characters in TeX is the backslash \ . However, if you buy a textbook on TeX in Japan, you're likely to see it using the yen symbol ¥ instead of the backslash \. This has a historical origin, dating from when there was no wide character support. The Japanese standardization committee, which saw the need to include the currency symbol in the 8-bit character code, modified the original ASCII table slightly and placed the yen at the code point for the backslash, since \ was used relatively rarely. Hence all three major encodings, the EUC code used on Unices, the SJIS code used in MS products, and the JIS code used on the Internet before UTF became widespread, had the yen in place of the backslash. As a TeX implementation just sees the ASCII code of each character, we are used to writing ¥documentclass instead of \documentclass, etc. (A related example is the MS-style file path: c:\Program Files\Etc. becomes c:¥Program Files¥Etc. Korean encodings put the won symbol ₩ at the code point for \ , so Korean users write ₩documentclass.) If this were the situation in Mac OS X, we would be in a much better position!

In Classic Mac OS, before OS X, the preferred Japanese encoding was MacJapanese. It is a slight modification of SJIS: it has the yen symbol at 0x5c, the ASCII code of the backslash, and has the backslash symbol at a separate position, 0x80. So in the classic days we used ¥documentclass and it worked great out of the box. What changed things was the introduction of OS X and its use of Unicode. At first I wondered why adopting Unicode made things worse, but it turned out as follows:

In OS X, most string handling is done using NSString or CFString, which in principle hold Unicode characters in all cases. The problem is that when you use -[NSString writeToFile:atomically:], the string is not written to the file in a UTF encoding, but in an encoding which depends on the runtime configuration in the International pane of System Preferences. This is the so-called 'default string encoding' given by +[NSString defaultCStringEncoding] (cf. Apple's documentation). ASCII is used in an English environment, and MacJapanese in a Japanese environment.
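One way to sidestep the default string encoding is to name the encoding explicitly when writing. The following is only a minimal sketch: the helper name WriteTeXSource is hypothetical, and UTF-8 is an assumption made for illustration (substitute whatever encoding your TeX installation actually expects). It uses -[NSString writeToFile:atomically:encoding:error:], which is available from 10.4.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original application):
// writes TeX source with an explicitly chosen encoding, so the bytes
// on disk no longer depend on the user's language settings.
static BOOL WriteTeXSource(NSString *source, NSString *path, NSError **error)
{
    // Assumption: UTF-8 is acceptable to the TeX that will read the file.
    // In UTF-8 the backslash U+005C is always the single byte 0x5c.
    return [source writeToFile:path
                    atomically:YES
                      encoding:NSUTF8StringEncoding
                         error:error];
}
```

With this, a string such as @"\\documentclass{article}" should begin with the byte 0x5c on disk whichever language is selected in System Preferences.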

The end result is this: when your snippet editor contains @"\\documentclass" or the like in its source file, the OS X runtime correctly understands that the first character is the backslash. But when the string is written out to a temporary file, OS X writes the backslash using the MacJapanese encoding and emits the code 0x80, which TeX does not expect at all.

Another complication is that many keyboards in Japan do not have a backslash key. They only have the yen symbol, and in the default configuration you need to press Opt-Yen to input the backslash. This is quite cumbersome, especially when you are typing a mathematical expression. Many Japanese switchers to the Mac do not know about this yen-and-backslash business, and as I said, most TeX textbooks here tell us to use the yen symbol, never mentioning the backslash. Thus we need to provide a yen-to-backslash translator in the text field. (I made an NSValueTransformer to facilitate this.)
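A yen-to-backslash transformer of this kind might look roughly like the sketch below. The class name YenToBackslashTransformer is hypothetical and this is not the actual class from the application; it only illustrates the idea of replacing U+00A5 (yen sign) with U+005C (backslash). It relies on -stringByReplacingOccurrencesOfString:withString:, which requires 10.5 or later.

```objc
#import <Foundation/Foundation.h>

// Hypothetical transformer, shown for illustration only.
@interface YenToBackslashTransformer : NSValueTransformer
@end

@implementation YenToBackslashTransformer

+ (Class)transformedValueClass
{
    return [NSString class];
}

+ (BOOL)allowsReverseTransformation
{
    // One-way in this sketch: a backslash is never turned back into a yen sign.
    return NO;
}

- (id)transformedValue:(id)value
{
    // Pass nil and non-string values through untouched.
    if (![value isKindOfClass:[NSString class]]) {
        return value;
    }
    // Build the one-character string U+00A5 YEN SIGN without relying on
    // the encoding of this source file.
    NSString *yen = [NSString stringWithFormat:@"%C", (unichar)0x00A5];
    // Replace every yen sign with U+005C, the backslash TeX expects.
    return [(NSString *)value stringByReplacingOccurrencesOfString:yen
                                                        withString:@"\\"];
}

@end
```

How such a transformer is attached to the text field is left out here. With Cocoa bindings, text typed by the user flows back to the model through -reverseTransformedValue:, so an application may prefer to perform the same replacement there.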

Most TeX snippet editors suffer from this problem, so I had to add this support. If you rely on the default encoding +[NSString defaultCStringEncoding], you'll surely be bitten by the yen-backslash thing! In any case, all of those APIs without an encoding parameter are deprecated as of 10.4; cf. Apple's documentation on deprecated methods.

This problem also appears when a Cocoa app constructs a shell script on the fly and executes it via system([string cString]) or other braindead methods. Please, please don't do this; use system([string UTF8String]) instead.
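As a minimal sketch of the UTF8String variant (the helper name RunShellScript is hypothetical):

```objc
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper: runs a shell script that was generated on the fly.
static int RunShellScript(NSString *script)
{
    // -UTF8String always produces UTF-8, in which the backslash U+005C
    // is the single byte 0x5c, independent of the user's language
    // settings. -cString would use the default C string encoding instead.
    return system([script UTF8String]);
}
```

The return value is whatever system() reports, so a script consisting of just "exit 0" should yield 0.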

(Originally written around summer 2005; deprecation info added around fall 2008; further corrected in March 2009.)