Filename Encoding

The various character encodings that are possible for filenames can be quite confusing.

Character encodings

Character encoding: a specific byte sequence represents a specific character. In the classical ASCII character encoding, one byte represents one character unambiguously. ASCII only defines the codes 0-127, so only 7 bits are actually needed.
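As a small illustration of the one-byte-per-character property, the following sketch encodes a sample string as ASCII and checks that every byte fits into 7 bits. Python is used here only for demonstration, and the sample string and variable names are arbitrary choices, not taken from any Wireshark code.

```python
# Encode a plain ASCII string and inspect the resulting bytes.
text = "Wireshark"
ascii_bytes = text.encode("ascii")

# One byte per character, and every byte value is below 128 (7 bits).
byte_values = list(ascii_bytes)
all_seven_bit = all(value < 128 for value in byte_values)

print(byte_values)
print("fits in 7 bits:", all_seven_bit)
```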

The problem starts because many languages need characters that ASCII does not provide, e.g. the German lower case 'u' with umlaut, 'ü'. With so many languages out there, a large number of additional characters is needed.
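To make this concrete, here is a minimal sketch showing that 'ü' cannot be encoded as ASCII at all, and that it becomes different byte sequences under a few common encodings. The selection of encodings is an illustrative choice, and Python's codec names are used only for demonstration.

```python
ch = "ü"

# ASCII has no code for 'ü', so encoding fails.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same character as bytes in several encodings.
encoded = {
    name: ch.encode(name)
    for name in ("latin-1", "cp1252", "utf-8", "utf-16-le", "utf-32-le")
}

print("encodable as ASCII:", ascii_ok)
for name, data in encoded.items():
    print(f"{name:10s} {data.hex(' ')}")
```

The single-byte code pages use one byte (0xFC), UTF-8 uses two bytes (0xC3 0xBC), and UTF-16 and UTF-32 use two and four bytes respectively.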

Some possible encodings of non-ASCII characters:

GLib filenames

Until GLib 2.6, filenames were kept in the encoding of the current code page. This is easy to implement, but the character codes are ambiguous: the same byte value stands for different characters in different code pages. This causes problems if, for example, a Japanese code page is currently selected and you want to read a file with a "French filename".
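The ambiguity can be sketched as follows. This is not how GLib or Wireshark handle filenames. It only demonstrates, under the assumption that the "French filename" was written with the Western European code page cp1252, what happens when the same bytes are interpreted under another code page. The filename and the code pages chosen (cp1251 for Cyrillic, cp932 for Japanese) are hypothetical examples.

```python
# A "French" filename stored as bytes in the Western European code page.
french_name = "café.txt"
raw_bytes = french_name.encode("cp1252")

# The same bytes interpreted under different code pages.
as_western = raw_bytes.decode("cp1252")
as_cyrillic = raw_bytes.decode("cp1251")
# In cp932 the byte 0xE9 starts a two-byte sequence, so the name cannot be
# decoded cleanly; errors="replace" substitutes a replacement character.
as_japanese = raw_bytes.decode("cp932", errors="replace")

# UTF-8 needs no code page selection to round-trip the name.
utf8_round_trip = french_name.encode("utf-8").decode("utf-8")

print(repr(as_western))
print(repr(as_cyrillic))
print(repr(as_japanese))
print(repr(utf8_round_trip))
```

Only the reader that happens to use the writer's code page recovers the original name, which is why an encoding that does not depend on the selected code page is preferable for filenames.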

Since GLib 2.6, the character encoding of filenames is split into:

This requires the following changes compared to GLib versions prior to 2.6:

As the GTK+ standard widgets in question (e.g. gtk_file_chooser) work internally with the correct filename encoding, nothing needs to change here.

Wireshark

Wireshark gets filename input from several points:

I currently don't know if encoding conversions are done properly on all (especially old) *nix versions. - UlfLamping

External references

UTF-8 and Unicode FAQ

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Wikipedia article about Unicode

Wikipedia article about UTF-8

Wikipedia article about UTF-16 and UCS-2

Wikipedia article about UTF-32 and UCS-4

Development/FilenameEncoding (last edited 2008-04-12 17:50:30 by localhost)