
Character encodings

A character string is a sequence of "code points" from a character set. It is represented as a sequence of octets (bytes) using a particular encoding for that character set, in which each character is encoded as a subsequence of one or more octets.

A given sequence of octets doesn't necessarily correspond to the same character string in all encodings. Most character set encodings are based on ASCII and encode the 128 code points of ASCII as a single octet whose value is the code point value. For example, the code point value for the letter 'A' is decimal 65, or hexadecimal 41 (0x41), so, in most of these character set encodings, the letter 'A' is encoded as a single octet with the value 0x41.
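
As a minimal illustration of that last point (plain C, not Wireshark code, and assuming the compiler's execution character set is ASCII-based, which is the common case), the following prints the octet value of each character in a short string; 'A' comes out as 0x41:

    #include <stdio.h>

    int main(void)
    {
        /* In an ASCII-based encoding, each of these characters is a
         * single octet whose value is its code point value. */
        const unsigned char s[] = "ABC";

        for (size_t i = 0; s[i] != '\0'; i++)
            printf("'%c' = 0x%02X\n", s[i], (unsigned)s[i]);  /* 'A' = 0x41, ... */
        return 0;
    }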

The main exceptions are EBCDIC, which is not ASCII-based, and the national variants of ISO 646, which replace some ASCII characters with other characters.

In Wireshark, EBCDIC and the ISO 646 variants are used only when dissecting packets from protocols that use those character encodings, so, in places other than dissectors, only ASCII-based encodings are used.

ASCII is sufficient for most strings in US English, but it is not sufficient for languages other than English (for example, German, which requires, among other characters, the "lower case 'u' with umlaut" character 'ü'), and not even for English in other countries (for example, it does not include the '£' character, so it does not suffice for the United Kingdom).

Thus, various extensions were made to ASCII.

Until the development of Unicode, there was no single extended version of ASCII that could be used in all cases. There were, instead, both proprietary extensions of ASCII, such as the HP Roman encodings and the Mac Roman encoding, and standard extensions, such as the ISO 8859 encodings, the JIS X 0208 encoding for Japanese, the GB 2312 encodings for Simplified Chinese, and the KS X 1001 encodings for Korean.

All of those encodings are based on ASCII and encode the 128 code points of ASCII as a single octet whose value is the code point value.
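
Where they differ is in the non-ASCII characters. A small plain-C sketch (not Wireshark code; the octet values are taken from the ISO 8859-1 and Mac OS Roman code charts for 'ü', U+00FC) shows that the same character gets a different octet value in two of the encodings mentioned above, while the ASCII letter 'A' is 0x41 in both:

    #include <stdio.h>

    int main(void)
    {
        /* 'A' is the octet 0x41 in both encodings (the shared ASCII range),
         * but 'ü' (U+00FC) is 0xFC in ISO 8859-1 and 0x9F in Mac Roman, so
         * the same octet value does not name the same character in both. */
        const unsigned char ascii_a      = 0x41;
        const unsigned char iso8859_1_ue = 0xFC;
        const unsigned char mac_roman_ue = 0x9F;

        printf("'A'      : 0x%02X in ISO 8859-1 and in Mac Roman\n", (unsigned)ascii_a);
        printf("u-umlaut : 0x%02X in ISO 8859-1\n", (unsigned)iso8859_1_ue);
        printf("u-umlaut : 0x%02X in Mac Roman\n", (unsigned)mac_roman_ue);
        return 0;
    }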

The Unicode project started in the late 1980s as an attempt to devise a single character set and encoding to handle all characters; the first Unicode standard was released in 1991. Unicode is also ASCII-based, with the first 128 code points of Unicode having the same values as the ASCII code points for the same characters. It was originally intended to be a 16-bit character encoding, in which every character was encoded as 2 octets. However, the Unicode developers realized that there would eventually be more than 65536 code points, and came up with a mechanism to allow that, reserving some 16-bit values for use as "surrogates"; a pair of surrogates (a "surrogate pair") represents a code point value that does not fit in 16 bits.
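
As a sketch of how that calculation works (plain C, not Wireshark code; the formula is the one the Unicode standard defines for UTF-16), a code point above U+FFFF is split into two 16-bit values in the reserved surrogate ranges:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x1F600;               /* a code point that doesn't fit in 16 bits */
        uint32_t v  = cp - 0x10000;          /* 20-bit value to be split */
        uint16_t hi = 0xD800 + (v >> 10);    /* high (lead) surrogate */
        uint16_t lo = 0xDC00 + (v & 0x3FF);  /* low (trail) surrogate */

        /* Prints "U+1F600 -> 0xD83D 0xDE00" */
        printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }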

Furthermore, adding support for 16-bit character encodings would have been difficult in systems, such as most UN*X systems, that had traditionally used character encodings in which non-ASCII characters are represented as multi-octet sequences. Extensions to ASCII were therefore devised in which each ASCII character is encoded as a single octet whose value is the code point value of that ASCII character; the one currently in use is UTF-8.
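
A minimal sketch of the UTF-8 encoding rules (plain C, not Wireshark's own conversion code; error checking for surrogate values and out-of-range code points is omitted): code points up to U+007F, the ASCII range, stay a single octet whose value is the code point value, and larger code points become sequences of 2 to 4 octets:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point (up to U+10FFFF) as UTF-8; returns the octet count. */
    static int utf8_encode(uint32_t cp, unsigned char out[4])
    {
        if (cp <= 0x7F) {                   /* ASCII: one octet, value == code point */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {           /* two octets */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {          /* three octets */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                            /* four octets */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        /* 'A' stays one octet; 'ü' (U+00FC), '€' (U+20AC), and U+1F600
         * need 2, 3, and 4 octets respectively. */
        uint32_t cps[] = { 0x41, 0xFC, 0x20AC, 0x1F600 };
        unsigned char buf[4];

        for (size_t i = 0; i < sizeof cps / sizeof cps[0]; i++) {
            int n = utf8_encode(cps[i], buf);
            printf("U+%04X ->", (unsigned)cps[i]);
            for (int j = 0; j < n; j++)
                printf(" 0x%02X", (unsigned)buf[j]);
            printf("\n");
        }
        return 0;
    }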

The ASCII-based encodings Wireshark has to handle in filenames and other strings supplied by and provided to the underlying operating system, as well as in packets in a network capture, include:

External references

UTF-8 and Unicode FAQ for Unix/Linux (including macOS and the BSDs)

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Wikipedia article about Unicode

Wikipedia article about UTF-8

Wikipedia article about UTF-16 and UCS-2

Wikipedia article about UTF-32 and UCS-4