String handling in dissectors

(Much but not all of the content of this page is taken directly from Guy's email to the wireshark-dev mailing list.)

Character strings can use various encodings to represent characters, such as system code pages, UTF-8, and UTF-16; see Character encodings for details about those encodings.

String handling in many applications is relatively straightforward. A library is used for reading/writing text in the locale-appropriate encoding, and everything is handled in Unicode (usually UTF-8) internally. Wireshark doesn't get things quite so easy.

The primary problem is that Wireshark has to be able to gracefully process and handle invalid strings in all sorts of encodings. If a packet contains a malformed string in some obscure encoding, Wireshark has to be able to flag it as such and then continue processing that packet. We're not there yet.

This page is half proposal, half documentation for how Wireshark's string handling engine does or ought to work. Much of the content came from discussions on the wireshark-dev mailing list and in Bugzilla bug reports.

If you have questions, suggestions or ideas on this topic, please send an email to the wireshark-dev@wireshark.org mailing list.

First Principles

A character string is a sequence of code points from a character set. It's represented as a sequence of octets using a particular encoding for that character set, wherein each character is represented as a 1-or-more-octet subsequence in that sequence.

In many of those encodings, not all subsequences of octets correspond to code points in the character set. For example:

etc.
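As one illustration for UTF-8: a lone 0xC0 octet, a continuation octet with no lead octet, an overlong form such as 0xC0 0x80, and an encoded surrogate such as 0xED 0xA0 0x80 do not correspond to any code point. The sketch below checks a buffer for such sequences. The names utf8_seq_len and utf8_is_valid are hypothetical helpers written for this page, not Wireshark APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Wireshark API).
 * Returns the length (1-4) of the valid UTF-8 sequence starting at p,
 * or 0 if the octets at p do not form a valid sequence. */
size_t utf8_seq_len(const uint8_t *p, size_t avail)
{
    size_t need;
    uint32_t cp, min;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { need = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; min = 0x10000; }
    else
        return 0;           /* stray continuation octet, or 0xF8..0xFF */

    if (avail < need)
        return 0;           /* sequence is cut off by the end of the buffer */
    for (size_t i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;       /* expected a continuation octet */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min)
        return 0;           /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;           /* out of range, or a surrogate */
    return need;
}

/* Returns true if the whole buffer is well-formed UTF-8. */
bool utf8_is_valid(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            return false;
        i += n;
    }
    return true;
}
```

Other encodings need their own checks (for example, unpaired surrogates in UTF-16); this sketch covers only UTF-8.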

Wireshark String Use Cases

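Strings in Wireshark are displayed to users, used in packet-matching expressions, processed internally, and exported to other programs.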

In all of these cases, we need to do something with the invalid octet sequences.

Displayed to Users

In the display case, invalid octet sequences should be displayed as a sequence of \xNN escape sequences, one octet at a time. Non-printable characters are an orthogonal issue; they *can* be represented in our UTF-8 encoding of Unicode, but they shouldn't be displayed in the UI as themselves. They should also be replaced, when displaying, with escape sequences:

(For the future, we might want to have the "value", in a protocol tree, of a string be a combination of the encoding and the raw octets; if some code wants the value for processing purposes, it'd call a routine that converts the value to UTF-8 with REPLACEMENT CHARACTER replacing invalid sequences, and if it wants the value for display purposes, it'd call a routine that converts it to UTF-8 with escape sequences replacing invalid sequences *and* with non-printable characters shown as the appropriate escape sequence.

That raises the question of whether, when building a protocol tree, we need to put the value into the protocol tree item at all if the item is created with proto_tree_add_item(), or whether we just postpone extracting the value until we actually need it. Lazy processing FTW...)
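A display-side conversion along these lines might look like the following sketch. To keep it short it assumes the source encoding is ASCII, so every octet with the high bit set counts as invalid; a real routine would dispatch on the field's encoding. The function name format_ascii_for_display is hypothetical, and a literal backslash is passed through unescaped, so the output is meant for display only, not for round-tripping.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Wireshark API).
 * Writes a display form of an ASCII-encoded field into out (capacity cap).
 * Invalid octets (>= 0x80 in ASCII) become \xNN, one octet at a time;
 * non-printable characters become C-style escapes where one exists and
 * \xNN otherwise. Output is truncated if it does not fit.
 * Returns the number of chars written, excluding the terminating NUL. */
size_t format_ascii_for_display(const uint8_t *buf, size_t len,
                                char *out, size_t cap)
{
    size_t o = 0;

    if (cap == 0)
        return 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        char piece[5];      /* longest piece is "\xNN" plus NUL */
        int n;

        if (c >= 0x80)                      /* invalid in ASCII */
            n = snprintf(piece, sizeof piece, "\\x%02x", (unsigned)c);
        else if (c == '\n')
            n = snprintf(piece, sizeof piece, "\\n");
        else if (c == '\r')
            n = snprintf(piece, sizeof piece, "\\r");
        else if (c == '\t')
            n = snprintf(piece, sizeof piece, "\\t");
        else if (c < 0x20 || c == 0x7F)     /* other control characters */
            n = snprintf(piece, sizeof piece, "\\x%02x", (unsigned)c);
        else                                /* printable ASCII */
            n = snprintf(piece, sizeof piece, "%c", c);

        if (o + (size_t)n >= cap)
            break;                          /* no room left: truncate */
        memcpy(out + o, piece, (size_t)n);
        o += (size_t)n;
    }
    out[o] = '\0';
    return o;
}
```

For the octets 0x61 0xC0 0x0A 0x01 0x62 this produces the twelve characters a\xc0\n\x01b.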

Packet-Matching Expressions

When using a string field in a packet-matching expression:

In addition, there should be a monadic function "valid" which takes a string field and returns a boolean indicating whether the string contains any invalid octet sequences.

Again, non-printing characters are an orthogonal question. Users should be able to specify both C-style escapes ("\n", etc.) and Unicode escapes ("\uXXXX") in string comparison constants. This means that if you want to match a literal "\" and you're typing in the shell, you need to type "\\\\" for all the escapes to process correctly. Yuck.

Internal Processing

In the "processed internally" case, if the part of the string that's being looked at contains an invalid octet sequence, the processing should fail, otherwise the processing should still work. For example, an HTTP request beginning with 0x47 0x45 0x54 0x20 0xC0 should be treated as a GET request with the operand being invalid, but an HTTP request beginning with 0x47 0x45 0x54 0xC0 should be treated as an invalid request.
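The two HTTP cases above can be sketched as a small classifier. This is only an illustration of "fail only if the part being looked at is invalid": it recognizes nothing but GET, and it assumes that any operand octet outside printable ASCII (0x21 to 0x7E) is invalid. The names req_class_t and classify_request are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    REQ_INVALID,          /* the method itself could not be parsed */
    REQ_GET_OK,           /* GET with a clean operand */
    REQ_GET_BAD_OPERAND   /* GET, but the operand has invalid octets */
} req_class_t;

/* Hypothetical helper (not a Wireshark API). */
req_class_t classify_request(const uint8_t *buf, size_t len)
{
    static const uint8_t method[] = { 'G', 'E', 'T', ' ' };

    /* The method token is the part being looked at first: if it does not
     * match exactly, the whole request is invalid. */
    if (len < sizeof method || memcmp(buf, method, sizeof method) != 0)
        return REQ_INVALID;

    /* The method parsed fine, so an invalid octet in the operand only
     * invalidates the operand, not the request type. */
    for (size_t i = sizeof method; i < len; i++) {
        if (buf[i] == ' ' || buf[i] == '\r' || buf[i] == '\n')
            break;                      /* end of the operand */
        if (buf[i] < 0x21 || buf[i] > 0x7E)
            return REQ_GET_BAD_OPERAND;
    }
    return REQ_GET_OK;
}
```

With this sketch, 0x47 0x45 0x54 0x20 0xC0 yields REQ_GET_BAD_OPERAND and 0x47 0x45 0x54 0xC0 yields REQ_INVALID.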

The display filter engine should use an internal string representation that allows working with embedded null bytes (C-style strings are out). We need to check whether external tools and dependencies can handle that (PCRE2 does).
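A minimal sketch of why C-style strings are out: strcmp() stops at the first null byte, so two values that differ only after an embedded null compare as equal, while a counted comparison sees the difference. The type counted_str_t and the function counted_str_equal are hypothetical names for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string view: a pointer plus an explicit length,
 * so embedded null bytes are ordinary data. */
typedef struct {
    const char *data;
    size_t      len;
} counted_str_t;

/* Compares every octet of both strings, including any embedded nulls. */
bool counted_str_equal(counted_str_t a, counted_str_t b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

For example, the five-octet values "ab\0cd" and "ab\0xy" are equal according to strcmp() but not according to counted_str_equal().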

Exporting to Other Programs

There seem to be two probable use cases:

These two should cover 99% of cases I can think of with relatively minimal effort on our part. The second should be the default, since the most frequent case of "other program" is probably "stdout of a shell" or "text file".

API Design

Invalid Sequences

The functions that get strings from packets should not map invalid octet sequences to a sequence of \xNN escape sequences, as that would interfere with proper handling of the string when doing packet matching and internal processing. For those cases, perhaps a combination of

  1. replacing invalid sequences with REPLACEMENT CHARACTER and
  2. providing a separate indication that this was done

would be the right thing to do. However, this throws away information, so that you can't display that string with the invalid sequences shown as \xNN sequences.

For now, my inclination is to continue with the "replace invalid sequences with REPLACEMENT CHARACTER in tvb_get_string* routines" strategy, but not treat that as the ultimate solution.
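The "replace and report" combination could be sketched as follows. This is not the tvb_get_string* implementation; it is a standalone illustration that assumes an ASCII source encoding (so any octet with the high bit set is invalid) and uses plain malloc() rather than Wireshark's memory allocator. The name ascii_to_utf8 is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a Wireshark API).
 * Converts an ASCII-encoded field to UTF-8. Each invalid octet is replaced
 * with U+FFFD REPLACEMENT CHARACTER (0xEF 0xBF 0xBD), and *had_invalid
 * records separately that a replacement happened.
 * Returns a malloc()ed, NUL-terminated buffer (caller frees), or NULL on
 * allocation failure. *out_len receives the length in octets. */
uint8_t *ascii_to_utf8(const uint8_t *in, size_t len,
                       size_t *out_len, bool *had_invalid)
{
    /* Worst case: every input octet turns into a 3-octet U+FFFD. */
    uint8_t *out = malloc(len * 3 + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;
    *had_invalid = false;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];           /* valid ASCII is valid UTF-8 */
        } else {
            out[o++] = 0xEF;            /* U+FFFD encoded as UTF-8 */
            out[o++] = 0xBF;
            out[o++] = 0xBD;
            *had_invalid = true;
        }
    }
    out[o] = '\0';
    *out_len = o;
    return out;
}
```

The flag tells a caller that something was replaced, but as noted above the original octets are gone, so the \xNN display form can no longer be produced from the result.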

Buffer Length

Functions get strings either by length (tvb_get_string) or by stopping at the first null terminator (tvb_get_stringz). When fetching by length, the function passes along embedded nulls as-is. This leads to a small problem, though: the caller has no reliable way to determine the size of the returned buffer. strlen() stops at the first embedded null, and the size cannot be computed from the input length either (if the input is in a non-UTF-8 encoding that may include code points beyond the basic set, it is impossible to predict the number of bytes taken by the UTF-8 encoding of that string).

Therefore, the tvb_get_string function should eventually be converted to return a counted string (wmem_strbuf_t).
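To illustrate the length problem, the sketch below converts ISO 8859-1 (Latin-1) octets to UTF-8 and returns the result together with its length. It uses a hypothetical counted_buf_t struct and plain malloc() as stand-ins; it does not use the wmem_strbuf_t API. The name latin1_to_utf8 is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical counted buffer: data plus an explicit length in octets. */
typedef struct {
    uint8_t *data;   /* malloc()ed; caller frees; NULL on failure */
    size_t   len;    /* octets in data, not counting the trailing NUL */
} counted_buf_t;

/* Converts ISO 8859-1 octets to UTF-8. Octets below 0x80 stay one octet;
 * octets 0x80..0xFF each become two, so the output length depends on the
 * content and not just on the input length. Embedded nulls are copied
 * through unchanged. */
counted_buf_t latin1_to_utf8(const uint8_t *in, size_t len)
{
    counted_buf_t out = { NULL, 0 };

    out.data = malloc(len * 2 + 1);     /* worst case: every octet doubles */
    if (out.data == NULL)
        return out;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out.data[out.len++] = in[i];
        } else {
            out.data[out.len++] = (uint8_t)(0xC0 | (in[i] >> 6));
            out.data[out.len++] = (uint8_t)(0x80 | (in[i] & 0x3F));
        }
    }
    out.data[out.len] = '\0';
    return out;
}
```

For the six input octets 0x63 0x61 0x66 0xE9 0x00 0x78, the result is seven octets long, while strlen() on the returned data reports only five because it stops at the embedded null. Only the explicit length is correct.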


Imported from https://wiki.wireshark.org/Development/StringHandling on 2020-08-11 23:13:08 UTC