.. _module-pw_tokenizer-proto:

------------------------------------
Tokenized fields in protocol buffers
------------------------------------
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

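The binary, Base64, and integer forms above are different encodings of the same
32-bit token. A quick sketch verifying that the example values in this list are
consistent with one another:

.. code-block:: python

   import base64
   import struct

   # The example token from the list above, as a 32-bit integer.
   token = 0x709fb689

   # Little-endian binary encoding: "89 b6 9f 70".
   binary = struct.pack('<I', token)
   assert binary == b'\x89\xb6\x9f\x70'

   # Base64 encoding of those bytes, prefixed with '$': "$ibafcA==".
   base64_message = '$' + base64.b64encode(binary).decode('ascii')
   assert base64_message == '$ibafcA=='
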
``pw_tokenizer`` provides tools for working with protobuf fields that may
contain tokenized text.

Tokenized field protobuf option
===============================
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

Decoding optionally tokenized strings
=====================================
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can reliably be decoded. This is accomplished
by attempting to decode the field as binary or Base64 tokenized data before
treating it like plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

   flowchart TD
      start([Received bytes]) --> binary

      binary[Decode as<br>binary tokenized] --> binary_ok
      binary_ok{Detokenizes<br>successfully?} -->|no| utf8
      binary_ok -->|yes| done_binary([Display decoded binary])

      utf8[Decode as UTF-8] --> utf8_ok
      utf8_ok{Valid UTF-8?} -->|no| base64_encode
      utf8_ok -->|yes| base64

      base64_encode[Encode as<br>tokenized Base64] --> display
      display([Display encoded Base64])

      base64[Decode as<br>Base64 tokenized] --> base64_ok

      base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
      base64_ok -->|yes| base64_results

      is_plain_text{Text is<br>printable?} -->|no| base64_encode
      is_plain_text -->|yes| plain_text

      base64_results([Display decoded Base64])
      plain_text([Display text])

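The flow above can be sketched in Python. This is a simplified illustration,
not the ``pw_tokenizer`` implementation: the token database here is a plain
``dict`` mapping 32-bit tokens to strings, and argument decoding and partial
detokenization are omitted.

.. code-block:: python

   import base64
   import struct

   def detokenize(database: dict, data: bytes):
       """Looks up a 4-byte little-endian token; returns None if unknown."""
       if len(data) != 4:  # Arguments and other token lengths are omitted.
           return None
       token, = struct.unpack('<I', data)
       return database.get(token)

   def decode_optionally_tokenized(database: dict, data: bytes) -> str:
       # 1. Attempt to decode as binary tokenized data.
       result = detokenize(database, data)
       if result is not None:
           return result

       as_base64 = '$' + base64.b64encode(data).decode('ascii')

       # 2. Check whether the bytes are valid UTF-8; if not, display as
       #    tokenized Base64 so the message can be decoded later.
       try:
           text = data.decode('utf-8')
       except UnicodeDecodeError:
           return as_base64

       # 3. Attempt to decode as Base64 tokenized data ('$' prefix).
       if text.startswith('$'):
           try:
               binary = base64.b64decode(text[1:], validate=True)
           except ValueError:
               binary = None
           if binary is not None:
               result = detokenize(database, binary)
               if result is not None:
                   return result

       # 4. Display printable text as-is; otherwise fall back to Base64.
       return text if text.isprintable() else as_base64

For example, with a database containing only ``0x709fb689``, the bytes
``89 b6 9f 70`` and the text ``$ibafcA==`` both detokenize, plain text passes
through unchanged, and unknown non-UTF-8 binary is re-encoded as Base64.
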
Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields yields correct results in
almost every situation. In rare circumstances it can fail, but these failures
can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, the
incorrect message could be displayed. This is very unlikely to occur. While many
tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
that a device will happen to log one of these strings as plain text. The
overwhelming majority of these strings will be nonsense.

An implementation that wishes to guard against this extremely improbable
situation can prevent it by appending 0xFF (or another byte that is never valid
in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or to all
binary tokenized messages, if desired). When decoding, any trailing 0xFF byte is
discarded before detokenizing.

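This mitigation can be sketched as follows. The helper names are illustrative,
not part of the ``pw_tokenizer`` API:

.. code-block:: python

   def escape_if_utf8(tokenized: bytes) -> bytes:
       """Appends 0xFF if the tokenized bytes would pass as UTF-8 text."""
       try:
           tokenized.decode('utf-8')
       except UnicodeDecodeError:
           return tokenized  # Cannot be mistaken for plain text; leave as-is.
       return tokenized + b'\xff'  # 0xFF never appears in valid UTF-8.

   def unescape(data: bytes) -> bytes:
       """Discards a trailing 0xFF escape byte before detokenizing."""
       return data[:-1] if data.endswith(b'\xff') else data
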
Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII; 6.25% of 4-byte sequences are composed only of ASCII characters.
When decoding with an out-of-date token database, it is possible that some
binary tokenized messages will be displayed as plain text rather than tokenized
Base64.

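The 6.25% figure follows from each byte of a random 4-byte token independently
being an ASCII value (high bit clear, i.e. 0x00 through 0x7F) with probability
128/256:

.. code-block:: python

   # Probability that one random byte is an ASCII value (0x00-0x7F).
   p_ascii_byte = 128 / 256

   # Probability that all four bytes of a random token are ASCII.
   p_ascii_token = p_ascii_byte ** 4
   assert p_ascii_token == 0.0625  # 6.25%
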
This situation is likely to occur, but should be infrequent. Even if it does
happen, it is not a serious issue. A very small number of strings will be
displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
Updating the token database would resolve the issue, though the non-Base64 logs
would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending
an invalid UTF-8 character prevents the undecoded binary message from being
interpreted as plain text.

Python library
==============
The ``pw_tokenizer.proto`` module defines functions that may be used to
detokenize protobuf objects in Python. The function
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:

.. code-block:: python

   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'

pw_tokenizer.proto
------------------
.. automodule:: pw_tokenizer.proto
   :members: