.. _module-pw_tokenizer-proto:

------------------------------------
Tokenized fields in protocol buffers
------------------------------------
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)
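
These tokenized forms are views of the same data: the binary-encoded message
begins with the 32-bit token in little-endian byte order, and the Base64 form
is the binary message encoded as Base64 and prefixed with ``$``. The short
Python sketch below (standard library only, not part of ``pw_tokenizer``)
shows how the example values above relate.

.. code-block:: python

   import base64
   import struct

   # The binary-encoded tokenized message from the list above.
   binary = bytes.fromhex('89b69f70')

   # The first four bytes, read as a little-endian integer, are the token.
   token = struct.unpack('<I', binary[:4])[0]
   assert token == 0x709fb689

   # Base64-encoding the binary message and prefixing '$' gives the
   # Base64-encoded tokenized message.
   assert '$' + base64.b64encode(binary).decode() == '$ibafcA=='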

``pw_tokenizer`` provides tools for working with protobuf fields that may
contain tokenized text.

Tokenized field protobuf option
===============================
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }
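
Either representation may be stored in the annotated field. As a hedged
illustration (assuming Python protobuf code generated from the message above),
a host might store plain text while a tokenizing device stores the binary
tokenized message:

.. code-block:: python

   message = MessageWithOptionallyTokenizedField()

   # Plain UTF-8 text may be stored directly in the optionally tokenized field.
   message.maybe_tokenized = 'This is plain text'.encode()

   # A tokenizing device stores the binary tokenized message instead: the
   # 32-bit token, optionally followed by encoded arguments.
   message.maybe_tokenized = bytes.fromhex('89b69f70')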

Decoding optionally tokenized strings
=====================================
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can be decoded reliably by attempting to
decode the field as binary or Base64 tokenized data before treating it as
plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

   flowchart TD
     start([Received bytes]) --> binary

     binary[Decode as<br>binary tokenized] --> binary_ok
     binary_ok{Detokenizes<br>successfully?} -->|no| utf8
     binary_ok -->|yes| done_binary([Display decoded binary])

     utf8[Decode as UTF-8] --> utf8_ok
     utf8_ok{Valid UTF-8?} -->|no| base64_encode
     utf8_ok -->|yes| base64

     base64_encode[Encode as<br>tokenized Base64] --> display
     display([Display encoded Base64])

     base64[Decode as<br>Base64 tokenized] --> base64_ok

     base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
     base64_ok -->|yes| base64_results

     is_plain_text{Text is<br>printable?} -->|no| base64_encode
     is_plain_text -->|yes| plain_text

     base64_results([Display decoded Base64])
     plain_text([Display text])

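
The same flow can be expressed in code. The following is a minimal sketch, not
the actual ``pw_tokenizer.proto`` implementation; it assumes a detokenizer
object whose ``detokenize()`` method returns a result that reports success via
``ok()`` and that formats as the decoded text via ``str()``.

.. code-block:: python

   import base64
   import re

   def _detokenize_base64(detokenizer, text: str, prefix: str) -> str:
       """Replaces any prefixed Base64 tokens in the text that detokenize."""
       def replace(match: re.Match) -> str:
           try:
               result = detokenizer.detokenize(base64.b64decode(match.group(1)))
           except ValueError:  # Not valid Base64; leave this span unchanged.
               return match.group(0)
           return str(result) if result.ok() else match.group(0)

       return re.sub(re.escape(prefix) + r'([A-Za-z0-9+/=]+)', replace, text)

   def decode_optionally_tokenized(detokenizer, data: bytes, prefix: str = '$') -> str:
       # 1. Attempt to decode the bytes as a binary tokenized message.
       result = detokenizer.detokenize(data)
       if result.ok():
           return str(result)

       # 2. If the bytes are not valid UTF-8, re-encode them as tokenized
       #    Base64 so the message stays recognizable and decodable later.
       try:
           text = data.decode()
       except UnicodeDecodeError:
           return prefix + base64.b64encode(data).decode()

       # 3. Attempt to decode the text as Base64 tokenized data. If it was
       #    fully or partially detokenized, display that result.
       detokenized = _detokenize_base64(detokenizer, text, prefix)
       if detokenized != text:
           return detokenized

       # 4. Nothing detokenized: show printable text as-is; otherwise fall
       #    back to tokenized Base64.
       if text.isprintable():
           return text
       return prefix + base64.b64encode(data).decode()
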
Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields yields correct results in
almost every situation. In rare circumstances it can fail, but these failures
can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, an
incorrect message could be displayed. This is very unlikely to occur. While
many tokens incidentally form valid UTF-8 strings, it is highly unlikely that a
device will happen to log one of these strings as plain text; the overwhelming
majority of them are nonsense.

Implementations that wish to guard against this extremely improbable situation
can prevent it by appending 0xFF (or another byte that is never valid in UTF-8)
to binary tokenized data that happens to be valid UTF-8 (or to all binary
tokenized messages, if desired). When decoding, a trailing 0xFF byte is
discarded.
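
As a hedged sketch of this mitigation (the helpers below are illustrative and
not part of ``pw_tokenizer``), the guard byte could be added and removed as
follows:

.. code-block:: python

   def append_utf8_guard(tokenized: bytes) -> bytes:
       """Appends 0xFF to tokenized data that could be mistaken for text."""
       try:
           tokenized.decode()  # Valid UTF-8 could be misread as plain text.
       except UnicodeDecodeError:
           return tokenized  # Already cannot be mistaken for UTF-8 text.
       return tokenized + b'\xff'  # 0xFF never appears in valid UTF-8.

   def strip_utf8_guard(data: bytes) -> bytes:
       """Discards a trailing 0xFF guard byte before detokenizing."""
       return data[:-1] if data.endswith(b'\xff') else data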

Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII. Since each byte has a 1 in 2 chance of being ASCII, 6.25%
(1 in 16) of 4-byte sequences are composed only of ASCII characters. When
decoding with an out-of-date token database, it is therefore possible that some
binary tokenized messages will be displayed as plain text rather than as
tokenized Base64.

This situation is likely to occur occasionally, but it should be infrequent,
and even when it does happen it is not a serious issue. A very small number of
strings will be displayed incorrectly, but those strings cannot be decoded
anyway. One nonsense string (e.g. ``a-D1``) would be displayed instead of
another (``$YS1EMQ==``). Updating the token database would resolve the issue,
though the non-Base64 logs would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending a byte
that is never valid in UTF-8 prevents the undecoded binary message from being
interpreted as plain text.

Python library
==============
The ``pw_tokenizer.proto`` module defines functions that may be used to
detokenize protobuf objects in Python. The function
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:

.. code-block:: python

   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'

pw_tokenizer.proto
------------------
.. automodule:: pw_tokenizer.proto
   :members: