.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
   This usage of the term "tokenizer" is not related to parsing! The
   module is called tokenizer because it replaces a whole string literal with an
   integer token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

 * Dramatically reduce binary size by removing string literals from binaries.
 * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
   tokens instead of strings. We've seen over 50% reduction in encoded log
   contents.
 * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
 * Remove potentially sensitive log, assert, and other strings from binaries.

Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

 * **Tokenization** converts string literals in the source code to
   binary tokens at compile time. If the string has printf-style arguments,
   these are encoded to compact binary form at runtime.
 * **Detokenization** converts tokenized strings back to the original
   human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

 1. During compilation, the ``pw_tokenizer`` module hashes string literals to
    generate stable 32-bit tokens.
 2. The tokenization macro removes these strings by declaring them in an ELF
    section that is excluded from the final binary.
 3. After compilation, strings are extracted from the ELF to build a database
    of tokenized strings for use by the detokenizer. The ELF file may also be
    used directly.
 4. During operation, the device encodes the string token and its arguments, if
    any.
 5. The encoded tokenized strings are sent off-device or stored.
 6. Off-device, the detokenizer tools use the token database to decode the
    strings to human-readable form.

Example: tokenized logging
--------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ==========  | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ----------  |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ==========  |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
section describes one way ``pw_tokenizer`` might be integrated with a project.
These steps can be adapted as needed.

 1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
    are provided. For Make or other build systems, add the files specified in
    the BUILD.gn's ``pw_tokenizer`` target to the build.
 2. Use the tokenization macros in your code. See `Tokenization`_.
 3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
    linker script. In GN and CMake, this step is done automatically.
 4. Compile your code to produce an ELF file.
 5. Run ``database.py create`` on the ELF file to generate a CSV token
    database. See `Managing token databases`_.
 6. Commit the token database to your repository. See notes in `Database
    management`_.
 7. Integrate a ``database.py add`` command to your build to automatically
    update the committed token database. In GN, use the
    ``pw_tokenizer_database`` template to do this. See `Update a database`_.
 8. Integrate ``detokenize.py`` or the C++ detokenization library with your
    tools to decode tokenized logs. See `Detokenization`_.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
   %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``uintptr_t`` argument to the global handler function. Values like a log level
can be packed into the ``uintptr_t``.

This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
facade. The backend for this facade must define the
``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                              format_string_literal,
                                              arguments...);

   void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);

.. admonition:: When to use these macros

   Use anytime a global handler is sufficient, particularly for widely expanded
   macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
   ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
   for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

   PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
   use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

   uint8_t buffer[BUFFER_SIZE];
   size_t size_bytes = sizeof(buffer);
   PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
   other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
   widely expanded macros, such as a logging macro, because it will result in
   larger code size than its alternatives.

.. _module-pw_tokenizer-custom-macro:

Tokenize with a custom macro
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Projects may need more flexibility than the standard ``pw_tokenizer`` macros
provide. To support this, projects may define custom tokenization macros. This
requires the use of two low-level ``pw_tokenizer`` macros:

.. c:macro:: PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)

   Tokenizes a format string and sets the ``_pw_tokenizer_token`` variable to
   the token. Must be used in its own scope, since the same variable is used in
   every invocation.

   The tokenized string uses the specified :ref:`tokenization domain
   <module-pw_tokenizer-domains>`. Use ``PW_TOKENIZER_DEFAULT_DOMAIN`` for the
   default. The token also may be masked; use ``UINT32_MAX`` to keep all bits.

.. c:macro:: PW_TOKENIZER_ARG_TYPES(...)

   Converts a series of arguments to a compact format that replaces the format
   string literal.

Use these two macros within the custom tokenization macro to call a function
that does the encoding. The following example implements a custom tokenization
macro for use with :ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(pw_tokenizer_Payload metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
     do {                                                                 \
       PW_TOKENIZE_FORMAT_STRING(                                        \
           PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(metadata,                                   \
                              _pw_tokenizer_token,                        \
                              PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
                                  PW_COMMA_ARGS(__VA_ARGS__));            \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
``pw::tokenizer::EncodedMessage`` class or ``pw::tokenizer::EncodeArgs``
function from ``pw_tokenizer/encode_args.h``. The encoded message can then be
transmitted or stored as needed.

.. code-block:: cpp

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               std::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const pw_tokenizer_Payload metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: When to use a custom macro

   Use existing tokenization macros whenever possible. A custom macro may be
   needed to support use cases like the following:

   * Variations of ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` that take
     different arguments.
   * Supporting global handler macros that use different handler functions.

Binary logging with pw_tokenizer
--------------------------------
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

   #define LOG_INFO(format, ...) \
       RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

   void RecordLog(LogLevel level, const char* file, int line, const char* format,
                  ...) {
     if (level < current_log_level) {
       return;
     }

     int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

     va_list args;
     va_start(args, format);
     bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
     va_end(args);

     TransmitLog(TimeSinceBootMillis(), buffer, bytes);
   }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

   #define LOG_INFO(format, ...)                   \
       PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
           (pw_tokenizer_Payload)LogLevel_INFO,    \
           __FILE_NAME__ ":%d " format,            \
           __LINE__,                               \
           __VA_ARGS__)

   extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
     if (static_cast<LogLevel>(level) >= current_log_level) {
       TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
     }
   }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Tokenizing function names
-------------------------
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

   // Tokenize the function name variables to a handler function.
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

This function requires that a string's token already be calculated. Typically,
these tokens are provided by a database, but they can be manually created using
the tokenizer hash.

.. autofunction:: pw_tokenizer.tokens.pw_tokenizer_65599_hash

This is particularly useful for offline token database generation in cases where
tokenized strings in a binary cannot be embedded as parsable pw_tokenizer
entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both, and merge the databases.

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

 * **Integers** (1--10 bytes) --
   `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
   similarly to Protocol Buffers. Smaller values take fewer bytes.
 * **Floating point numbers** (4 bytes) -- Single precision floating point.
 * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
   The top bit of the length byte indicates whether the string was truncated.
   The remaining 7 bits encode the string length, with a maximum of 127
   bytes.

.. TODO: insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
   short or avoid encoding them as strings (e.g. encode an enum as an integer
   instead of a string). See also `Tokenized strings as %s arguments`_.

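The encoding rules above can be sketched in Python. This is an illustrative
re-implementation, not the ``pw_tokenizer`` API: the ``zigzag``, ``varint``,
and ``encode`` helper names are made up for this example, and 64-bit ZigZag is
assumed for integers, as in Protocol Buffers.

```python
import struct

def zigzag(value):
    """Map a signed 64-bit integer to unsigned, as in protobuf ZigZag."""
    return (value << 1) ^ (value >> 63)

def varint(value):
    """Encode an unsigned integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def encode(token, *args):
    """Encode a token and its arguments per the rules described above."""
    encoded = struct.pack('<I', token)  # 32-bit token, little-endian
    for arg in args:
        if isinstance(arg, float):
            encoded += struct.pack('<f', arg)  # single precision
        elif isinstance(arg, int):
            encoded += varint(zigzag(arg))
        else:  # strings: length byte (top bit set if truncated), contents
            data = arg.encode()[:127]
            truncated = 0x80 if len(data) < len(arg.encode()) else 0
            encoded += bytes([truncated | len(data)]) + data
    return encoded

# The log example above: token 0x8e4728d9 with "CHARGING" and 3989.
payload = encode(0x8e4728d9, "CHARGING", 3989)
print(payload.hex(' '))  # d9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e
```

Note how ``3989`` ZigZags to 7978 and then varint-encodes to the two bytes
``aa 3e`` shown in the example table.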
Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

C++ macros use a constexpr function instead of a macro. This function works with
any length of string and has lower compilation time impact than the C macros.
For consistency, C++ tokenization uses the same hash algorithm, but the
calculated values will differ between C and C++ for strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

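For illustration, the hash can be sketched in Python. This sketch assumes the
algorithm matches the description above (x65599 coefficients, seeded with the
string length, modulo 2\ :sup:`32`); the real implementations are the C macros,
the C++ ``constexpr`` function, and ``pw_tokenizer.tokens`` in Python.

```python
def pw_tokenizer_65599_hash(string, hash_length=None):
    """Hash a string as described above: x65599, seeded with the length.

    hash_length limits how many characters are hashed, mimicking C's
    PW_TOKENIZER_CFG_C_HASH_LENGTH; None hashes every character (as in C++).
    """
    hash_value = len(string)  # The full length seeds the hash, even if truncated.
    coefficient = 65599
    for char in string[:hash_length]:
        hash_value = (hash_value + coefficient * ord(char)) % 2**32
        coefficient = (coefficient * 65599) % 2**32
    return hash_value

# The 31-byte example string from the Encoding section hashes to 0xdac9a244.
token = pw_tokenizer_65599_hash('You can go about your business.')
print(f'{token:08x}')  # dac9a244
```

Truncating with ``hash_length`` changes the result, which is exactly why C and
C++ tokens can differ for long strings.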
.. _module-pw_tokenizer-domains:

Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

Smaller tokens with masking
---------------------------
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros.
For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
`token collisions`_.

Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

 - whether the tokenized data matches the string's arguments (if any), and
 - if / when the string was marked as having been removed from the database.

Working with collisions
^^^^^^^^^^^^^^^^^^^^^^^
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

 - Change one of the colliding strings slightly to give it a new token.
 - In C (not C++), artificial collisions may occur if strings longer than
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
   consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
   See ``pw_tokenizer/public/pw_tokenizer/config.h``.
 - Run the ``mark_removed`` command with the latest version of the build
   artifacts to mark missing strings as removed. This deprioritizes them in
   collision resolution.

   .. code-block:: sh

      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

   The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
such as module names, but won't be suitable for large sets of strings, like log
messages.

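These counts follow from the standard birthday-problem approximation. A small
sketch (the helper name is hypothetical, chosen for this example) approximately
reproduces the table:

```python
import math

def strings_until_collision(token_bits, probability):
    """Approximate string count giving the requested collision probability.

    Uses the birthday-problem approximation
    n = sqrt(2 * 2**bits * ln(1 / (1 - p))) for a uniform random hash.
    """
    return round(math.sqrt(2 * 2**token_bits * math.log(1 / (1 - probability))))

# Approximately reproduces the 32-bit row of the table above.
print(strings_until_collision(32, 0.50))  # about 77000
print(strings_until_collision(32, 0.01))  # about 9300
```

The quadratic dependence on string count is why dropping from 32 to 16 token
bits shrinks the safe string count by a factor of roughly 256, not 2.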
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5, ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a, ,"Hello %s! %hd %e"
   851beeb6, ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

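Because the format is plain CSV, a database is easy to inspect with standard
tooling. This hypothetical sketch parses the example database above with
Python's ``csv`` module (the real ``pw_tokenizer`` Python package provides
richer database classes; this only illustrates the file format):

```python
import csv
import io

EXAMPLE_DATABASE = '''\
141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
7b940e2a, ,"Hello %s! %hd %e"
851beeb6, ,"%u %d"
881436a0,2020-01-01,"The answer is: %s"
e13b0f94,2020-04-01,"%llu"
'''

entries = {}
for token, date, string in csv.reader(io.StringIO(EXAMPLE_DATABASE)):
    # The token is hexadecimal; a blank (whitespace-only) removal date means
    # the string is still in use. Doubled quotes are unescaped by csv.reader.
    entries[int(token, 16)] = (date.strip() or None, string)

print(entries[0x2e668cd6])  # ('2019-12-25', 'Jello, world!')
```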
Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e TOKENS..
   0x08: 00000006 00000000 ........

   [entries]
   0x10: 141c35d5 ffffffff .5......
   0x18: 2e668cd6 07e30c19 ..f.....
   0x20: 7b940e2a ffffffff *..{....
   0x28: 851beeb6 ffffffff ........
   0x30: 881436a0 07e40101 .6......
   0x38: e13b0f94 07e40401 ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00          is: %s.%llu.

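The fixed-size layout makes the entry table easy to walk with ``struct``. This
sketch decodes the header and the first two entries of the dump above; the
byte-level interpretation (little-endian fields, year/month/day packed into the
date word) is inferred from the example rather than taken from the official
parser, so treat it as illustrative:

```python
import struct

# The [header] and the first two [entries] from the hex dump above.
DATA = bytes.fromhex(
    '544f4b454e530000'       # magic: "TOKENS\0\0"
    '0600000000000000'       # entry count: 6
    'd5351c14' 'ffffffff'    # 141c35d5, no removal date
    'd68c662e' '190ce307'    # 2e668cd6, removed 2019-12-25
)

magic, entry_count = struct.unpack_from('<8sI', DATA, 0)

entries = []
for i in range(2):  # Only the first two entries are included in DATA.
    token, date = struct.unpack_from('<II', DATA, 16 + 8 * i)
    if date == 0xFFFFFFFF:  # 0xFFFFFFFF means no removal date.
        entries.append((token, None))
    else:  # Year in the upper 16 bits, then one byte each for month and day.
        entries.append((token, (date >> 16, (date >> 8) & 0xFF, date & 0xFF)))

print(magic, entry_count, entries[1][1])
```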
JSON support
------------
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently only supports binary databases.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps`` or build other
   GN targets first if this is a concern.

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that wish
   to interleave tokenized with plain text, using Base64 is a worthwhile
   tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(result)

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.

For messages that are optionally tokenized and may be encoded as binary,
Base64, or plaintext UTF-8, use
:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
determine the correct method to detokenize and always provide a printable
string. For more information on this feature, see
:ref:`module-pw_tokenizer-proto`.

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

Protocol buffers
----------------
``pw_tokenizer`` provides utilities for handling tokenized fields in protobufs.
See :ref:`module-pw_tokenizer-proto` for details.

.. toctree::
   :hidden:

   proto.rst

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

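The 5-byte binary message above can be reproduced in a few lines. This sketch
(plain Python, assuming the default little-endian 32-bit token and
zigzag-varint integer argument encoding, which matches the example bytes)
builds the prefixed Base64 message:

```python
import base64
import struct

def varint_encode(value: int) -> bytes:
    """Zigzag-encode a signed integer, then emit it as a 7-bit varint."""
    zigzag = (value << 1) ^ (value >> 63)  # Arithmetic shift sign-extends.
    out = bytearray()
    while zigzag >= 0x80:
        out.append(0x80 | (zigzag & 0x7F))
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)

def prefixed_base64(token: int, *int_args: int) -> str:
    binary = struct.pack('<I', token) + b''.join(map(varint_encode, int_args))
    return '$' + base64.b64encode(binary).decode()

# "This is an example: %d!" has token 0x4b016e66; the argument is -1.
print(prefixed_base64(0x4B016E66, -1))  # $Zm4BSwE=
```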
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes) {
     char base64_buffer[64];
     size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes), base64_buffer);

     TransmitLogMessage(base64_buffer, base64_size);
   }

Decoding
--------
The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
Base64 messages with ``detokenize_base64`` and related methods.

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.

.. code-block:: cpp

   void ProcessBase64Message(std::string_view base64_message) {
     // Decode the prefixed Base64 text back to binary tokenized data.
     std::byte binary_buffer[64];
     size_t binary_size = pw::tokenizer::PrefixedBase64Decode(
         base64_message, binary_buffer);

     HandleBinaryMessage(pw::span(binary_buffer, binary_size));
   }

Investigating undecoded messages
--------------------------------
Tokenized messages cannot be decoded if the token is not recognized. The Python
package includes the ``parse_message`` tool, which parses tokenized Base64
messages without looking up the token in a database. This tool attempts to guess
the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable
message. It could help identify which statement in the code produced the
message. This tool is not particularly helpful for tokenized messages without
arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the
``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
see full usage information.

Example
^^^^^^^
.. code-block::

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token:  0x33496fdf
   INF Args:   b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%s]
   INF   Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token:  0xe7a58492
   INF Args:   b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF   Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

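The guessing strategy can be approximated in a few lines. This sketch
(illustrative, not the real ``parse_message`` implementation) decodes the
argument payload from the second example above against one candidate format,
reading ``%d`` as a zigzag varint and ``%s`` as a length-prefixed string:

```python
def decode_args(payload: bytes, spec: list) -> list:
    """Decode payload bytes following a candidate list of '%d'/'%s' specs."""
    args = []
    for conversion in spec:
        if conversion == '%s':
            length = payload[0]  # One length byte, then the characters.
            args.append(payload[1:1 + length].decode())
            payload = payload[1 + length:]
        else:  # '%d': little-endian 7-bit varint, then zigzag decode.
            value = shift = 0
            while True:
                byte = payload[0]
                payload = payload[1:]
                value |= (byte & 0x7F) << shift
                shift += 7
                if not byte & 0x80:
                    break
            args.append((value >> 1) ^ -(value & 1))
    return args

# The argument bytes from the second example message above.
payload = b'n\x13FAILED_PRECONDITION\x02OK'
print(decode_args(payload, ['%d', '%s', '%s']))  # [55, 'FAILED_PRECONDITION', 'OK']
```

The tool simply tries many candidate spec lists like this one and reports the
attempts that consume the payload cleanly.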
Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
 * Log contents shrunk by over 50%, even with Base64 encoding.

   * Significant size savings for encoded logs, even using the less-efficient
     Base64 encoding required for compatibility with the existing log system.
   * Freed valuable communication bandwidth.
   * Allowed storing many more logs in crash dumps.

 * Substantial flash savings.

   * Reduced the size of firmware images by up to 18%.

 * Simpler logging code.

   * Removed CPU-heavy ``snprintf`` calls.
   * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
 * In the project's logging macro, calls to the underlying logging function
   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
   invocation.
 * The log level was passed as the payload argument to facilitate runtime log
   level control.
 * For this project, it was necessary to encode the log messages as text. In
   ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
   messages.
 * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
   numbers are encoded in the log metadata. Line numbers may also be included
   by adding ``"%d"`` to the format string and passing ``__LINE__``.

Database management
-------------------
 * The token database was stored as a CSV file in the project's Git repo.
 * The token database was automatically updated as part of the build, and
   developers were expected to check in the database changes alongside their
   code changes.
 * A presubmit check verified that all strings added by a change were added to
   the token database.
 * The token database included logs and asserts for all firmware images in the
   project.
 * No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source database. If
   the database is in-source, make sure there is a simple script to resolve any
   merge conflicts. The script could either keep both sets of lines or discard
   local changes and regenerate the database.
1083
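A conflict-resolution script along the lines suggested in the tip can be very
small. This hypothetical sketch (written for the CSV format described above)
keeps the union of both sides' entries and rewrites the file deterministically:

```python
import re

def resolve_conflicts(csv_text: str) -> str:
    """Keep both sides of every Git conflict, then sort and deduplicate."""
    # Drop the <<<<<<<, =======, and >>>>>>> conflict markers, keeping the
    # database lines from both versions of the file.
    lines = [line for line in csv_text.splitlines()
             if not re.match(r'^(<{7}|={7}|>{7})', line)]
    return '\n'.join(sorted(set(filter(None, lines)))) + '\n'

conflicted = '''\
141c35d5, ,"The answer: ""%s"""
<<<<<<< HEAD
851beeb6, ,"%u %d"
=======
e13b0f94, ,"%llu"
>>>>>>> other-branch
'''
print(resolve_conflicts(conflicted))
```

Since the database is append-mostly and keyed by token, keeping both sides is
safe; the build's next database update reconciles any remaining differences.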
Decoding tooling deployment
---------------------------
 * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

   * Product-specific Python command line tools, using
     ``pw_tokenizer.Detokenizer``.
   * Standalone script for decoding prefixed Base64 tokens in files or
     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
     interface.

 * The C++ detokenizer library was deployed to two Android apps with a Java
   Native Interface (JNI) layer.

   * The binary token database was included as a raw resource in the APK.
   * In one app, the built-in token database could be overridden by copying a
     file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

 1. Tokenized strings will not be discovered by the token database tools.
 2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer % argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would re-calculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.

Compatibility
=============
 * C11
 * C++14
 * Python 3

Dependencies
============
 * ``pw_varint`` module
 * ``pw_preprocessor`` module
 * ``pw_span`` module