.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

  * Dramatically reduce binary size by removing string literals from binaries.
  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
    tokens instead of strings. We've seen over 50% reduction in encoded log
    contents.
  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  * Remove potentially sensitive log, assert, and other strings from binaries.

Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

  * **Tokenization** converts string literals in the source code to
    binary tokens at compile time. If the string has printf-style arguments,
    these are encoded to compact binary form at runtime.
  * **Detokenization** converts tokenized strings back to the original
    human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

  1. During compilation, the ``pw_tokenizer`` module hashes string literals to
     generate stable 32-bit tokens.
  2. The tokenization macro removes these strings by declaring them in an ELF
     section that is excluded from the final binary.
  3. After compilation, strings are extracted from the ELF to build a database
     of tokenized strings for use by the detokenizer. The ELF file may also be
     used directly.
  4. During operation, the device encodes the string token and its arguments, if
     any.
  5. The encoded tokenized strings are sent off-device or stored.
  6. Off-device, the detokenizer tools use the token database to decode the
     strings to human-readable form.

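The device and host halves of this flow can be sketched in a few lines of
Python. This is an illustrative sketch, not the module's implementation: the
database here is a plain dict standing in for a real token database, and the
token value and format string are the ones shown for the battery log in the
example that follows (arguments are omitted for simplicity).

```python
import struct

# Stand-in token database (step 3): maps 32-bit tokens to format strings.
TOKEN_DB = {0x8E4728D9: 'Battery state: %s; battery voltage: %d mV'}

def encode_token(token: int) -> bytes:
    """Step 4, without arguments: the device emits the token, little-endian."""
    return struct.pack('<I', token)

def detokenize(encoded: bytes) -> str:
    """Step 6: look up the leading 32-bit token in the database."""
    token, = struct.unpack_from('<I', encoded)
    return TOKEN_DB.get(token, f'<unknown token {token:08x}>')
```

With this sketch, the four bytes ``d9 28 47 8e`` decode back to the battery
format string, and an unknown token degrades gracefully to a placeholder.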
Example: tokenized logging
--------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
section describes one way ``pw_tokenizer`` might be integrated with a project.
These steps can be adapted as needed.

  1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
     are provided. For Make or other build systems, add the files specified in
     the BUILD.gn's ``pw_tokenizer`` target to the build.
  2. Use the tokenization macros in your code. See `Tokenization`_.
  3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
     linker script. In GN and CMake, this step is done automatically.
  4. Compile your code to produce an ELF file.
  5. Run ``database.py create`` on the ELF file to generate a CSV token
     database. See `Managing token databases`_.
  6. Commit the token database to your repository. See notes in `Database
     management`_.
  7. Integrate a ``database.py add`` command to your build to automatically
     update the committed token database. In GN, use the
     ``pw_tokenizer_database`` template to do this. See `Update a database`_.
  8. Integrate ``detokenize.py`` or the C++ detokenization library with your
     tools to decode tokenized logs. See `Detokenization`_.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
  %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                         size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``uintptr_t`` argument to the global handler function. Values like a log level
can be packed into the ``uintptr_t``.

This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
facade. The backend for this facade must define the
``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             arguments...);

  void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);

.. admonition:: When to use these macros

  Use anytime a global handler is sufficient, particularly for widely expanded
  macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` or
  ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
  for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
  use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
  widely expanded macros, such as a logging macro, because it will result in
  larger code size than its alternatives.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[256];  // Illustrative fixed-size formatting buffer.
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (pw_tokenizer_Payload)LogLevel_INFO,    \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          __VA_ARGS__)

  extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(level) >= current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Tokenizing function names
-------------------------
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

  // Tokenize the special function name variables.
  constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
  constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

  // Tokenize the function name variables to a handler function.
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__)
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__)

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

  * **Integers** (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
    similarly to Protocol Buffers. Smaller values take fewer bytes.
  * **Floating point numbers** (4 bytes) -- Single precision floating point.
  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length byte indicates whether the string was truncated or
    not. The remaining 7 bits encode the string length, with a maximum of 127
    bytes.

.. TODO: insert diagram here!

.. tip::
  ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
  short or avoid encoding them as strings (e.g. encode an enum as an integer
  instead of a string). See also `Tokenized strings as %s arguments`_.

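These rules can be modeled directly in Python. The sketch below is illustrative
(it is not ``pw_tokenizer.encode``), but it reproduces the 15-byte encoded
message shown in the tokenized logging example: token ``d9 28 47 8e``, the
string ``"CHARGING"`` with its length byte, and ``3989`` as a ZigZag varint.

```python
import struct

def _zigzag(value: int) -> int:
    # ZigZag: map signed integers to unsigned so small magnitudes stay small.
    return (value << 1) ^ (value >> 63)

def _varint(value: int) -> bytes:
    # Little-endian base-128 varint, as in Protocol Buffers.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode(token: int, *args) -> bytes:
    """Encode a token and its arguments per the rules above (sketch)."""
    data = bytearray(struct.pack('<I', token))  # 32-bit little-endian token
    for arg in args:
        if isinstance(arg, float):
            data += struct.pack('<f', arg)      # single-precision float
        elif isinstance(arg, int):
            data += _varint(_zigzag(arg))       # ZigZag + varint integer
        else:                                   # string: length byte + contents
            raw = arg.encode()
            truncated = 0x80 if len(raw) > 127 else 0
            raw = raw[:127]
            data += bytes([truncated | len(raw)]) + raw
    return bytes(data)
```

For example, ``encode(0x8E4728D9, 'CHARGING', 3989)`` yields the same 15 bytes
shown in the table earlier.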
Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, hashing is done with a ``constexpr`` function instead of a macro. This
function works with any length of string and has a lower compilation time impact
than the C macros. For consistency, C++ tokenization uses the same hash
algorithm, but the calculated values will differ between C and C++ for strings
longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

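The modified x65599 hash is easy to model in Python. This sketch is
illustrative rather than the module's reference implementation: it assumes the
hash starts from the string length and multiplies each character by an
increasing power of 65599, truncating to 32 bits. Passing a ``hash_length``
mimics the fixed-length C macro, which is why C and C++ values can diverge for
long strings.

```python
def hash_65599(string, hash_length=None):
    """Sketch of a modified x65599 hash over the first hash_length characters."""
    hash_value = len(string) & 0xFFFFFFFF  # seed with the string length
    coefficient = 65599
    for char in string[:hash_length]:
        hash_value = (hash_value + ord(char) * coefficient) & 0xFFFFFFFF
        coefficient = (coefficient * 65599) & 0xFFFFFFFF
    return hash_value
```

Strings shorter than ``hash_length`` hash identically with or without the
limit, which is why only long strings produce different C and C++ tokens.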
Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

  // Tokenizes this string to the default ("") domain.
  PW_TOKENIZE_STRING("Hello, world!");

  // Tokenizes this string to the "my_custom_domain" domain.
  PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

  ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

Smaller tokens with masking
---------------------------
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros.
For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
  uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
`token collisions`_.

Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

  - whether the tokenized data matches the string's arguments (if any), and
  - if / when the string was marked as having been removed from the database.

Working with collisions
^^^^^^^^^^^^^^^^^^^^^^^
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

  - Change one of the colliding strings slightly to give it a new token.
  - In C (not C++), artificial collisions may occur if strings longer than
    ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
    consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
    See ``pw_tokenizer/public/pw_tokenizer/config.h``.
  - Run the ``mark_removed`` command with the latest version of the build
    artifacts to mark missing strings as removed. This deprioritizes them in
    collision resolution.

    .. code-block:: sh

      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

    The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
such as module names, but won't be suitable for large sets of strings, like log
messages.

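The table's values follow from the standard birthday-problem approximation
``n ≈ sqrt(2 · 2^bits · ln(1 / (1 - p)))``. This small helper is illustrative
(not part of ``pw_tokenizer``) and reproduces the table's rounded figures:

```python
import math

def strings_for_collision_probability(token_bits: int, probability: float) -> int:
    """Approximate string count at which a collision occurs with probability p."""
    buckets = 2 ** token_bits  # number of possible token values
    return round(math.sqrt(2 * buckets * math.log(1 / (1 - probability))))
```

For instance, 32-bit tokens reach a 50% collision probability at roughly 77000
strings, matching the first row of the table.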
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

  141c35d5,          ,"The answer: ""%s"""
  2e668cd6,2019-12-25,"Jello, world!"
  7b940e2a,          ,"Hello %s! %hd %e"
  851beeb6,          ,"%u %d"
  881436a0,2020-01-01,"The answer is: %s"
  e13b0f94,2020-04-01,"%llu"

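Because this is ordinary CSV with doubled quote characters, Python's standard
``csv`` module reads it directly. This parsing sketch is illustrative (it is
not part of the ``pw_tokenizer`` API) and uses a few rows from the example
database above:

```python
import csv
import io

EXAMPLE_DB = '''141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
881436a0,2020-01-01,"The answer is: %s"
'''

def parse_csv_database(text):
    """Return {token: (removal_date or None, string)} from CSV database text."""
    database = {}
    for token, date, string in csv.reader(io.StringIO(text)):
        # A blank/whitespace date field means the string has not been removed.
        database[int(token, 16)] = (date.strip() or None, string)
    return database
```

Note how ``csv.reader`` unescapes the doubled quotes, so ``"The answer:
""%s"""`` comes back as ``The answer: "%s"``.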
Binary database format
----------------------
The binary database format is comprised of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

  [header]
  0x00: 454b4f54 0000534e  TOKENS..
  0x08: 00000006 00000000  ........

  [entries]
  0x10: 141c35d5 ffffffff  .5......
  0x18: 2e668cd6 07e30c19  ..f.....
  0x20: 7b940e2a ffffffff  *..{....
  0x28: 851beeb6 ffffffff  ........
  0x30: 881436a0 07e40101  .6......
  0x38: e13b0f94 07e40401  ..;.....

  [string table]
  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.

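A quick way to internalize this layout is to parse it with ``struct``. The
sketch below is illustrative only and assumes the layout visible in the dump
above: a ``TOKENS`` magic, the entry count at offset 8, 8-byte little-endian
token/date entries starting at offset 16, then null-terminated strings. It also
assumes the removal-date word packs the year, month, and day into its hex
fields (0x07e30c19 reads as 2019-12-25); the authoritative definition lives in
``token_database.h``.

```python
import struct

def parse_binary_database(data: bytes):
    """Parse the binary token database layout shown above (sketch)."""
    if not data.startswith(b'TOKENS'):
        raise ValueError('bad magic')
    (entry_count,) = struct.unpack_from('<I', data, 8)
    entries = []
    for index in range(entry_count):
        token, date = struct.unpack_from('<II', data, 16 + 8 * index)
        # 0xFFFFFFFF means no removal date; otherwise decode year/month/day.
        entries.append((token, None if date == 0xFFFFFFFF else
                        (date >> 16, (date >> 8) & 0xFF, date & 0xFF)))
    strings = data[16 + 8 * entry_count:].split(b'\x00')
    return [(token, date, text.decode())
            for (token, date), text in zip(entries, strings)]

# Build a two-entry database matching the first rows of the dump above.
blob = (b'TOKENS\x00\x00' + struct.pack('<II', 2, 0)
        + struct.pack('<II', 0x141C35D5, 0xFFFFFFFF)
        + struct.pack('<II', 0x2E668CD6, 0x07E30C19)
        + b'The answer: "%s"\x00Jello, world!\x00')
```

Running the parser over ``blob`` recovers the tokens, removal dates, and
strings from the CSV example.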
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library only supports binary databases currently.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

  import("//build_overrides/pigweed.gni")

  import("$dir_pw_tokenizer/database.gni")

  pw_tokenizer_database("my_database") {
    database = "database_in_the_source_tree.csv"
    targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
    input_databases = [ "other_database.csv" ]
  }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

  pw_tokenizer_database("my_database") {
    database = "database_in_the_source_tree.csv"
    deps = [ ":apps" ]
    paths = [ "$root_build_dir/**/*.elf" ]
  }

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
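
The lookup step can be sketched in plain Python. This is a hypothetical,
standalone illustration built from the example database above, not the actual
``pw_tokenizer`` implementation: the token is read from the first four bytes of
the Base64-decoded payload (little-endian) and looked up in the database.

.. code-block:: python

   import base64
   import csv
   import io

   # The example database from above, in CSV form.
   DATABASE = '''\
   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
   '''

   # Map each token (the first CSV column, in hex) to its format string.
   TOKENS = {
       int(row[0], 16): row[2] for row in csv.reader(io.StringIO(DATABASE))
   }

   def lookup(message):
       # Strip the '$' prefix and decode; the token is the first four bytes.
       payload = base64.b64decode(message[1:])
       return TOKENS[int.from_bytes(payload[:4], 'little')]

   print(lookup('$HL2VHA=='))  # Initiating retrieval process for recovery object

Arguments (such as the ``%s`` status in the last log) follow the token in the
payload; decoding them is the detokenizer's job and is omitted here.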

.. note::
   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that wish
   to interleave tokenized messages with plain text, Base64 is a worthwhile
   tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(result)

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
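
The reloading behavior can be sketched in a few lines of plain Python. This is
a hypothetical, simplified illustration of the idea, not the actual
``AutoUpdatingDetokenizer`` implementation: the database file's modification
time is checked before each lookup, and the file is re-read when it changes.

.. code-block:: python

   import os

   class AutoReloadingLookup:
       """Reloads a CSV token database whenever the file's mtime changes."""

       def __init__(self, path):
           self._path = path
           self._mtime = None
           self._tokens = {}
           self._maybe_reload()

       def _maybe_reload(self):
           mtime = os.path.getmtime(self._path)
           if mtime != self._mtime:
               self._mtime = mtime
               with open(self._path) as file:
                   # Each line: hex token, domain, quoted format string.
                   self._tokens = {
                       int(token, 16): string.strip().strip('"')
                       for token, _, string in (
                           line.split(',', 2) for line in file if line.strip())
                   }

       def lookup(self, token):
           self._maybe_reload()
           return self._tokens.get(token, '<unknown token>')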

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary-format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
the data is invalid, ``TokenDatabase::Create`` returns an empty database for
which ``ok()`` returns false. If the token database is included in the source
code, this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

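The encoding above can be reproduced with a short Python sketch. This is a
hypothetical illustration of the wire format, not the ``pw_tokenizer``
implementation: the 32-bit token is written little-endian, each integer
argument is ZigZag-mapped and varint-encoded, and the result is Base64-encoded
with a ``$`` prefix.

.. code-block:: python

   import base64

   def zigzag(value):
       # Map a signed integer to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
       return (value << 1) ^ (value >> 63)

   def varint(value):
       # Standard little-endian base-128 varint encoding.
       out = bytearray()
       while True:
           byte = value & 0x7F
           value >>= 7
           out.append(byte | 0x80 if value else byte)
           if not value:
               return bytes(out)

   def prefixed_base64(token, *int_args):
       payload = token.to_bytes(4, 'little')
       for arg in int_args:
           payload += varint(zigzag(arg))
       return '$' + base64.b64encode(payload).decode()

   print(prefixed_base64(0x4b016e66, -1))  # $Zm4BSwE=

Note how the five payload bytes (``66 6e 01 4b 01``) match the binary line in
the example: four token bytes, then the single varint byte for -1.
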
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes) {
     char base64_buffer[64];
     size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes), base64_buffer);

     TransmitLogMessage(base64_buffer, base64_size);
   }

Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
the ``detokenize_base64`` and related functions.

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.
Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
 * Log contents shrunk by over 50%, even with Base64 encoding.

   * Significant size savings for encoded logs, even using the less-efficient
     Base64 encoding required for compatibility with the existing log system.
   * Freed valuable communication bandwidth.
   * Allowed storing many more logs in crash dumps.

 * Substantial flash savings.

   * Reduced the size of firmware images by up to 18%.

 * Simpler logging code.

   * Removed CPU-heavy ``snprintf`` calls.
   * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
 * In the project's logging macro, calls to the underlying logging function
   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
   invocation.
 * The log level was passed as the payload argument to facilitate runtime log
   level control.
 * For this project, it was necessary to encode the log messages as text. In
   ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
   messages.
 * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If line numbers are desired in a tokenized
   string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
 * The token database was stored as a CSV file in the project's Git repo.
 * The token database was automatically updated as part of the build, and
   developers were expected to check in the database changes alongside their
   code changes.
 * A presubmit check verified that all strings added by a change were added to
   the token database.
 * The token database included logs and asserts for all firmware images in the
   project.
 * No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source database. If
   the database is in-source, make sure there is a simple script to resolve any
   merge conflicts. The script could either keep both sets of lines or discard
   local changes and regenerate the database.

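The "keep both sets of lines" strategy from the tip can be sketched in a few
lines of Python. This is a hypothetical helper, not part of ``pw_tokenizer``;
it assumes duplicate entries are harmless in a CSV token database:

.. code-block:: python

   def resolve_conflict(ours, theirs):
       # Keep the union of both sides' entries, sorted for a stable result.
       entries = set(ours.splitlines()) | set(theirs.splitlines())
       return '\n'.join(sorted(entries)) + '\n'
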
Decoding tooling deployment
---------------------------
 * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

   * Product-specific Python command line tools, using
     ``pw_tokenizer.Detokenizer``.
   * A standalone script for decoding prefixed Base64 tokens in files or
     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
     interface.

 * The C++ detokenizer library was deployed to two Android apps with a Java
   Native Interface (JNI) layer.

   * The binary token database was included as a raw resource in the APK.
   * In one app, the built-in token database could be overridden by copying a
     file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

 1. Tokenized strings will not be discovered by the token database tools.
 2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this bug.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support a
small number of arguments.

Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would re-calculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.

Compatibility
=============
 * C11
 * C++11
 * Python 3

Dependencies
============
 * ``pw_varint`` module
 * ``pw_preprocessor`` module
 * ``pw_span`` module