.. _chapter-tokenizer:

.. default-domain:: cpp

.. highlight:: sh

------------
pw_tokenizer
------------
The tokenizer module provides facilities for converting strings to binary
tokens. String literals are replaced with integer tokens in the firmware image,
which can be decoded off-device to restore the original string. Strings may be
printf-style format strings and include arguments, such as ``"My name is %s"``.
Arguments are encoded into compact binary form at runtime.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of the tokenizer module is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings.

**Why tokenize strings?**

 * Dramatically reduce binary size by removing string literals from binaries.
 * Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization
   code.
 * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
   tokens instead of strings.
 * Remove potentially sensitive log, assert, and other strings from binaries.

Basic operation
===============
 1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
 2. The tokenization macro removes the string literal by placing it in an ELF
    section that is excluded from the final binary.
 3. Strings are extracted from an ELF to build a database of tokenized strings
    for use by the detokenizer. The ELF file may also be used directly.
 4. During operation, the device encodes the string token and its arguments, if
    any.
 5. The encoded tokenized strings are sent off-device or stored.
 6. Off-device, the detokenizer tools use the token database or ELF files to
    detokenize the strings to human-readable form.
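
For a concrete flavor of steps 1, 4, and 5, the sketch below tokenizes a
literal with no arguments and sends the raw token bytes. ``SendToHost`` is a
hypothetical transport function; the ``PW_TOKENIZE_TO_*`` macros described
below handle the general case, including argument encoding.

.. code-block:: cpp

  #include <cstdint>
  #include <cstring>

  #include "pw_tokenizer/tokenize.h"

  void SendToHost(const uint8_t* data, size_t size);  // Hypothetical transport.

  // Step 1: the literal is hashed to a token at compile time; the string itself
  // is placed in an ELF section that is excluded from the final image (step 2).
  constexpr uint32_t kGreetingToken = PW_TOKENIZE_STRING("Hello, pw_tokenizer!");

  void SendGreeting() {
    // Steps 4-5: with no arguments, the encoded message is just the 4-byte
    // token. This assumes a little-endian target, matching the encoding
    // described in the Base64 format section.
    uint8_t encoded[sizeof(kGreetingToken)];
    std::memcpy(encoded, &kGreetingToken, sizeof(encoded));
    SendToHost(encoded, sizeof(encoded));
  }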

Module usage
============
There are two sides to tokenization: tokenizing strings in the source code and
detokenizing these strings back to human-readable form.

Tokenization
------------
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off-device or stored in place of a full string.

Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke a ``PW_TOKENIZE_`` macro.

To tokenize a string literal, invoke ``PW_TOKENIZE_STRING``. This macro returns
a ``uint32_t`` token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

Format strings are tokenized into a fixed-size buffer. The buffer contains the
``uint32_t`` token followed by the encoded form of the arguments, if any. The
most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes to
a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, args...);

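For example, a concrete call might look like the following sketch, in which
``SendPacket`` stands in for a hypothetical project-provided transport:

.. code-block:: cpp

  void LogBatteryLevel(int percent) {
    uint8_t buffer[64];
    size_t size_bytes = sizeof(buffer);

    // On return, size_bytes holds the number of bytes actually written: the
    // 4-byte token plus the varint-encoded argument.
    PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, "Battery at %d%%", percent);

    SendPacket(buffer, size_bytes);  // Hypothetical transport function.
  }
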
While ``PW_TOKENIZE_TO_BUFFER`` is flexible, its per-use code size overhead is
larger than that of the alternatives. ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a
buffer on the stack and calls a ``void(const uint8_t* buffer, size_t buffer_size)``
callback that is provided at the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arg);

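For example, ``HandlerFunction`` might simply forward the encoded message to the
host. A minimal sketch, assuming a hypothetical ``UartWrite`` function:

.. code-block:: cpp

  void HandlerFunction(const uint8_t* buffer, size_t buffer_size) {
    // Forward the encoded token and arguments for off-device detokenization.
    UartWrite(buffer, buffer_size);  // UartWrite is hypothetical.
  }
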
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
since it takes the fewest arguments. Like the callback form, it encodes to a
buffer on the stack. It then calls the C-linkage function
``pw_TokenizerHandleEncodedMessage``, which must be defined by the project.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes);

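A minimal handler might queue the message for a transmit task. This is a sketch;
``EnqueueForTransmit`` is a hypothetical project-provided function:

.. code-block:: cpp

  extern "C" void pw_TokenizerHandleEncodedMessage(
      const uint8_t encoded_message[], size_t size_bytes) {
    // Queue the encoded message so a low-priority task can transmit it.
    EnqueueForTransmit(encoded_message, size_bytes);  // Hypothetical.
  }
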
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``void*`` argument to the global handler function. This can be used to pass a
log level or other metadata along with the tokenized string.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             args...);

  void pw_TokenizerHandleEncodedMessageWithPayload(void* payload,
                                                   const uint8_t encoded_message[],
                                                   size_t size_bytes);

.. tip::
  ``%s`` arguments are inefficient to encode and can quickly fill a tokenization
  buffer. Keep ``%s`` arguments short or avoid encoding them as strings if
  possible. See `Tokenized strings as %s arguments`_.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[256];  // Fixed-size buffer for the formatted message.
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (void*)LogLevel_INFO,                   \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          __VA_ARGS__)

  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
      void* level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(reinterpret_cast<intptr_t>(level)) >=
        current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Database management
^^^^^^^^^^^^^^^^^^^
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Creating and maintaining a token database is simple. Token databases are managed
with the ``database.py`` script. The ``create`` command makes a new token
database from ELF files or other databases.

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
the ``create`` command to generate a binary database instead of the default CSV.
CSV databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library currently supports only binary databases.

As new tokenized strings are added, update the database with the ``add`` command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build, as shown in the sketch below.
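
For example, a post-build step might run the following, where the database and
image paths are hypothetical:

.. code-block:: sh

  # Merge tokens from the freshly built image into the checked-in database.
  ./database.py add --database tokens.csv out/firmware_image.elf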

Detokenization
--------------
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

Python
^^^^^^
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  # Example method on a class that provides a _log() helper.
  def process_log_message(self, log_message):
      result = detokenizer.detokenize(log_message.payload)
      self._log(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
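
A minimal sketch, assuming the constructor accepts database paths the same way
``Detokenizer`` does (check the ``pw_tokenizer`` package for the exact
interface):

.. code-block:: python

  import pw_tokenizer

  # Database files are monitored and reloaded automatically when they change.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer(
      'path/to/database.csv', 'other/path.elf')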

C++
^^^
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Android Java JNI implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``--type binary``). Read a binary-format database from a file or include it in
the source code. Pass the database array to ``TokenDatabase::Create``, and
construct a detokenizer.

.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
the data is invalid, ``TokenDatabase::Create`` returns an empty database for
which ``ok()`` returns false. If the token database is included in the source
code, this check can be done at compile time.

.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.

Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for C
due to the complexity of the hashing macros. In C++, hashing is done with a
``constexpr`` function rather than a macro, so the compilation time impact is
minimal. Projects primarily in C++ should use a large value for
``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
``std::numeric_limits<size_t>::max()``).
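
For example, a project might override the default hash length in its tokenizer
configuration. This is a sketch; how configuration macros are set depends on the
project's build setup:

.. code-block:: cpp

  // Set before pw_tokenizer headers are included, e.g. with a -D compiler
  // flag or in a project-wide configuration header.
  #define PW_TOKENIZER_CFG_HASH_LENGTH 128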

Base64 format
=============
The tokenizer defaults to a compact binary representation of tokenized messages.
Applications may desire a textual representation of tokenized strings. This
makes it easy to use tokenized messages alongside plain text messages, but comes
at an efficiency cost.

The tokenizer module supports prefixed Base64-encoded messages: a single
character (``$``) followed by the Base64-encoded message. For example, the token
``0xabcdef01`` followed by the argument ``0x05`` would be encoded as
``01 ef cd ab 05`` in binary and ``$Ae/NqwU=`` in Base64.

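The prefixed Base64 form can be produced from the binary encoding with any
Base64 library. For example, encoding the message above in Python:

.. code-block:: python

  import base64

  encoded_message = bytes([0x01, 0xEF, 0xCD, 0xAB, 0x05])  # Token + argument.
  print('$' + base64.b64encode(encoded_message).decode())  # Prints $Ae/NqwU=
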
Base64 decoding is supported in the Python detokenizer through the
``detokenize_base64`` and related functions. Base64 encoding and decoding are
not yet supported in C++, but it is straightforward to add Base64 encoding with
any Base64 library.

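A sketch of Base64 detokenization in Python; the exact name and location of the
function may differ, so consult the ``pw_tokenizer`` package:

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

  # Assumed interface: decodes a $-prefixed Base64 message using the database.
  print(detokenizer.detokenize_base64(b'$Ae/NqwU='))
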
.. tip::
  The detokenization tools support recursive detokenization for prefixed Base64
  text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the message ``$d4ffJaRn`` might be passed as the argument to a
  ``"Nested message: %s"`` string. The detokenizer would decode the message in
  two steps:

  ::

    "$alRZyuk2J3v=" → "Nested message: $d4ffJaRn" → "Nested message: Wow!"

War story: deploying tokenized logging to an existing product
=============================================================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product is complex, with several interacting
microcontrollers. It already had an established text-based logging system.
Deploying tokenization was straightforward and had substantial benefits.

**Results**

 * Log contents shrank by over 50%, even with Base64 encoding.

   * Significant size savings for encoded logs, even using the less-efficient
     Base64 encoding required for compatibility with the existing log system.
   * Freed valuable communication bandwidth.
   * Allowed storing many more logs in crash dumps.

 * Substantial flash savings.

   * Reduced the size of 115 KB and 172 KB firmware images by over 20 KB each.
   * Shaved over 100 KB from a large 2 MB image.

 * Simpler logging code.

   * Removed CPU-heavy ``snprintf`` calls.
   * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
 * In the project's logging macro, calls to the underlying logging function
   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
   invocation.
 * The log level was passed as the payload argument to facilitate runtime log
   level control.
 * For this project, it was necessary to encode the log messages as text. In
   ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
   messages.
 * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
 * The token database was stored as a CSV file in the project's Git repo.
 * The token database was automatically updated as part of the build, and
   developers were expected to check in the database changes alongside their
   code changes.
 * A presubmit check verified that all strings added by a change were added to
   the token database.
 * The token database included logs and asserts for all firmware images in the
   project.
 * No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. Make
  sure there is a simple script to resolve any merge conflicts. The script
  could either keep both sets of lines or discard local changes and regenerate
  the database.

Decoding tooling deployment
---------------------------
 * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

   * Product-specific Python command line tools, using
     ``pw_tokenizer.Detokenizer``.
   * A standalone script for decoding prefixed Base64 tokens in files or
     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
     interface.

 * The C++ detokenizer library was deployed to two Android apps with a Java
   Native Interface (JNI) layer.

   * The binary token database was included as a raw resource in the APK.
   * In one app, the built-in token database could be overridden by copying a
     file to the phone.

.. tip:: Make the tokenized logging tools simple to use.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

 1. Tokenized strings will not be discovered by the token database tools.
 2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang if you can.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary from which the strings are removed. The
strings could be removed by providing the appropriate linker flags or by
removing the ``used`` attribute from the tokenized string character array
declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
ELF section, so the sizes of these types can be verified by checking the ELF
file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG "TOKEN<([\\%" PRIx32 "/])>END_TOKEN"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: " PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings could
instead be prefixed Base64-encoded and sent as ``%s``. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support a
small number of arguments.

Compatibility
=============
 * C11
 * C++11
 * Python 3

Dependencies
============
 * pw_varint module
 * pw_preprocessor module
 * pw_span module