.. _chapter-tokenizer:

.. default-domain:: cpp

.. highlight:: sh

------------
pw_tokenizer
------------

Logging is critical, but developers are often forced to choose between
additional logging and saving crucial flash space. ``pw_tokenizer`` helps
ameliorate this issue by providing facilities to convert strings to integer
tokens that can be decoded off-device, enabling extensive logging and debugging
with significantly less memory usage. Printf-style format strings such as ``"My
name is %s"`` are also supported; ``pw_tokenizer`` encodes the arguments into
compact binary form at runtime. We have seen over 50% reduction in log contents
and substantial savings in flash size, with additional benefits such as reduced
communication bandwidth and CPU usage.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of the tokenizer module is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings.

**Why tokenize strings?**

  * Dramatically reduce binary size by removing string literals from binaries.
  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
    tokens instead of strings.
  * Remove potentially sensitive log, assert, and other strings from binaries.

Example
=======

Before: With plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | LOG_INFO("Battery state: %s; battery      |               |
|                  | voltage: %d mV", state, voltage);         |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | "Battery state: %s; battery               | 41            |
|                  | voltage: %d mV"                           |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with "CHARGING"  |               |
|                  | and 3989 as arguments)                    |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | "Battery state: CHARGING; battery         | 49            |
|                  | voltage: 3989 mV"                         |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | "Battery state: CHARGING; battery         |               |
|                  | voltage: 3989 mV"                         |               |
+------------------+-------------------------------------------+---------------+

After: With tokenized logging

+------------------+-------------------------------------------------+---------+
| Location         | Logging Content                                 | Size in |
|                  |                                                 | bytes   |
+==================+=================================================+=========+
| Source contains  | LOG_INFO("Battery state: %s; battery            |         |
|                  | voltage: %d mV", state, voltage);               |         |
+------------------+-------------------------------------------------+---------+
| Binary contains  | 0x8e4728d9                                      | 4       |
+------------------+-------------------------------------------------+---------+
|                  | (log statement is called with "CHARGING"        |         |
|                  | and 3989 as arguments)                          |         |
+------------------+-------------------------------------------------+---------+
| Device transmits | =========== ========================== ======  | 15      |
|                  | d9 28 47 8e 08 43 48 41 52 47 49 4E 47 aa 3e    |         |
|                  | ----------- -------------------------- ------  |         |
|                  | Token       "CHARGING" argument        3989,    |         |
|                  |                                        as       |         |
|                  |                                        varint   |         |
|                  | =========== ========================== ======  |         |
+------------------+-------------------------------------------------+---------+
| When viewed      | "Battery state: CHARGING; battery               |         |
|                  | voltage: 3989 mV"                               |         |
+------------------+-------------------------------------------------+---------+

In the above logging example, we achieve a savings of ~90% in binary size (41 →
4 bytes) and 70% in bandwidth (49 → 15 bytes).

Basic operation
===============
There are two sides to tokenization: tokenizing strings in the source code and
detokenizing these strings back to human-readable form.

  1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
  2. The tokenization macro removes the string literal by placing it in an ELF
     section that is excluded from the final binary.
  3. Strings are extracted from an ELF to build a database of tokenized strings
     for use by the detokenizer. The ELF file may also be used directly.
  4. During operation, the device encodes the string token and its arguments, if
     any.
  5. The encoded tokenized strings are sent off-device or stored.
  6. Off-device, the detokenizer tools use the token database or ELF files to
     detokenize the strings to human-readable form.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
  %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``. It then calls the C-linkage
function ``pw_TokenizerHandleEncodedMessage``, which must be defined by the
project.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``void*`` argument to the global handler function. Values like a log level can
be packed into the ``void*``.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             arguments...);

  void pw_TokenizerHandleEncodedMessageWithPayload(void* payload,
                                                   const uint8_t encoded_message[],
                                                   size_t size_bytes);

.. admonition:: When to use this macro

  Use this macro anytime a global handler is sufficient, particularly for widely
  expanded macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
  ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
  for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);
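
  // The callback is supplied at the call site; it must match the
  // void(const uint8_t* buffer, size_t buffer_size) signature described above.
  void HandlerFunction(const uint8_t* buffer, size_t buffer_size);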

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
  use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
  widely expanded macros, such as a logging macro, because it will result in
  larger code size than its alternatives.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (void*)LogLevel_INFO, \
          __FILE_NAME__ ":%d " format, \
          __LINE__, \
          __VA_ARGS__);

  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
      void* level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(reinterpret_cast<uintptr_t>(level)) >=
        current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

  * **Integers** (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
    similarly to Protocol Buffers. Smaller values take fewer bytes.
  * **Floating point numbers** (4 bytes) -- Single precision floating point.
  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length byte indicates whether the string was truncated or
    not. The remaining 7 bits encode the string length, with a maximum of 127
    bytes.

.. TODO: insert diagram here!

.. tip::
  ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
  short or avoid encoding them as strings (e.g. encode an enum as an integer
  instead of a string). See also `Tokenized strings as %s arguments`_.
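
For illustration, the following Python sketch applies these encoding rules to a
token and its arguments. It is an example of the scheme described above, not the
``pw_tokenizer`` implementation; the ``encode_message`` helper is hypothetical
and only handles the common cases.

.. code-block:: python

  import struct

  def zigzag_varint(value: int) -> bytes:
      """ZigZag-encodes a signed integer, then varint-encodes it."""
      zigzag = (value << 1) ^ (value >> 63)  # assuming up to 64-bit arguments
      output = bytearray()
      while True:
          if zigzag < 0x80:
              output.append(zigzag)
              return bytes(output)
          output.append((zigzag & 0x7F) | 0x80)
          zigzag >>= 7

  def encode_message(token: int, *args) -> bytes:
      """Encodes a little-endian token followed by its encoded arguments."""
      output = bytearray(struct.pack('<I', token))
      for arg in args:
          if isinstance(arg, int):
              output += zigzag_varint(arg)
          elif isinstance(arg, float):
              output += struct.pack('<f', arg)  # single-precision float
          elif isinstance(arg, str):
              data = arg.encode()[:127]  # top bit of the length flags truncation
              output += bytes([len(data)]) + data
      return bytes(output)

  # Token 0x4b016e66 with the argument -1 (see the Base64 format section):
  # -1 ZigZag-encodes to 1, producing 66 6e 01 4b 01.
  assert encode_message(0x4b016e66, -1).hex() == '666e014b01'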

Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.

Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for C
due to the complexity of the hashing macros. In C++, the tokenization macros use
a constexpr function instead of a preprocessor-based hash, so the compilation
time impact is minimal. Projects primarily in C++ may use a large value for
``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
``std::numeric_limits<size_t>::max()``).
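
For reference, the sketch below shows the classic x65599 (SDBM-style) string
hash. ``pw_tokenizer`` uses a modified, fixed-length variant of this hash,
implemented as both a C preprocessor macro and a C++ constexpr function, so this
sketch does not reproduce the exact token values.

.. code-block:: python

  def x65599_hash(string: str, hash_length: int) -> int:
      """Classic SDBM-style x65599 hash over the first hash_length characters.

      Illustrative only; pw_tokenizer's token calculation is a modified,
      fixed-length variant of this hash.
      """
      hash_value = 0
      for char in string[:hash_length]:
          hash_value = (hash_value * 65599 + ord(char)) & 0xFFFFFFFF
      return hash_value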

Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

  141c35d5,          ,"The answer: ""%s"""
  2e668cd6,2019-12-25,"Jello, world!"
  7b940e2a,          ,"Hello %s! %hd %e"
  851beeb6,          ,"%u %d"
  881436a0,2020-01-01,"The answer is: %s"
  e13b0f94,2020-04-01,"%llu"
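
Because this is plain CSV, standard tooling can read it. Below is a minimal
sketch of loading such a database into a dictionary; the
``load_csv_token_database`` helper is hypothetical and not part of
``pw_tokenizer``.

.. code-block:: python

  import csv

  def load_csv_token_database(path: str) -> dict:
      """Loads {token: (removal_date or None, string)} from a CSV database."""
      database = {}
      with open(path, newline='') as csv_file:
          for token, removal_date, string in csv.reader(csv_file):
              database[int(token, 16)] = (removal_date.strip() or None, string)
      return database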

Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

  [header]
  0x00: 454b4f54 0000534e  TOKENS..
  0x08: 00000006 00000000  ........

  [entries]
  0x10: 141c35d5 ffffffff  .5......
  0x18: 2e668cd6 07e30c19  ..f.....
  0x20: 7b940e2a ffffffff  *..{....
  0x28: 851beeb6 ffffffff  ........
  0x30: 881436a0 07e40101  .6......
  0x38: e13b0f94 07e40401  ..;.....

  [string table]
  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
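
For illustration, here is a rough Python parser for this layout, as inferred
from the description and dump above (a 16-byte header beginning with ``TOKENS``
and an entry count, 8-byte little-endian token/date entries, then NUL-terminated
strings). This is a sketch only; ``token_database.h`` is the authoritative
definition of the format.

.. code-block:: python

  import struct

  def read_binary_token_database(data: bytes) -> dict:
      """Parses {token: (raw_removal_date or None, string)} from binary data."""
      magic, entry_count = struct.unpack_from('<8sI', data, 0)
      assert magic.rstrip(b'\0') == b'TOKENS'
      entries = [struct.unpack_from('<II', data, 16 + 8 * i)
                 for i in range(entry_count)]
      strings = data[16 + 8 * entry_count:].split(b'\0')[:entry_count]
      return {token: (None if date == 0xFFFFFFFF else date, string.decode())
              for (token, date), string in zip(entries, strings)}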

Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library only supports binary databases currently.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

  20200229 14:38:58 INF $HL2VHA==
  20200229 14:39:00 DBG $5IhTKg==
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF $EgFj8lVVAUI=
  20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

  1c95bd1c,          ,"Initiating retrieval process for recovery object"
  2a5388e4,          ,"Determining optimal algorithm and coordinating approach vectors"
  3743540c,          ,"Recovery object retrieval failed with status %s"
  f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

  20200229 14:38:58 INF Initiating retrieval process for recovery object
  20200229 14:39:00 DBG Determining optimal algorithm and coordinating approach vectors
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
  20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

  This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
  much space as the default binary format when encoded. For projects that wish
  to interleave tokenized messages with plain text, using Base64 is a worthwhile
  tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  def process_log_message(self, log_message):
      result = detokenizer.detokenize(log_message.payload)
      self._log(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
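
For example, an ``AutoUpdatingDetokenizer`` may be constructed in place of the
``Detokenizer`` above; this snippet assumes it accepts the same database and ELF
paths.

.. code-block:: python

  import pw_tokenizer

  # Automatically rereads these files if they change on disk.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer('path/to/database.csv',
                                                     'other/path.elf')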

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

  Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

  Plain text:  This is an example: -1! [23 bytes]

  Binary:      66 6e 01 4b 01          [ 5 bytes]

  Base64:      $Zm4BSwE=               [ 9 bytes]
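
The prefixed Base64 encoding is standard Base64 with a ``$`` prefix, so the
example above can be reproduced with a few lines of Python; the
``prefixed_base64_encode`` helper here is illustrative, not part of
``pw_tokenizer``.

.. code-block:: python

  import base64

  def prefixed_base64_encode(encoded_message: bytes, prefix: str = '$') -> str:
      """Encodes a binary tokenized message as prefixed Base64 text."""
      return prefix + base64.b64encode(encoded_message).decode()

  # The binary message from the example above: the token 0x4b016e66
  # (little-endian) followed by the varint-encoded argument -1.
  assert prefixed_base64_encode(bytes.fromhex('666e014b01')) == '$Zm4BSwE='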

Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_TokenizerPrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes) {
    char base64_buffer[64];
    size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
        pw::span(encoded_message, size_bytes), base64_buffer);

    TransmitLogMessage(base64_buffer, base64_size);
  }

Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
the ``detokenize_base64`` and related functions.
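
For example, the Base64 log lines shown earlier could be detokenized with
something like the following (hypothetical usage; consult the ``pw_tokenizer``
Python API for the exact function names and signatures):

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

  # Replaces recognized $-prefixed Base64 messages with their decoded text.
  print(detokenizer.detokenize_base64(b'20200229 14:38:58 INF $HL2VHA=='))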

.. tip::
  The Python detokenization tools support recursive detokenization for prefixed
  Base64 text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
  passed as an argument to the printf-style string ``Nested message: %s``, which
  encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
  as follows:

  ::

    "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_TokenizerPrefixedBase64Decode``
functions.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
  * Log contents shrank by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the less-efficient
      Base64 encoding required for compatibility with the existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of firmware images by up to 18%.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
  * In the project's logging macro, calls to the underlying logging function
    were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
    invocation.
  * The log level was passed as the payload argument to facilitate runtime log
    level control.
  * For this project, it was necessary to encode the log messages as text. In
    ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
    encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
    messages.
  * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
  * The token database was stored as a CSV file in the project's Git repo.
  * The token database was automatically updated as part of the build, and
    developers were expected to check in the database changes alongside their
    code changes.
  * A presubmit check verified that all strings added by a change were added to
    the token database.
  * The token database included logs and asserts for all firmware images in the
    project.
  * No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. If
  the database is in-source, make sure there is a simple script to resolve any
  merge conflicts. The script could either keep both sets of lines or discard
  local changes and regenerate the database.

Decoding tooling deployment
---------------------------
  * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

    * Product-specific Python command line tools, using
      ``pw_tokenizer.Detokenizer``.
    * Standalone script for decoding prefixed Base64 tokens in files or
      live output (e.g. from ``adb``), using ``detokenize.py``'s command line
      interface.

  * The C++ detokenizer library was deployed to two Android apps with a Java
    Native Interface (JNI) layer.

    * The binary token database was included as a raw resource in the APK.
    * In one app, the built-in token database could be overridden by copying a
      file to the phone.

.. tip::
  Make the tokenized logging tools simple to use for your project.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

  1. Tokenized strings will not be discovered by the token database tools.
  2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
ELF section, so the sizes of these types can be verified by checking the ELF
file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer % argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Compatibility
=============
  * C11
  * C++11
  * Python 3

Dependencies
============
  * ``pw_varint`` module
  * ``pw_preprocessor`` module
  * ``pw_span`` module