.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

  * Dramatically reduce binary size by removing string literals from binaries.
  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
    tokens instead of strings. We've seen over 50% reduction in encoded log
    contents.
  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  * Remove potentially sensitive log, assert, and other strings from binaries.

Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

  * **Tokenization** converts string literals in the source code to
    binary tokens at compile time. If the string has printf-style arguments,
    these are encoded to compact binary form at runtime.
  * **Detokenization** converts tokenized strings back to the original
    human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

  1. During compilation, the ``pw_tokenizer`` module hashes string literals to
     generate stable 32-bit tokens.
  2. The tokenization macro removes these strings by declaring them in an ELF
     section that is excluded from the final binary.
  3. After compilation, strings are extracted from the ELF to build a database
     of tokenized strings for use by the detokenizer. The ELF file may also be
     used directly.
  4. During operation, the device encodes the string token and its arguments, if
     any.
  5. The encoded tokenized strings are sent off-device or stored.
  6. Off-device, the detokenizer tools use the token database to decode the
     strings to human-readable form.

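The device and host halves of this flow can be sketched in a few lines of
Python. This is an illustrative sketch, not the module's implementation: the
database here is a plain dict standing in for a real token database, and the
token value and format string are the ones shown for the battery log in the
example that follows (arguments are omitted for simplicity).

```python
import struct

# Stand-in token database (step 3): maps 32-bit tokens to format strings.
TOKEN_DB = {0x8E4728D9: 'Battery state: %s; battery voltage: %d mV'}

def encode_token(token: int) -> bytes:
    """Step 4, without arguments: the device emits the token, little-endian."""
    return struct.pack('<I', token)

def detokenize(encoded: bytes) -> str:
    """Step 6: look up the leading 32-bit token in the database."""
    token, = struct.unpack_from('<I', encoded)
    return TOKEN_DB.get(token, f'<unknown token {token:08x}>')
```

With this sketch, the four bytes ``d9 28 47 8e`` decode back to the battery
format string, and an unknown token degrades gracefully to a placeholder.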
Example: tokenized logging
--------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
section describes one way ``pw_tokenizer`` might be integrated with a project.
These steps can be adapted as needed.

  1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
     are provided. For Make or other build systems, add the files specified in
     the BUILD.gn's ``pw_tokenizer`` target to the build.
  2. Use the tokenization macros in your code. See `Tokenization`_.
  3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
     linker script. In GN and CMake, this step is done automatically.
  4. Compile your code to produce an ELF file.
  5. Run ``database.py create`` on the ELF file to generate a CSV token
     database. See `Managing token databases`_.
  6. Commit the token database to your repository. See notes in `Database
     management`_.
  7. Integrate a ``database.py add`` command to your build to automatically
     update the committed token database. In GN, use the
     ``pw_tokenizer_database`` template to do this. See `Update a database`_.
  8. Integrate ``detokenize.py`` or the C++ detokenization library with your
     tools to decode tokenized logs. See `Detokenization`_.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
  %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                         size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``uintptr_t`` argument to the global handler function. Values like a log level
can be packed into the ``uintptr_t``.

This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
facade. The backend for this facade must define the
``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             arguments...);

  void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);

.. admonition:: When to use these macros

  Use anytime a global handler is sufficient, particularly for widely expanded
  macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` or
  ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
  for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
  use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
  widely expanded macros, such as a logging macro, because it will result in
  larger code size than its alternatives.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[256];  // Illustrative fixed-size formatting buffer.
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (pw_tokenizer_Payload)LogLevel_INFO,    \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          __VA_ARGS__)

  extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(level) >= current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Tokenizing function names
-------------------------
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

  // Tokenize the special function name variables.
  constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
  constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

  // Tokenize the function name variables to a handler function.
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__)
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__)

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

  * **Integers** (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
    similarly to Protocol Buffers. Smaller values take fewer bytes.
  * **Floating point numbers** (4 bytes) -- Single precision floating point.
  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length byte indicates whether the string was truncated or
    not. The remaining 7 bits encode the string length, with a maximum of 127
    bytes.

.. TODO: insert diagram here!

.. tip::
  ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
  short or avoid encoding them as strings (e.g. encode an enum as an integer
  instead of a string). See also `Tokenized strings as %s arguments`_.

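These rules can be modeled directly in Python. The sketch below is illustrative
(it is not ``pw_tokenizer.encode``), but it reproduces the 15-byte encoded
message shown in the tokenized logging example: token ``d9 28 47 8e``, the
string ``"CHARGING"`` with its length byte, and ``3989`` as a ZigZag varint.

```python
import struct

def _zigzag(value: int) -> int:
    # ZigZag: map signed integers to unsigned so small magnitudes stay small.
    return (value << 1) ^ (value >> 63)

def _varint(value: int) -> bytes:
    # Little-endian base-128 varint, as in Protocol Buffers.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode(token: int, *args) -> bytes:
    """Encode a token and its arguments per the rules above (sketch)."""
    data = bytearray(struct.pack('<I', token))  # 32-bit little-endian token
    for arg in args:
        if isinstance(arg, float):
            data += struct.pack('<f', arg)      # single-precision float
        elif isinstance(arg, int):
            data += _varint(_zigzag(arg))       # ZigZag + varint integer
        else:                                   # string: length byte + contents
            raw = arg.encode()
            truncated = 0x80 if len(raw) > 127 else 0
            raw = raw[:127]
            data += bytes([truncated | len(raw)]) + raw
    return bytes(data)
```

For example, ``encode(0x8E4728D9, 'CHARGING', 3989)`` yields the same 15 bytes
shown in the table earlier.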
Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, hashing is done with a ``constexpr`` function instead of a macro. This
function works with any length of string and has a lower compilation time impact
than the C macros. For consistency, C++ tokenization uses the same hash
algorithm, but the calculated values will differ between C and C++ for strings
longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

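The modified x65599 hash is easy to model in Python. This sketch is
illustrative rather than the module's reference implementation: it assumes the
hash starts from the string length and multiplies each character by an
increasing power of 65599, truncating to 32 bits. Passing a ``hash_length``
mimics the fixed-length C macro, which is why C and C++ values can diverge for
long strings.

```python
def hash_65599(string, hash_length=None):
    """Sketch of a modified x65599 hash over the first hash_length characters."""
    hash_value = len(string) & 0xFFFFFFFF  # seed with the string length
    coefficient = 65599
    for char in string[:hash_length]:
        hash_value = (hash_value + ord(char) * coefficient) & 0xFFFFFFFF
        coefficient = (coefficient * 65599) & 0xFFFFFFFF
    return hash_value
```

Strings shorter than ``hash_length`` hash identically with or without the
limit, which is why only long strings produce different C and C++ tokens.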
Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

  // Tokenizes this string to the default ("") domain.
  PW_TOKENIZE_STRING("Hello, world!");

  // Tokenizes this string to the "my_custom_domain" domain.
  PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

  ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

Smaller tokens with masking
---------------------------
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros.
For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
  uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
`token collisions`_.

Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

  - whether the tokenized data matches the string's arguments (if any), and
  - if / when the string was marked as having been removed from the database.

Working with collisions
^^^^^^^^^^^^^^^^^^^^^^^
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

  - Change one of the colliding strings slightly to give it a new token.
  - In C (not C++), artificial collisions may occur if strings longer than
    ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
    consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
    See ``pw_tokenizer/public/pw_tokenizer/config.h``.
  - Run the ``mark_removed`` command with the latest version of the build
    artifacts to mark missing strings as removed. This deprioritizes them in
    collision resolution.

    .. code-block:: sh

      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

    The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
such as module names, but won't be suitable for large sets of strings, like log
messages.

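The table's values follow from the standard birthday-problem approximation
``n ≈ sqrt(2 · 2^bits · ln(1 / (1 - p)))``. This small helper is illustrative
(not part of ``pw_tokenizer``) and reproduces the table's rounded figures:

```python
import math

def strings_for_collision_probability(token_bits: int, probability: float) -> int:
    """Approximate string count at which a collision occurs with probability p."""
    buckets = 2 ** token_bits  # number of possible token values
    return round(math.sqrt(2 * buckets * math.log(1 / (1 - probability))))
```

For instance, 32-bit tokens reach a 50% collision probability at roughly 77000
strings, matching the first row of the table.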
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

  141c35d5,          ,"The answer: ""%s"""
  2e668cd6,2019-12-25,"Jello, world!"
  7b940e2a,          ,"Hello %s! %hd %e"
  851beeb6,          ,"%u %d"
  881436a0,2020-01-01,"The answer is: %s"
  e13b0f94,2020-04-01,"%llu"

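Because this is ordinary CSV with doubled quote characters, Python's standard
``csv`` module reads it directly. This parsing sketch is illustrative (it is
not part of the ``pw_tokenizer`` API) and uses a few rows from the example
database above:

```python
import csv
import io

EXAMPLE_DB = '''141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
881436a0,2020-01-01,"The answer is: %s"
'''

def parse_csv_database(text):
    """Return {token: (removal_date or None, string)} from CSV database text."""
    database = {}
    for token, date, string in csv.reader(io.StringIO(text)):
        # A blank/whitespace date field means the string has not been removed.
        database[int(token, 16)] = (date.strip() or None, string)
    return database
```

Note how ``csv.reader`` unescapes the doubled quotes, so ``"The answer:
""%s"""`` comes back as ``The answer: "%s"``.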
Binary database format
----------------------
The binary database format is comprised of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

  [header]
  0x00: 454b4f54 0000534e  TOKENS..
  0x08: 00000006 00000000  ........

  [entries]
  0x10: 141c35d5 ffffffff  .5......
  0x18: 2e668cd6 07e30c19  ..f.....
  0x20: 7b940e2a ffffffff  *..{....
  0x28: 851beeb6 ffffffff  ........
  0x30: 881436a0 07e40101  .6......
  0x38: e13b0f94 07e40401  ..;.....

  [string table]
  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.

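A quick way to internalize this layout is to parse it with ``struct``. The
sketch below is illustrative only and assumes the layout visible in the dump
above: a ``TOKENS`` magic, the entry count at offset 8, 8-byte little-endian
token/date entries starting at offset 16, then null-terminated strings. It also
assumes the removal-date word packs the year, month, and day into its hex
fields (0x07e30c19 reads as 2019-12-25); the authoritative definition lives in
``token_database.h``.

```python
import struct

def parse_binary_database(data: bytes):
    """Parse the binary token database layout shown above (sketch)."""
    if not data.startswith(b'TOKENS'):
        raise ValueError('bad magic')
    (entry_count,) = struct.unpack_from('<I', data, 8)
    entries = []
    for index in range(entry_count):
        token, date = struct.unpack_from('<II', data, 16 + 8 * index)
        # 0xFFFFFFFF means no removal date; otherwise decode year/month/day.
        entries.append((token, None if date == 0xFFFFFFFF else
                        (date >> 16, (date >> 8) & 0xFF, date & 0xFF)))
    strings = data[16 + 8 * entry_count:].split(b'\x00')
    return [(token, date, text.decode())
            for (token, date), text in zip(entries, strings)]

# Build a two-entry database matching the first rows of the dump above.
blob = (b'TOKENS\x00\x00' + struct.pack('<II', 2, 0)
        + struct.pack('<II', 0x141C35D5, 0xFFFFFFFF)
        + struct.pack('<II', 0x2E668CD6, 0x07E30C19)
        + b'The answer: "%s"\x00Jello, world!\x00')
```

Running the parser over ``blob`` recovers the tokens, removal dates, and
strings from the CSV example.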
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library only supports binary databases currently.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

  import("//build_overrides/pigweed.gni")

  import("$dir_pw_tokenizer/database.gni")

  pw_tokenizer_database("my_database") {
    database = "database_in_the_source_tree.csv"
    targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
    input_databases = [ "other_database.csv" ]
  }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

  pw_tokenizer_database("my_database") {
    database = "database_in_the_source_tree.csv"
    deps = [ ":apps" ]
    paths = [ "$root_build_dir/**/*.elf" ]
  }

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
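
The lookup step can be sketched in plain Python. This is a hypothetical,
standalone illustration built from the example database above, not the actual
``pw_tokenizer`` implementation: the token is read from the first four bytes of
the Base64-decoded payload (little-endian) and looked up in the database.

.. code-block:: python

   import base64
   import csv
   import io

   # The example database from above, in CSV form.
   DATABASE = '''\
   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
   '''

   # Map each token (the first CSV column, in hex) to its format string.
   TOKENS = {
       int(row[0], 16): row[2] for row in csv.reader(io.StringIO(DATABASE))
   }

   def lookup(message):
       # Strip the '$' prefix and decode; the token is the first four bytes.
       payload = base64.b64decode(message[1:])
       return TOKENS[int.from_bytes(payload[:4], 'little')]

   print(lookup('$HL2VHA=='))  # Initiating retrieval process for recovery object

Arguments (such as the ``%s`` status in the last log) follow the token in the
payload; decoding them is the detokenizer's job and is omitted here.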

.. note::
   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that wish
   to interleave tokenized messages with plain text, Base64 is a worthwhile
   tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(result)

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
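
The reloading behavior can be sketched in a few lines of plain Python. This is
a hypothetical, simplified illustration of the idea, not the actual
``AutoUpdatingDetokenizer`` implementation: the database file's modification
time is checked before each lookup, and the file is re-read when it changes.

.. code-block:: python

   import os

   class AutoReloadingLookup:
       """Reloads a CSV token database whenever the file's mtime changes."""

       def __init__(self, path):
           self._path = path
           self._mtime = None
           self._tokens = {}
           self._maybe_reload()

       def _maybe_reload(self):
           mtime = os.path.getmtime(self._path)
           if mtime != self._mtime:
               self._mtime = mtime
               with open(self._path) as file:
                   # Each line: hex token, domain, quoted format string.
                   self._tokens = {
                       int(token, 16): string.strip().strip('"')
                       for token, _, string in (
                           line.split(',', 2) for line in file if line.strip())
                   }

       def lookup(self, token):
           self._maybe_reload()
           return self._tokens.get(token, '<unknown token>')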

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary-format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
the data is invalid, ``TokenDatabase::Create`` returns an empty database for
which ``ok()`` returns false. If the token database is included in the source
code, this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

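The encoding above can be reproduced with a short Python sketch. This is a
hypothetical illustration of the wire format, not the ``pw_tokenizer``
implementation: the 32-bit token is written little-endian, each integer
argument is ZigZag-mapped and varint-encoded, and the result is Base64-encoded
with a ``$`` prefix.

.. code-block:: python

   import base64

   def zigzag(value):
       # Map a signed integer to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
       return (value << 1) ^ (value >> 63)

   def varint(value):
       # Standard little-endian base-128 varint encoding.
       out = bytearray()
       while True:
           byte = value & 0x7F
           value >>= 7
           out.append(byte | 0x80 if value else byte)
           if not value:
               return bytes(out)

   def prefixed_base64(token, *int_args):
       payload = token.to_bytes(4, 'little')
       for arg in int_args:
           payload += varint(zigzag(arg))
       return '$' + base64.b64encode(payload).decode()

   print(prefixed_base64(0x4b016e66, -1))  # $Zm4BSwE=

Note how the five payload bytes (``66 6e 01 4b 01``) match the binary line in
the example: four token bytes, then the single varint byte for -1.
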
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes) {
     char base64_buffer[64];
     size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes), base64_buffer);

     TransmitLogMessage(base64_buffer, base64_size);
   }

Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
the ``detokenize_base64`` and related functions.

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.
Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
 * Log contents shrunk by over 50%, even with Base64 encoding.

   * Significant size savings for encoded logs, even using the less-efficient
     Base64 encoding required for compatibility with the existing log system.
   * Freed valuable communication bandwidth.
   * Allowed storing many more logs in crash dumps.

 * Substantial flash savings.

   * Reduced the size of firmware images by up to 18%.

 * Simpler logging code.

   * Removed CPU-heavy ``snprintf`` calls.
   * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
 * In the project's logging macro, calls to the underlying logging function
   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
   invocation.
 * The log level was passed as the payload argument to facilitate runtime log
   level control.
 * For this project, it was necessary to encode the log messages as text. In
   ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
   messages.
 * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If line numbers are desired in a tokenized
   string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
 * The token database was stored as a CSV file in the project's Git repo.
 * The token database was automatically updated as part of the build, and
   developers were expected to check in the database changes alongside their
   code changes.
 * A presubmit check verified that all strings added by a change were added to
   the token database.
 * The token database included logs and asserts for all firmware images in the
   project.
 * No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source database. If
   the database is in-source, make sure there is a simple script to resolve any
   merge conflicts. The script could either keep both sets of lines or discard
   local changes and regenerate the database.

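The "keep both sets of lines" strategy from the tip can be sketched in a few
lines of Python. This is a hypothetical helper, not part of ``pw_tokenizer``;
it assumes duplicate entries are harmless in a CSV token database:

.. code-block:: python

   def resolve_conflict(ours, theirs):
       # Keep the union of both sides' entries, sorted for a stable result.
       entries = set(ours.splitlines()) | set(theirs.splitlines())
       return '\n'.join(sorted(entries)) + '\n'
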
Decoding tooling deployment
---------------------------
 * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

   * Product-specific Python command line tools, using
     ``pw_tokenizer.Detokenizer``.
   * A standalone script for decoding prefixed Base64 tokens in files or
     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
     interface.

 * The C++ detokenizer library was deployed to two Android apps with a Java
   Native Interface (JNI) layer.

   * The binary token database was included as a raw resource in the APK.
   * In one app, the built-in token database could be overridden by copying a
     file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

 1. Tokenized strings will not be discovered by the token database tools.
 2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this bug.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support a
small number of arguments.

Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would re-calculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.

Compatibility
=============
 * C11
 * C++11
 * Python 3

Dependencies
============
 * ``pw_varint`` module
 * ``pw_preprocessor`` module
 * ``pw_span`` module