.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
   This usage of the term "tokenizer" is not related to parsing! The
   module is called tokenizer because it replaces a whole string literal with an
   integer token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

 * Dramatically reduce binary size by removing string literals from binaries.
 * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
   tokens instead of strings. We've seen over 50% reduction in encoded log
   contents.
 * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
 * Remove potentially sensitive log, assert, and other strings from binaries.

Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

 * **Tokenization** converts string literals in the source code to
   binary tokens at compile time. If the string has printf-style arguments,
   these are encoded to compact binary form at runtime.
 * **Detokenization** converts tokenized strings back to the original
   human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

 1. During compilation, the ``pw_tokenizer`` module hashes string literals to
    generate stable 32-bit tokens.
 2. The tokenization macro removes these strings by declaring them in an ELF
    section that is excluded from the final binary.
 3. After compilation, strings are extracted from the ELF to build a database
    of tokenized strings for use by the detokenizer. The ELF file may also be
    used directly.
 4. During operation, the device encodes the string token and its arguments, if
    any.
 5. The encoded tokenized strings are sent off-device or stored.
 6. Off-device, the detokenizer tools use the token database to decode the
    strings to human-readable form.

Example: tokenized logging
--------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ==========  | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ----------  |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ==========  |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
section describes one way ``pw_tokenizer`` might be integrated with a project.
These steps can be adapted as needed.

 1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
    are provided. For Make or other build systems, add the files specified in
    the BUILD.gn's ``pw_tokenizer`` target to the build.
 2. Use the tokenization macros in your code. See `Tokenization`_.
 3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
    linker script. In GN and CMake, this step is done automatically.
 4. Compile your code to produce an ELF file.
 5. Run ``database.py create`` on the ELF file to generate a CSV token
    database. See `Managing token databases`_.
 6. Commit the token database to your repository. See notes in `Database
    management`_.
 7. Integrate a ``database.py add`` command to your build to automatically
    update the committed token database. In GN, use the
    ``pw_tokenizer_database`` template to do this. See `Update a database`_.
 8. Integrate ``detokenize.py`` or the C++ detokenization library with your
    tools to decode tokenized logs. See `Detokenization`_.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
   %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``uintptr_t`` argument to the global handler function. Values like a log level
can be packed into the ``uintptr_t``.

This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
facade. The backend for this facade must define the
``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                              format_string_literal,
                                              arguments...);

   void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);

.. admonition:: When to use these macros

   Use anytime a global handler is sufficient, particularly for widely expanded
   macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
   ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
   for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

   PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
   use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

   uint8_t buffer[BUFFER_SIZE];
   size_t size_bytes = sizeof(buffer);
   PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
   other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
   widely expanded macros, such as a logging macro, because it will result in
   larger code size than its alternatives.

.. _module-pw_tokenizer-custom-macro:

Tokenize with a custom macro
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Projects may need more flexibility than the standard ``pw_tokenizer`` macros
provide. To support this, projects may define custom tokenization macros. This
requires the use of two low-level ``pw_tokenizer`` macros:

.. c:macro:: PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)

   Tokenizes a format string and sets the ``_pw_tokenizer_token`` variable to
   the token. Must be used in its own scope, since the same variable is used in
   every invocation.

   The tokenized string uses the specified :ref:`tokenization domain
   <module-pw_tokenizer-domains>`. Use ``PW_TOKENIZER_DEFAULT_DOMAIN`` for the
   default. The token also may be masked; use ``UINT32_MAX`` to keep all bits.

.. c:macro:: PW_TOKENIZER_ARG_TYPES(...)

   Converts a series of arguments to a compact format that replaces the format
   string literal.

Use these two macros within the custom tokenization macro to call a function
that does the encoding. The following example implements a custom tokenization
macro for use with :ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(pw_tokenizer_Payload metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
     do {                                                                 \
       PW_TOKENIZE_FORMAT_STRING(                                        \
           PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(metadata,                                   \
                              _pw_tokenizer_token,                        \
                              PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
                                  PW_COMMA_ARGS(__VA_ARGS__));            \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
``pw::tokenizer::EncodedMessage`` class or ``pw::tokenizer::EncodeArgs``
function from ``pw_tokenizer/encode_args.h``. The encoded message can then be
transmitted or stored as needed.

.. code-block:: cpp

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               std::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const pw_tokenizer_Payload metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: When to use a custom macro

   Use existing tokenization macros whenever possible. A custom macro may be
   needed to support use cases like the following:

   * Variations of ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` that take
     different arguments.
   * Supporting global handler macros that use different handler functions.

Binary logging with pw_tokenizer
--------------------------------
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

   #define LOG_INFO(format, ...) \
       RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

   void RecordLog(LogLevel level, const char* file, int line, const char* format,
                  ...) {
     if (level < current_log_level) {
       return;
     }

     int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

     va_list args;
     va_start(args, format);
     bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
     va_end(args);

     TransmitLog(TimeSinceBootMillis(), buffer, bytes);
   }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

   #define LOG_INFO(format, ...)                   \
       PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
           (pw_tokenizer_Payload)LogLevel_INFO,    \
           __FILE_NAME__ ":%d " format,            \
           __LINE__,                               \
           __VA_ARGS__)

   extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
     if (static_cast<LogLevel>(level) >= current_log_level) {
       TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
     }
   }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Tokenizing function names
-------------------------
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

   // Tokenize the function name variables to a handler function.
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

This function requires that a string's token already be calculated. Typically,
these tokens are provided by a database, but they can be manually created using
the tokenizer hash.

.. autofunction:: pw_tokenizer.tokens.pw_tokenizer_65599_hash

This is particularly useful for offline token database generation in cases where
tokenized strings in a binary cannot be embedded as parsable pw_tokenizer
entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both, and merge the databases.

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

 * **Integers** (1--10 bytes) --
   `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
   similarly to Protocol Buffers. Smaller values take fewer bytes.
 * **Floating point numbers** (4 bytes) -- Single precision floating point.
 * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
   The top bit of the length byte indicates whether the string was truncated.
   The remaining 7 bits encode the string length, with a maximum of 127
   bytes.

.. TODO: insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
   short or avoid encoding them as strings (e.g. encode an enum as an integer
   instead of a string). See also `Tokenized strings as %s arguments`_.

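The encoding rules above can be sketched in Python. This is an illustrative
re-implementation, not the ``pw_tokenizer`` API: the ``zigzag``, ``varint``,
and ``encode`` helper names are made up for this example, and 64-bit ZigZag is
assumed for integers, as in Protocol Buffers.

```python
import struct

def zigzag(value):
    """Map a signed 64-bit integer to unsigned, as in protobuf ZigZag."""
    return (value << 1) ^ (value >> 63)

def varint(value):
    """Encode an unsigned integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def encode(token, *args):
    """Encode a token and its arguments per the rules described above."""
    encoded = struct.pack('<I', token)  # 32-bit token, little-endian
    for arg in args:
        if isinstance(arg, float):
            encoded += struct.pack('<f', arg)  # single precision
        elif isinstance(arg, int):
            encoded += varint(zigzag(arg))
        else:  # strings: length byte (top bit set if truncated), contents
            data = arg.encode()[:127]
            truncated = 0x80 if len(data) < len(arg.encode()) else 0
            encoded += bytes([truncated | len(data)]) + data
    return encoded

# The log example above: token 0x8e4728d9 with "CHARGING" and 3989.
payload = encode(0x8e4728d9, "CHARGING", 3989)
print(payload.hex(' '))  # d9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e
```

Note how ``3989`` ZigZags to 7978 and then varint-encodes to the two bytes
``aa 3e`` shown in the example table.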
Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

C++ macros use a constexpr function instead of a macro. This function works with
any length of string and has lower compilation time impact than the C macros.
For consistency, C++ tokenization uses the same hash algorithm, but the
calculated values will differ between C and C++ for strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

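For illustration, the hash can be sketched in Python. This sketch assumes the
algorithm matches the description above (x65599 coefficients, seeded with the
string length, modulo 2\ :sup:`32`); the real implementations are the C macros,
the C++ ``constexpr`` function, and ``pw_tokenizer.tokens`` in Python.

```python
def pw_tokenizer_65599_hash(string, hash_length=None):
    """Hash a string as described above: x65599, seeded with the length.

    hash_length limits how many characters are hashed, mimicking C's
    PW_TOKENIZER_CFG_C_HASH_LENGTH; None hashes every character (as in C++).
    """
    hash_value = len(string)  # The full length seeds the hash, even if truncated.
    coefficient = 65599
    for char in string[:hash_length]:
        hash_value = (hash_value + coefficient * ord(char)) % 2**32
        coefficient = (coefficient * 65599) % 2**32
    return hash_value

# The 31-byte example string from the Encoding section hashes to 0xdac9a244.
token = pw_tokenizer_65599_hash('You can go about your business.')
print(f'{token:08x}')  # dac9a244
```

Truncating with ``hash_length`` changes the result, which is exactly why C and
C++ tokens can differ for long strings.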
.. _module-pw_tokenizer-domains:

Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

Smaller tokens with masking
---------------------------
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros.
For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
`token collisions`_.

Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

 - whether the tokenized data matches the string's arguments (if any), and
 - if / when the string was marked as having been removed from the database.

Working with collisions
^^^^^^^^^^^^^^^^^^^^^^^
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

 - Change one of the colliding strings slightly to give it a new token.
 - In C (not C++), artificial collisions may occur if strings longer than
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
   consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
   See ``pw_tokenizer/public/pw_tokenizer/config.h``.
 - Run the ``mark_removed`` command with the latest version of the build
   artifacts to mark missing strings as removed. This deprioritizes them in
   collision resolution.

   .. code-block:: sh

      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

   The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
such as module names, but won't be suitable for large sets of strings, like log
messages.

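These counts follow from the standard birthday-problem approximation. A small
sketch (the helper name is hypothetical, chosen for this example) approximately
reproduces the table:

```python
import math

def strings_until_collision(token_bits, probability):
    """Approximate string count giving the requested collision probability.

    Uses the birthday-problem approximation
    n = sqrt(2 * 2**bits * ln(1 / (1 - p))) for a uniform random hash.
    """
    return round(math.sqrt(2 * 2**token_bits * math.log(1 / (1 - probability))))

# Approximately reproduces the 32-bit row of the table above.
print(strings_until_collision(32, 0.50))  # about 77000
print(strings_until_collision(32, 0.01))  # about 9300
```

The quadratic dependence on string count is why dropping from 32 to 16 token
bits shrinks the safe string count by a factor of roughly 256, not 2.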
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5, ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a, ,"Hello %s! %hd %e"
   851beeb6, ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

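Because the format is plain CSV, a database is easy to inspect with standard
tooling. This hypothetical sketch parses the example database above with
Python's ``csv`` module (the real ``pw_tokenizer`` Python package provides
richer database classes; this only illustrates the file format):

```python
import csv
import io

EXAMPLE_DATABASE = '''\
141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
7b940e2a, ,"Hello %s! %hd %e"
851beeb6, ,"%u %d"
881436a0,2020-01-01,"The answer is: %s"
e13b0f94,2020-04-01,"%llu"
'''

entries = {}
for token, date, string in csv.reader(io.StringIO(EXAMPLE_DATABASE)):
    # The token is hexadecimal; a blank (whitespace-only) removal date means
    # the string is still in use. Doubled quotes are unescaped by csv.reader.
    entries[int(token, 16)] = (date.strip() or None, string)

print(entries[0x2e668cd6])  # ('2019-12-25', 'Jello, world!')
```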
Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e TOKENS..
   0x08: 00000006 00000000 ........

   [entries]
   0x10: 141c35d5 ffffffff .5......
   0x18: 2e668cd6 07e30c19 ..f.....
   0x20: 7b940e2a ffffffff *..{....
   0x28: 851beeb6 ffffffff ........
   0x30: 881436a0 07e40101 .6......
   0x38: e13b0f94 07e40401 ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00          is: %s.%llu.

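The fixed-size layout makes the entry table easy to walk with ``struct``. This
sketch decodes the header and the first two entries of the dump above; the
byte-level interpretation (little-endian fields, year/month/day packed into the
date word) is inferred from the example rather than taken from the official
parser, so treat it as illustrative:

```python
import struct

# The [header] and the first two [entries] from the hex dump above.
DATA = bytes.fromhex(
    '544f4b454e530000'       # magic: "TOKENS\0\0"
    '0600000000000000'       # entry count: 6
    'd5351c14' 'ffffffff'    # 141c35d5, no removal date
    'd68c662e' '190ce307'    # 2e668cd6, removed 2019-12-25
)

magic, entry_count = struct.unpack_from('<8sI', DATA, 0)

entries = []
for i in range(2):  # Only the first two entries are included in DATA.
    token, date = struct.unpack_from('<II', DATA, 16 + 8 * i)
    if date == 0xFFFFFFFF:  # 0xFFFFFFFF means no removal date.
        entries.append((token, None))
    else:  # Year in the upper 16 bits, then one byte each for month and day.
        entries.append((token, (date >> 16, (date >> 8) & 0xFF, date & 0xFF)))

print(magic, entry_count, entries[1][1])
```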
JSON support
------------
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently only supports binary databases.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps`` or build other
   GN targets first if this is a concern.

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that wish
   to interleave tokenized with plain text, using Base64 is a worthwhile
   tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(result)

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.

For messages that are optionally tokenized and may be encoded as binary,
Base64, or plaintext UTF-8, use
:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
determine the correct method to detokenize and always provide a printable
string. For more information on this feature, see
:ref:`module-pw_tokenizer-proto`.

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

Protocol buffers
----------------
``pw_tokenizer`` provides utilities for handling tokenized fields in protobufs.
See :ref:`module-pw_tokenizer-proto` for details.

.. toctree::
   :hidden:

   proto.rst

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

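The 5-byte binary message above can be reproduced in a few lines. This sketch
(plain Python, assuming the default little-endian 32-bit token and
zigzag-varint integer argument encoding, which matches the example bytes)
builds the prefixed Base64 message:

```python
import base64
import struct

def varint_encode(value: int) -> bytes:
    """Zigzag-encode a signed integer, then emit it as a 7-bit varint."""
    zigzag = (value << 1) ^ (value >> 63)  # Arithmetic shift sign-extends.
    out = bytearray()
    while zigzag >= 0x80:
        out.append(0x80 | (zigzag & 0x7F))
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)

def prefixed_base64(token: int, *int_args: int) -> str:
    binary = struct.pack('<I', token) + b''.join(map(varint_encode, int_args))
    return '$' + base64.b64encode(binary).decode()

# "This is an example: %d!" has token 0x4b016e66; the argument is -1.
print(prefixed_base64(0x4B016E66, -1))  # $Zm4BSwE=
```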
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes) {
     char base64_buffer[64];
     size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes), base64_buffer);

     TransmitLogMessage(base64_buffer, base64_size);
   }

Decoding
--------
The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
Base64 messages with ``detokenize_base64`` and related methods.

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.

.. code-block:: cpp

   void ProcessBase64Message(std::string_view base64_message) {
     // Decode the prefixed Base64 text back to binary tokenized data.
     std::byte binary_buffer[64];
     size_t binary_size = pw::tokenizer::PrefixedBase64Decode(
         base64_message, binary_buffer);

     HandleBinaryMessage(pw::span(binary_buffer, binary_size));
   }

Investigating undecoded messages
--------------------------------
Tokenized messages cannot be decoded if the token is not recognized. The Python
package includes the ``parse_message`` tool, which parses tokenized Base64
messages without looking up the token in a database. This tool attempts to guess
the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable
message. It could help identify which statement in the code produced the
message. This tool is not particularly helpful for tokenized messages without
arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the
``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
see full usage information.

Example
^^^^^^^
.. code-block::

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token:  0x33496fdf
   INF Args:   b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%s]
   INF   Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token:  0xe7a58492
   INF Args:   b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF   Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

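The guessing strategy can be approximated in a few lines. This sketch
(illustrative, not the real ``parse_message`` implementation) decodes the
argument payload from the second example above against one candidate format,
reading ``%d`` as a zigzag varint and ``%s`` as a length-prefixed string:

```python
def decode_args(payload: bytes, spec: list) -> list:
    """Decode payload bytes following a candidate list of '%d'/'%s' specs."""
    args = []
    for conversion in spec:
        if conversion == '%s':
            length = payload[0]  # One length byte, then the characters.
            args.append(payload[1:1 + length].decode())
            payload = payload[1 + length:]
        else:  # '%d': little-endian 7-bit varint, then zigzag decode.
            value = shift = 0
            while True:
                byte = payload[0]
                payload = payload[1:]
                value |= (byte & 0x7F) << shift
                shift += 7
                if not byte & 0x80:
                    break
            args.append((value >> 1) ^ -(value & 1))
    return args

# The argument bytes from the second example message above.
payload = b'n\x13FAILED_PRECONDITION\x02OK'
print(decode_args(payload, ['%d', '%s', '%s']))  # [55, 'FAILED_PRECONDITION', 'OK']
```

The tool simply tries many candidate spec lists like this one and reports the
attempts that consume the payload cleanly.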
Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
 * Log contents shrunk by over 50%, even with Base64 encoding.

   * Significant size savings for encoded logs, even using the less-efficient
     Base64 encoding required for compatibility with the existing log system.
   * Freed valuable communication bandwidth.
   * Allowed storing many more logs in crash dumps.

 * Substantial flash savings.

   * Reduced the size of firmware images by up to 18%.

 * Simpler logging code.

   * Removed CPU-heavy ``snprintf`` calls.
   * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
 * In the project's logging macro, calls to the underlying logging function
   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
   invocation.
 * The log level was passed as the payload argument to facilitate runtime log
   level control.
 * For this project, it was necessary to encode the log messages as text. In
   ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
   messages.
 * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
   numbers are encoded in the log metadata. Line numbers may also be included
   by adding ``"%d"`` to the format string and passing ``__LINE__``.

Database management
-------------------
 * The token database was stored as a CSV file in the project's Git repo.
 * The token database was automatically updated as part of the build, and
   developers were expected to check in the database changes alongside their
   code changes.
 * A presubmit check verified that all strings added by a change were added to
   the token database.
 * The token database included logs and asserts for all firmware images in the
   project.
 * No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source database. If
   the database is in-source, make sure there is a simple script to resolve any
   merge conflicts. The script could either keep both sets of lines or discard
   local changes and regenerate the database.
1083
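A conflict-resolution script along the lines suggested in the tip can be very
small. This hypothetical sketch (written for the CSV format described above)
keeps the union of both sides' entries and rewrites the file deterministically:

```python
import re

def resolve_conflicts(csv_text: str) -> str:
    """Keep both sides of every Git conflict, then sort and deduplicate."""
    # Drop the <<<<<<<, =======, and >>>>>>> conflict markers, keeping the
    # database lines from both versions of the file.
    lines = [line for line in csv_text.splitlines()
             if not re.match(r'^(<{7}|={7}|>{7})', line)]
    return '\n'.join(sorted(set(filter(None, lines)))) + '\n'

conflicted = '''\
141c35d5, ,"The answer: ""%s"""
<<<<<<< HEAD
851beeb6, ,"%u %d"
=======
e13b0f94, ,"%llu"
>>>>>>> other-branch
'''
print(resolve_conflicts(conflicted))
```

Since the database is append-mostly and keyed by token, keeping both sides is
safe; the build's next database update reconciles any remaining differences.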
Decoding tooling deployment
---------------------------
 * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

   * Product-specific Python command line tools, using
     ``pw_tokenizer.Detokenizer``.
   * Standalone script for decoding prefixed Base64 tokens in files or
     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
     interface.

 * The C++ detokenizer library was deployed to two Android apps with a Java
   Native Interface (JNI) layer.

   * The binary token database was included as a raw resource in the APK.
   * In one app, the built-in token database could be overridden by copying a
     file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

 1. Tokenized strings will not be discovered by the token database tools.
 2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer % argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would re-calculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.

Compatibility
=============
 * C11
 * C++14
 * Python 3

Dependencies
============
 * ``pw_varint`` module
 * ``pw_preprocessor`` module
 * ``pw_span`` module