.. _chapter-tokenizer:

.. default-domain:: cpp

.. highlight:: sh

------------
pw_tokenizer
------------

Logging is critical, but developers are often forced to choose between
additional logging and saving crucial flash space. ``pw_tokenizer`` helps
ameliorate this issue by providing facilities to convert strings to integer
tokens that can be decoded off-device, enabling extensive logging and debugging
with significantly less memory usage. Printf-style format strings such as ``"My
name is %s"`` are also supported; ``pw_tokenizer`` encodes the arguments into
compact binary form at runtime. We have seen over 50% reduction in log contents
and substantial savings in flash size, with additional benefits such as reduced
communication bandwidth and CPU usage.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of the tokenizer module is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings.

**Why tokenize strings?**

  * Dramatically reduce binary size by removing string literals from binaries.
  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
    tokens instead of strings.
  * Remove potentially sensitive log, assert, and other strings from binaries.

Example
=======

Before: With plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | LOG_INFO("Battery state: %s; battery      |               |
|                  | voltage: %d mV", state, voltage);         |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | "Battery state: %s; battery               | 41            |
|                  | voltage: %d mV"                           |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with "CHARGING"  |               |
|                  | and 3989 as arguments)                    |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | "Battery state: CHARGING; battery         | 49            |
|                  | voltage: 3989 mV"                         |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | "Battery state: CHARGING; battery         |               |
|                  | voltage: 3989 mV"                         |               |
+------------------+-------------------------------------------+---------------+

After: With tokenized logging

+------------------+-------------------------------------------------+---------+
| Location         | Logging Content                                 | Size in |
|                  |                                                 | bytes   |
+==================+=================================================+=========+
| Source contains  | LOG_INFO("Battery state: %s; battery            |         |
|                  | voltage: %d mV", state, voltage);               |         |
+------------------+-------------------------------------------------+---------+
| Binary contains  | 0x8e4728d9                                      | 4       |
+------------------+-------------------------------------------------+---------+
|                  | (log statement is called with "CHARGING"        |         |
|                  | and 3989 as arguments)                          |         |
+------------------+-------------------------------------------------+---------+
| Device transmits | =========== ========================== ======  | 15      |
|                  | d9 28 47 8e 08 43 48 41 52 47 49 4E 47 aa 3e    |         |
|                  | ----------- -------------------------- ------  |         |
|                  | Token       "CHARGING" argument        3989,    |         |
|                  |                                        as       |         |
|                  |                                        varint   |         |
|                  | =========== ========================== ======  |         |
+------------------+-------------------------------------------------+---------+
| When viewed      | "Battery state: CHARGING; battery               |         |
|                  | voltage: 3989 mV"                               |         |
+------------------+-------------------------------------------------+---------+

In the above logging example, we achieve a savings of ~90% in binary size (41 →
4 bytes) and 70% in bandwidth (49 → 15 bytes).

Basic operation
===============
There are two sides to tokenization: tokenizing strings in the source code and
detokenizing these strings back to human-readable form.

  1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
  2. The tokenization macro removes the string literal by placing it in an ELF
     section that is excluded from the final binary.
  3. Strings are extracted from an ELF to build a database of tokenized strings
     for use by the detokenizer. The ELF file may also be used directly.
  4. During operation, the device encodes the string token and its arguments, if
     any.
  5. The encoded tokenized strings are sent off-device or stored.
  6. Off-device, the detokenizer tools use the token database or ELF files to
     detokenize the strings to human-readable form.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
  %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``. It then calls the C-linkage
function ``pw_TokenizerHandleEncodedMessage``, which must be defined by the
project.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``void*`` argument to the global handler function. Values like a log level can
be packed into the ``void*``.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             arguments...);

  void pw_TokenizerHandleEncodedMessageWithPayload(void* payload,
                                                   const uint8_t encoded_message[],
                                                   size_t size_bytes);

.. admonition:: When to use this macro

  Use this macro anytime a global handler is sufficient, particularly for widely
  expanded macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
  ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
  for tokenizing printf-style strings.

Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);
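
  // The callback is supplied at the call site; it must match the
  // void(const uint8_t* buffer, size_t buffer_size) signature described above.
  void HandlerFunction(const uint8_t* buffer, size_t buffer_size);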

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
  use for another purpose or more flexibility is needed.

Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
to a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
  widely expanded macros, such as a logging macro, because it will result in
  larger code size than its alternatives.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (void*)LogLevel_INFO, \
          __FILE_NAME__ ":%d " format, \
          __LINE__, \
          __VA_ARGS__);

  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
      void* level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(reinterpret_cast<uintptr_t>(level)) >=
        current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

  * **Integers** (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
    similarly to Protocol Buffers. Smaller values take fewer bytes.
  * **Floating point numbers** (4 bytes) -- Single precision floating point.
  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length byte indicates whether the string was truncated or
    not. The remaining 7 bits encode the string length, with a maximum of 127
    bytes.

.. TODO: insert diagram here!

.. tip::
  ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
  short or avoid encoding them as strings (e.g. encode an enum as an integer
  instead of a string). See also `Tokenized strings as %s arguments`_.
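
For illustration, the following Python sketch applies these encoding rules to a
token and its arguments. It is an example of the scheme described above, not the
``pw_tokenizer`` implementation; the ``encode_message`` helper is hypothetical
and only handles the common cases.

.. code-block:: python

  import struct

  def zigzag_varint(value: int) -> bytes:
      """ZigZag-encodes a signed integer, then varint-encodes it."""
      zigzag = (value << 1) ^ (value >> 63)  # assuming up to 64-bit arguments
      output = bytearray()
      while True:
          if zigzag < 0x80:
              output.append(zigzag)
              return bytes(output)
          output.append((zigzag & 0x7F) | 0x80)
          zigzag >>= 7

  def encode_message(token: int, *args) -> bytes:
      """Encodes a little-endian token followed by its encoded arguments."""
      output = bytearray(struct.pack('<I', token))
      for arg in args:
          if isinstance(arg, int):
              output += zigzag_varint(arg)
          elif isinstance(arg, float):
              output += struct.pack('<f', arg)  # single-precision float
          elif isinstance(arg, str):
              data = arg.encode()[:127]  # top bit of the length flags truncation
              output += bytes([len(data)]) + data
      return bytes(output)

  # Token 0x4b016e66 with the argument -1 (see the Base64 format section):
  # -1 ZigZag-encodes to 1, producing 66 6e 01 4b 01.
  assert encode_message(0x4b016e66, -1).hex() == '666e014b01'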

Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.

Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for C
due to the complexity of the hashing macros. In C++, the tokenization macros use
a constexpr function instead of a preprocessor-based hash, so the compilation
time impact is minimal. Projects primarily in C++ may use a large value for
``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
``std::numeric_limits<size_t>::max()``).
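
For reference, the sketch below shows the classic x65599 (SDBM-style) string
hash. ``pw_tokenizer`` uses a modified, fixed-length variant of this hash,
implemented as both a C preprocessor macro and a C++ constexpr function, so this
sketch does not reproduce the exact token values.

.. code-block:: python

  def x65599_hash(string: str, hash_length: int) -> int:
      """Classic SDBM-style x65599 hash over the first hash_length characters.

      Illustrative only; pw_tokenizer's token calculation is a modified,
      fixed-length variant of this hash.
      """
      hash_value = 0
      for char in string[:hash_length]:
          hash_value = (hash_value * 65599 + ord(char)) & 0xFFFFFFFF
      return hash_value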

Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

  141c35d5,          ,"The answer: ""%s"""
  2e668cd6,2019-12-25,"Jello, world!"
  7b940e2a,          ,"Hello %s! %hd %e"
  851beeb6,          ,"%u %d"
  881436a0,2020-01-01,"The answer is: %s"
  e13b0f94,2020-04-01,"%llu"
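
Because this is plain CSV, standard tooling can read it. Below is a minimal
sketch of loading such a database into a dictionary; the
``load_csv_token_database`` helper is hypothetical and not part of
``pw_tokenizer``.

.. code-block:: python

  import csv

  def load_csv_token_database(path: str) -> dict:
      """Loads {token: (removal_date or None, string)} from a CSV database."""
      database = {}
      with open(path, newline='') as csv_file:
          for token, removal_date, string in csv.reader(csv_file):
              database[int(token, 16)] = (removal_date.strip() or None, string)
      return database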

Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

  [header]
  0x00: 454b4f54 0000534e  TOKENS..
  0x08: 00000006 00000000  ........

  [entries]
  0x10: 141c35d5 ffffffff  .5......
  0x18: 2e668cd6 07e30c19  ..f.....
  0x20: 7b940e2a ffffffff  *..{....
  0x28: 851beeb6 ffffffff  ........
  0x30: 881436a0 07e40101  .6......
  0x38: e13b0f94 07e40401  ..;.....

  [string table]
  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
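
For illustration, here is a rough Python parser for this layout, as inferred
from the description and dump above (a 16-byte header beginning with ``TOKENS``
and an entry count, 8-byte little-endian token/date entries, then NUL-terminated
strings). This is a sketch only; ``token_database.h`` is the authoritative
definition of the format.

.. code-block:: python

  import struct

  def read_binary_token_database(data: bytes) -> dict:
      """Parses {token: (raw_removal_date or None, string)} from binary data."""
      magic, entry_count = struct.unpack_from('<8sI', data, 0)
      assert magic.rstrip(b'\0') == b'TOKENS'
      entries = [struct.unpack_from('<II', data, 16 + 8 * i)
                 for i in range(entry_count)]
      strings = data[16 + 8 * entry_count:].split(b'\0')[:entry_count]
      return {token: (None if date == 0xFFFFFFFF else date, string.decode())
              for (token, date), string in zip(entries, strings)}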

Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library only supports binary databases currently.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

  20200229 14:38:58 INF $HL2VHA==
  20200229 14:39:00 DBG $5IhTKg==
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF $EgFj8lVVAUI=
  20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

  1c95bd1c,          ,"Initiating retrieval process for recovery object"
  2a5388e4,          ,"Determining optimal algorithm and coordinating approach vectors"
  3743540c,          ,"Recovery object retrieval failed with status %s"
  f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

  20200229 14:38:58 INF Initiating retrieval process for recovery object
  20200229 14:39:00 DBG Determining optimal algorithm and coordinating approach vectors
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
  20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

  This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
  much space as the default binary format when encoded. For projects that wish
  to interleave tokenized messages with plain text, using Base64 is a worthwhile
  tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  def process_log_message(self, log_message):
      result = detokenizer.detokenize(log_message.payload)
      self._log(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
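
For example, an ``AutoUpdatingDetokenizer`` may be constructed in place of the
``Detokenizer`` above; this snippet assumes it accepts the same database and ELF
paths.

.. code-block:: python

  import pw_tokenizer

  # Automatically rereads these files if they change on disk.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer('path/to/database.csv',
                                                     'other/path.elf')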

C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }

Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

  Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

  Plain text:  This is an example: -1! [23 bytes]

  Binary:      66 6e 01 4b 01          [ 5 bytes]

  Base64:      $Zm4BSwE=               [ 9 bytes]
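
The prefixed Base64 encoding is standard Base64 with a ``$`` prefix, so the
example above can be reproduced with a few lines of Python; the
``prefixed_base64_encode`` helper here is illustrative, not part of
``pw_tokenizer``.

.. code-block:: python

  import base64

  def prefixed_base64_encode(encoded_message: bytes, prefix: str = '$') -> str:
      """Encodes a binary tokenized message as prefixed Base64 text."""
      return prefix + base64.b64encode(encoded_message).decode()

  # The binary message from the example above: the token 0x4b016e66
  # (little-endian) followed by the varint-encoded argument -1.
  assert prefixed_base64_encode(bytes.fromhex('666e014b01')) == '$Zm4BSwE='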

Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_TokenizerPrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes) {
    char base64_buffer[64];
    size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
        pw::span(encoded_message, size_bytes), base64_buffer);

    TransmitLogMessage(base64_buffer, base64_size);
  }

Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
the ``detokenize_base64`` and related functions.
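
For example, the Base64 log lines shown earlier could be detokenized with
something like the following (hypothetical usage; consult the ``pw_tokenizer``
Python API for the exact function names and signatures):

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

  # Replaces recognized $-prefixed Base64 messages with their decoded text.
  print(detokenizer.detokenize_base64(b'20200229 14:38:58 INF $HL2VHA=='))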

.. tip::
  The Python detokenization tools support recursive detokenization for prefixed
  Base64 text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
  passed as an argument to the printf-style string ``Nested message: %s``, which
  encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
  as follows:

  ::

    "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_TokenizerPrefixedBase64Decode``
functions.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
  * Log contents shrank by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the less-efficient
      Base64 encoding required for compatibility with the existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of firmware images by up to 18%.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
  * In the project's logging macro, calls to the underlying logging function
    were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
    invocation.
  * The log level was passed as the payload argument to facilitate runtime log
    level control.
  * For this project, it was necessary to encode the log messages as text. In
    ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
    encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
    messages.
  * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
  * The token database was stored as a CSV file in the project's Git repo.
  * The token database was automatically updated as part of the build, and
    developers were expected to check in the database changes alongside their
    code changes.
  * A presubmit check verified that all strings added by a change were added to
    the token database.
  * The token database included logs and asserts for all firmware images in the
    project.
  * No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. If
  the database is in-source, make sure there is a simple script to resolve any
  merge conflicts. The script could either keep both sets of lines or discard
  local changes and regenerate the database.

Decoding tooling deployment
---------------------------
  * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

    * Product-specific Python command line tools, using
      ``pw_tokenizer.Detokenizer``.
    * Standalone script for decoding prefixed Base64 tokens in files or
      live output (e.g. from ``adb``), using ``detokenize.py``'s command line
      interface.

  * The C++ detokenizer library was deployed to two Android apps with a Java
    Native Interface (JNI) layer.

    * The binary token database was included as a raw resource in the APK.
    * In one app, the built-in token database could be overridden by copying a
      file to the phone.

.. tip::
  Make the tokenized logging tools simple to use for your project.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

  1. Tokenized strings will not be discovered by the token database tools.
  2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang to avoid this.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
ELF section, so the sizes of these types can be verified by checking the ELF
file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer % argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Compatibility
=============
  * C11
  * C++11
  * Python 3

Dependencies
============
  * ``pw_varint`` module
  * ``pw_preprocessor`` module
  * ``pw_span`` module