.. _chapter-tokenizer:

.. default-domain:: cpp

.. highlight:: sh

------------
pw_tokenizer
------------
The tokenizer module provides facilities for converting strings to binary
tokens. String literals are replaced with integer tokens in the firmware image,
which can be decoded off-device to restore the original string. Strings may be
printf-style format strings and include arguments, such as ``"My name is %s"``.
Arguments are encoded into compact binary form at runtime.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The module is
  called tokenizer because it replaces a whole string literal with an integer
  token. It does not parse strings into separate tokens.

The most common application of the tokenizer module is binary logging, and it
is designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings.

**Why tokenize strings?**

* Dramatically reduce binary size by removing string literals from binaries.
* Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization
  code.
* Reduce I/O traffic, RAM, and flash usage by sending and storing compact
  tokens instead of strings.
* Remove potentially sensitive log, assert, and other strings from binaries.

Basic operation
===============
1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
2. The tokenization macro removes the string literal by placing it in an ELF
   section that is excluded from the final binary.
3. Strings are extracted from an ELF to build a database of tokenized strings
   for use by the detokenizer. The ELF file may also be used directly.
4. During operation, the device encodes the string token and its arguments, if
   any.
5. The encoded tokenized strings are sent off-device or stored.
6. Off-device, the detokenizer tools use the token database or ELF files to
   detokenize the strings to human-readable form.
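
To make the encoding in steps 4 and 5 concrete, the sketch below models the
default encoded form in Python: the 32-bit token in little-endian byte order,
followed by the encoded argument bytes. This is an illustration, not the
encoder itself; on-device, the tokenization macros and ``pw_varint`` produce
these bytes.

.. code-block:: python

  import struct

  # Illustrative model of an encoded tokenized message: the 32-bit token in
  # little-endian byte order, followed by any already-encoded argument bytes.
  def encode_tokenized_message(token: int, encoded_args: bytes = b'') -> bytes:
      return struct.pack('<I', token) + encoded_args

  # Token 0xabcdef01 with the encoded argument byte 0x05; the same example
  # appears in the Base64 format section below.
  assert encode_tokenized_message(0xABCDEF01, b'\x05') == b'\x01\xef\xcd\xab\x05'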

Module usage
============
There are two sides to tokenization: tokenizing strings in the source code and
detokenizing these strings back to human-readable form.

Tokenization
------------
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off-device or stored in place of a full string.

Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke a ``PW_TOKENIZE_`` macro.

To tokenize a string literal, invoke ``PW_TOKENIZE_STRING``. This macro returns
a ``uint32_t`` token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

Format strings are tokenized into a fixed-size buffer. The buffer contains the
``uint32_t`` token followed by the encoded form of the arguments, if any. The
most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes to
a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, args...);

While ``PW_TOKENIZE_TO_BUFFER`` is flexible, its per-use code size overhead is
larger than its alternatives. ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer
on the stack and calls a ``void(const uint8_t* buffer, size_t buffer_size)``
callback that is provided at the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arg);

``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
since it takes the fewest arguments. Like the callback form, it encodes to a
buffer on the stack. It then calls the C-linkage function
``pw_TokenizerHandleEncodedMessage``, which must be defined by the project.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``void*`` argument to the global handler function. This can be used to pass a
log level or other metadata along with the tokenized string.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             args...);

  void pw_TokenizerHandleEncodedMessageWithPayload(
      void* payload, const uint8_t encoded_message[], size_t size_bytes);

.. tip::
  ``%s`` arguments are inefficient to encode and can quickly fill a
  tokenization buffer. Keep ``%s`` arguments short or avoid encoding them as
  strings if possible. See `Tokenized strings as %s arguments`_.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[256];  // Buffer size chosen for illustration.
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (void*)LogLevel_INFO,                   \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          __VA_ARGS__)

  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
      void* level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(reinterpret_cast<intptr_t>(level)) >=
        current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Database management
^^^^^^^^^^^^^^^^^^^
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Creating and maintaining a token database is simple. Token databases are managed
with the ``database.py`` script. The ``create`` command makes a new token
database from ELF files or other databases.

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
the ``create`` command to generate a binary database instead of the default CSV.
CSV databases are great for checking into source control and for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library currently only supports binary databases.

As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build, as sketched below.
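
The following is a hypothetical post-build step written in Python; the script
location, database path, and ELF path are placeholders to adapt to a real
project.

.. code-block:: python

  import subprocess

  # Hypothetical post-build step: add any new tokens from the freshly built
  # ELF to the checked-in CSV token database. All paths are placeholders.
  subprocess.run(
      ['./database.py', 'add',
       '--database', 'tools/tokenized_strings.csv',
       'out/firmware_image.elf'],
      check=True)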

Detokenization
--------------
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

Python
^^^^^^
To detokenize in Python, import the ``pw_tokenizer`` package and instantiate a
``Detokenizer`` with paths to token databases or ELF files.

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  def process_log_message(log_message):
      result = detokenizer.detokenize(log_message.payload)
      print(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
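
A minimal sketch of swapping in the auto-updating variant follows; it assumes
``AutoUpdatingDetokenizer`` accepts database paths the same way ``Detokenizer``
does.

.. code-block:: python

  import pw_tokenizer

  # Reloads the database if the file changes on disk; assumes the constructor
  # takes database paths like Detokenizer does.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer('path/to/database.csv')

  def process_log_message(log_message):
      print(str(detokenizer.detokenize(log_message.payload)))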

C++
^^^
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Android Java JNI implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``--type binary``). Read a binary-format database from a file or include it in
the source code. Pass the database array to ``TokenDatabase::Create``, and
construct a detokenizer.

.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.
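
For illustration, a classic SDBM-style x65599 hash truncated to a fixed number
of characters is sketched below. pw_tokenizer uses a modified variant of this
scheme, so this is not the exact token calculation; it only shows the shape of
the computation that the C macro must unroll.

.. code-block:: python

  # Classic x65599 (SDBM) rolling hash over at most hash_length characters.
  # pw_tokenizer uses a modified variant, so this sketch is illustrative only.
  def x65599_hash(string: str, hash_length: int) -> int:
      hash_value = 0
      for ch in string[:hash_length]:
          hash_value = (hash_value * 65599 + ord(ch)) % 2**32
      return hash_value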

Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for C
due to the complexity of the hashing macros. In C++, hashing is done with a
``constexpr`` function instead of a macro, so the compilation time impact is
minimal. Projects primarily in C++ should use a large value for
``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
``std::numeric_limits<size_t>::max()``).

Base64 format
=============
The tokenizer defaults to a compact binary representation of tokenized messages.
Applications may desire a textual representation of tokenized strings. This
makes it easy to use tokenized messages alongside plain text messages, but comes
at an efficiency cost.

The tokenizer module supports prefixed Base64-encoded messages: a single
character (``$``) followed by the Base64-encoded message. For example, the token
0xabcdef01 followed by the argument 0x05 would be encoded as ``01 ef cd ab 05``
in binary and ``$Ae/NqwU=`` in Base64.

Base64 decoding is supported in the Python detokenizer through the
``detokenize_base64`` and related functions. Base64 encoding and decoding are
not yet supported in C++, but it is straightforward to add Base64 encoding with
any Base64 library, as the sketch below shows.
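
For example, producing the prefixed Base64 form is a one-liner in Python; the
sketch below reproduces the ``$Ae/NqwU=`` example from above.

.. code-block:: python

  import base64

  # Prefix a binary tokenized message with '$' and Base64-encode it so it can
  # travel alongside plain text.
  def prefixed_base64(encoded_message: bytes) -> str:
      return '$' + base64.b64encode(encoded_message).decode('ascii')

  # The token 0xabcdef01 followed by the encoded argument byte 0x05:
  assert prefixed_base64(b'\x01\xef\xcd\xab\x05') == '$Ae/NqwU='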

.. tip::
  The detokenization tools support recursive detokenization for prefixed Base64
  text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the message ``"$d4ffJaRn"`` might be passed as the argument to a
  ``"Nested message: %s"`` string. The detokenizer would decode the message in
  two steps:

  ::

    "$alRZyuk2J3v=" → "Nested message: $d4ffJaRn" → "Nested message: Wow!"

War story: deploying tokenized logging to an existing product
=============================================================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product is complex, with several interacting
microcontrollers. It already had an established text-based logging system.
Deploying tokenization was straightforward and had substantial benefits.

**Results**
  * Log contents shrank by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the less-efficient
      Base64 encoding required for compatibility with the existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of 115 KB and 172 KB firmware images by over 20 KB each.
    * Shaved over 100 KB from a large 2 MB image.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function
  were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
  invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
  encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
  messages.
* Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of entries being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their
  code changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. If
  the database is in-source, make sure there is a simple script to resolve any
  merge conflicts. The script could either keep both sets of lines or discard
  local changes and regenerate the database.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip:: Make the tokenized logging tools simple to use.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
--------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

1. Tokenized strings will not be discovered by the token database tools.
2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang if you can.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
ELF section, so the sizes of these types can be verified by checking the ELF
file, if necessary.
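
As a sketch of that check, the section's raw bytes can be read with a tool such
as pyelftools; the ELF path below is a placeholder, and the section's internal
layout is not parsed here.

.. code-block:: python

  from elftools.elf.elffile import ELFFile  # third-party pyelftools

  # Sketch: retrieve the raw .tokenizer_info section, which records the sizes
  # of long, size_t, intptr_t, and ptrdiff_t on the tokenizing target.
  with open('firmware_image.elf', 'rb') as fd:
      section = ELFFile(fd).get_section_by_name('.tokenizer_info')
      print(section.data() if section else 'No .tokenizer_info section found')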

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG "TOKEN<([%" PRIx32 "/])>END_TOKEN"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: " PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Compatibility
=============
* C11
* C++11
* Python 3

Dependencies
============
* pw_varint module
* pw_preprocessor module
* pw_span module