.. _module-pw_tokenizer-proto:

------------------------------------
Tokenized fields in protocol buffers
------------------------------------
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)
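
These tokenized forms are views of the same data: the binary-encoded message
begins with the 32-bit token in little-endian byte order, and the Base64 form
is the binary message encoded as Base64 and prefixed with ``$``. The short
Python sketch below (standard library only, not part of ``pw_tokenizer``)
shows how the example values above relate.

.. code-block:: python

   import base64
   import struct

   # The binary-encoded tokenized message from the list above.
   binary = bytes.fromhex('89b69f70')

   # The first four bytes, read as a little-endian integer, are the token.
   token = struct.unpack('<I', binary[:4])[0]
   assert token == 0x709fb689

   # Base64-encoding the binary message and prefixing '$' gives the
   # Base64-encoded tokenized message.
   assert '$' + base64.b64encode(binary).decode() == '$ibafcA=='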

``pw_tokenizer`` provides tools for working with protobuf fields that may
contain tokenized text.

Tokenized field protobuf option
===============================
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }
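
Either representation may be stored in the annotated field. As a hedged
illustration (assuming Python protobuf code generated from the message above),
a host might store plain text while a tokenizing device stores the binary
tokenized message:

.. code-block:: python

   message = MessageWithOptionallyTokenizedField()

   # Plain UTF-8 text may be stored directly in the optionally tokenized field.
   message.maybe_tokenized = 'This is plain text'.encode()

   # A tokenizing device stores the binary tokenized message instead: the
   # 32-bit token, optionally followed by encoded arguments.
   message.maybe_tokenized = bytes.fromhex('89b69f70')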

Decoding optionally tokenized strings
=====================================
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can be decoded reliably by attempting to
decode the field as binary or Base64 tokenized data before treating it as
plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

   flowchart TD
     start([Received bytes]) --> binary

     binary[Decode as<br>binary tokenized] --> binary_ok
     binary_ok{Detokenizes<br>successfully?} -->|no| utf8
     binary_ok -->|yes| done_binary([Display decoded binary])

     utf8[Decode as UTF-8] --> utf8_ok
     utf8_ok{Valid UTF-8?} -->|no| base64_encode
     utf8_ok -->|yes| base64

     base64_encode[Encode as<br>tokenized Base64] --> display
     display([Display encoded Base64])

     base64[Decode as<br>Base64 tokenized] --> base64_ok

     base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
     base64_ok -->|yes| base64_results

     is_plain_text{Text is<br>printable?} -->|no| base64_encode
     is_plain_text -->|yes| plain_text

     base64_results([Display decoded Base64])
     plain_text([Display text])

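
The same flow can be expressed in code. The following is a minimal sketch, not
the actual ``pw_tokenizer.proto`` implementation; it assumes a detokenizer
object whose ``detokenize()`` method returns a result that reports success via
``ok()`` and that formats as the decoded text via ``str()``.

.. code-block:: python

   import base64
   import re

   def _detokenize_base64(detokenizer, text: str, prefix: str) -> str:
       """Replaces any prefixed Base64 tokens in the text that detokenize."""
       def replace(match: re.Match) -> str:
           try:
               result = detokenizer.detokenize(base64.b64decode(match.group(1)))
           except ValueError:  # Not valid Base64; leave this span unchanged.
               return match.group(0)
           return str(result) if result.ok() else match.group(0)

       return re.sub(re.escape(prefix) + r'([A-Za-z0-9+/=]+)', replace, text)

   def decode_optionally_tokenized(detokenizer, data: bytes, prefix: str = '$') -> str:
       # 1. Attempt to decode the bytes as a binary tokenized message.
       result = detokenizer.detokenize(data)
       if result.ok():
           return str(result)

       # 2. If the bytes are not valid UTF-8, re-encode them as tokenized
       #    Base64 so the message stays recognizable and decodable later.
       try:
           text = data.decode()
       except UnicodeDecodeError:
           return prefix + base64.b64encode(data).decode()

       # 3. Attempt to decode the text as Base64 tokenized data. If it was
       #    fully or partially detokenized, display that result.
       detokenized = _detokenize_base64(detokenizer, text, prefix)
       if detokenized != text:
           return detokenized

       # 4. Nothing detokenized: show printable text as-is; otherwise fall
       #    back to tokenized Base64.
       if text.isprintable():
           return text
       return prefix + base64.b64encode(data).decode()
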
Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields yields correct results in
almost every situation. In rare circumstances it can fail, but these failures
can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, an
incorrect message could be displayed. This is very unlikely to occur. While
many tokens incidentally form valid UTF-8 strings, it is highly unlikely that a
device will happen to log one of these strings as plain text; the overwhelming
majority of them are nonsense.

Implementations that wish to guard against this extremely improbable situation
can prevent it by appending 0xFF (or another byte that is never valid in UTF-8)
to binary tokenized data that happens to be valid UTF-8 (or to all binary
tokenized messages, if desired). When decoding, a trailing 0xFF byte is
discarded.
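
As a hedged sketch of this mitigation (the helpers below are illustrative and
not part of ``pw_tokenizer``), the guard byte could be added and removed as
follows:

.. code-block:: python

   def append_utf8_guard(tokenized: bytes) -> bytes:
       """Appends 0xFF to tokenized data that could be mistaken for text."""
       try:
           tokenized.decode()  # Valid UTF-8 could be misread as plain text.
       except UnicodeDecodeError:
           return tokenized  # Already cannot be mistaken for UTF-8 text.
       return tokenized + b'\xff'  # 0xFF never appears in valid UTF-8.

   def strip_utf8_guard(data: bytes) -> bytes:
       """Discards a trailing 0xFF guard byte before detokenizing."""
       return data[:-1] if data.endswith(b'\xff') else data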

Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII. Since each byte has a 1 in 2 chance of being ASCII, 6.25%
(1 in 16) of 4-byte sequences are composed only of ASCII characters. When
decoding with an out-of-date token database, it is therefore possible that some
binary tokenized messages will be displayed as plain text rather than as
tokenized Base64.

This situation is likely to occur occasionally, but it should be infrequent,
and even when it does happen it is not a serious issue. A very small number of
strings will be displayed incorrectly, but those strings cannot be decoded
anyway. One nonsense string (e.g. ``a-D1``) would be displayed instead of
another (``$YS1EMQ==``). Updating the token database would resolve the issue,
though the non-Base64 logs would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending a byte
that is never valid in UTF-8 prevents the undecoded binary message from being
interpreted as plain text.

Python library
==============
The ``pw_tokenizer.proto`` module defines functions that may be used to
detokenize protobuf objects in Python. The function
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:

.. code-block:: python

   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'

pw_tokenizer.proto
------------------
.. automodule:: pw_tokenizer.proto
   :members: