pw_tokenizer: Replace string literals with tokens
pw_tokenizer provides macros that replace printf-style string literals
with 32-bit hashes at compile time. The string literals are removed
from the resulting binary, which dramatically reduces the binary size.
Like any printf-style string, binary versions of the strings can be
formatted with arguments and then transmitted or stored.
The pw_tokenizer module is general purpose, but its most common use case
is binary logging. In binary logging, human-readable text logs are
replaced with binary tokens. These are decoded off-device.
This commit includes the C and C++ code for tokenizing strings. It also
includes a C++ library for decoding tokenized strings.
Change-Id: I6d5737ab2d6dfdd76dcf70c852b547fdcd68d683
diff --git a/pw_tokenizer/docs.rst b/pw_tokenizer/docs.rst
new file mode 100644
index 0000000..71abfeb
--- /dev/null
+++ b/pw_tokenizer/docs.rst
@@ -0,0 +1,499 @@
+.. _chapter-tokenizer:
+
+.. default-domain:: cpp
+
+.. highlight:: sh
+
+---------
+Tokenizer
+---------
+The tokenizer module provides facilities for converting strings to binary
+tokens. String literals are replaced with integer tokens in the firmware image,
+which can be decoded off-device to restore the original string. Strings may be
+printf-style format strings and include arguments, such as ``"My name is %s"``.
+Arguments are encoded into compact binary form at runtime.
+
+.. note::
+ This usage of the term "tokenizer" is not related to parsing! The
+ module is called tokenizer because it replaces a whole string literal with an
+ integer token. It does not parse strings into separate tokens.
+
+The most common application of the tokenizer module is binary logging, and it
+is designed to integrate easily into existing logging systems. However, the
+tokenizer is general purpose and can be used to tokenize any string.
+
+**Why tokenize strings?**
+
+ * Dramatically reduce binary size by removing string literals from binaries.
+ * Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization
+   code.
+ * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
+   tokens instead of strings.
+ * Remove potentially sensitive log, assert, and other strings from binaries.
+
+Basic operation
+===============
+ 1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
+ 2. The tokenization macro removes the string literal by placing it in an ELF
+    section that is excluded from the final binary.
+ 3. Strings are extracted from an ELF to build a database of tokenized strings
+    for use by the detokenizer. The ELF file may also be used directly.
+ 4. During operation, the device encodes the string token and its arguments, if
+    any.
+ 5. The encoded tokenized strings are sent off-device or stored.
+ 6. Off-device, the detokenizer tools use the token database or ELF files to
+    detokenize the strings to human-readable form.
+
+Module usage
+============
+There are two sides to tokenization: tokenizing strings in the source code and
+detokenizing these strings back to human-readable form.
+
+Tokenization
+------------
+Tokenization converts a string literal to a token. If it's a printf-style
+string, its arguments are encoded along with it. The results of tokenization
+can be sent off-device or stored in place of a full string.
+
+Adding tokenization to a project is simple. To tokenize a string, include
+``pw_tokenizer/tokenize.h`` and invoke a ``PW_TOKENIZE_`` macro.
+
+To tokenize a string literal, invoke ``PW_TOKENIZE_STRING``. This macro returns
+a ``uint32_t`` token.
+
+.. code-block:: cpp
+
+ constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
+
+Format strings are tokenized into a fixed-size buffer. The buffer contains the
+``uint32_t`` token followed by the encoded form of the arguments, if any. The
+most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes to
+a caller-provided buffer.
+
+.. code-block:: cpp
+
+ uint8_t buffer[BUFFER_SIZE];
+ size_t size_bytes = sizeof(buffer);
+ PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, args...);
+
+While ``PW_TOKENIZE_TO_BUFFER`` is flexible, it has the largest per-use code
+size overhead of the macros. ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer
+on the stack and calls a ``void(const uint8_t* buffer, size_t buffer_size)``
+callback that is provided at the call site. The size of the buffer is set with
+``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
+
+.. code-block:: cpp
+
+ PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arg);
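+
+For illustration, a handler might simply forward the encoded message to the
+project's transport. ``TransmitBytes`` below is a hypothetical stand-in for
+whatever the project uses to send bytes.
+
+.. code-block:: cpp
+
+  // Hypothetical transport function; substitute the project's own.
+  void TransmitBytes(const uint8_t* data, size_t size);
+
+  // Matches the callback signature expected by PW_TOKENIZE_TO_CALLBACK.
+  void HandlerFunction(const uint8_t* buffer, size_t buffer_size) {
+    TransmitBytes(buffer, buffer_size);
+  }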
+
+``PW_TOKENIZE_TO_GLOBAL_HANDLER`` has the smallest per-use code size overhead,
+since it takes the fewest arguments. Like the callback form, it encodes to a
+buffer on the stack. It then calls the C-linkage function
+``pw_TokenizerHandleEncodedMessage``, which must be defined by the project.
+
+.. code-block:: cpp
+
+  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);
+
+  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
+                                        size_t size_bytes);
+
+``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
+``void*`` argument to the global handler function. This can be used to pass a
+log level or other metadata along with the tokenized string.
+
+.. code-block:: cpp
+
+  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
+                                             format_string_literal,
+                                             args...);
+
+  void pw_TokenizerHandleEncodedMessageWithPayload(void* payload,
+                                                   const uint8_t encoded_message[],
+                                                   size_t size_bytes);
+
+.. tip::
+ ``%s`` arguments are inefficient to encode and can quickly fill a tokenization
+ buffer. Keep ``%s`` arguments short or avoid encoding them as strings if
+ possible. See `Tokenized strings as %s arguments`_.
+
+Example: binary logging
+^^^^^^^^^^^^^^^^^^^^^^^
+String tokenization is perfect for logging. Consider the following log macro,
+which gathers the file, line number, and log message. It calls the ``RecordLog``
+function, which formats the log string, collects a timestamp, and transmits the
+result.
+
+.. code-block:: cpp
+
+  #define LOG_INFO(format, ...) \
+      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)
+
+  void RecordLog(LogLevel level, const char* file, int line, const char* format,
+                 ...) {
+    if (level < current_log_level) {
+      return;
+    }
+
+    char buffer[kMaxLogSize];  // kMaxLogSize: hypothetical project-defined size
+    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);
+
+    va_list args;
+    va_start(args, format);
+    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
+    va_end(args);
+
+    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
+  }
+
+It is trivial to convert this to a binary log using the tokenizer. The
+``RecordLog`` call is replaced with a
+``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
+``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
+timestamp and transmits the message with ``TransmitLog``.
+
+.. code-block:: cpp
+
+  #define LOG_INFO(format, ...)                   \
+      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
+          (void*)LogLevel_INFO,                   \
+          __FILE_NAME__ ":%d " format,            \
+          __LINE__,                               \
+          ##__VA_ARGS__)
+
+  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
+      void* level, const uint8_t encoded_message[], size_t size_bytes) {
+    // The log level was passed through the void* payload argument.
+    if (static_cast<LogLevel>(reinterpret_cast<intptr_t>(level)) >=
+        current_log_level) {
+      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
+    }
+  }
+
+Note that the ``__FILE_NAME__`` string is directly included in the log format
+string. Since the string is tokenized, this has no effect on binary size. A
+``%d`` for the line number is added to the format string, so that changing the
+line of the log message does not generate a new token. There is no overhead for
+additional tokens, but it may not be desirable to fill a token database with
+duplicate log lines.
+
+Database management
+^^^^^^^^^^^^^^^^^^^
+Token databases store a mapping of tokens to the strings they represent. An ELF
+file can be used as a token database, but it only contains the strings for its
+exact build. A token database file aggregates tokens from multiple ELF files, so
+that a single database can decode tokenized strings from any known ELF.
+
+Creating and maintaining a token database is simple. Token databases are managed
+with the ``database.py`` script. The ``create`` command makes a new token
+database from ELF files or other databases.
+
+.. code-block:: sh
+
+ ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
+
+Two database formats are supported: CSV and binary. Provide ``--type binary`` to
+``create`` to generate a binary database instead of the default CSV. CSV
+databases are great for checking into source control or for human review.
+Binary databases are more compact and simpler to parse. The C++ detokenizer
+library currently only supports binary databases.
+
+As new tokenized strings are added, update the database with the ``add``
+command.
+
+.. code-block:: sh
+
+ ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
+
+A CSV token database can be checked into a source repository and updated as code
+changes are made. The build system can invoke ``database.py`` to update the
+database after each build.
+
+Detokenization
+--------------
+Detokenization is the process of expanding a token to the string it represents
+and decoding its arguments. This module provides Python and C++ detokenization
+libraries.
+
+Python
+^^^^^^
+To detokenize in Python, import the ``pw_tokenizer`` package and instantiate a
+``Detokenizer`` with paths to token databases or ELF files.
+
+.. code-block:: python
+
+  import pw_tokenizer
+
+  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')
+
+  def process_log_message(self, log_message):
+      result = detokenizer.detokenize(log_message.payload)
+      self._log(str(result))
+
+The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
+class, which can be used in place of the standard ``Detokenizer``. This class
+monitors database files for changes and automatically reloads them when they
+change. This is helpful for long-running tools that use detokenization.
+
+C++
+^^^
+The C++ detokenization libraries can be used in C++ or any language that can
+call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
+Android Java JNI implementation is provided.
+
+The C++ detokenization library uses binary-format token databases (created with
+``--type binary``). Read a binary format database from a file or include it in
+the source code. Pass the database array to ``TokenDatabase::Create``, and
+construct a detokenizer.
+
+.. code-block:: cpp
+
+  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));
+
+  std::string ProcessLog(span<uint8_t> log_data) {
+    return detokenizer.Detokenize(log_data).BestString();
+  }
+
+The ``TokenDatabase`` class verifies that its data is valid before using it. If
+the data is invalid, ``TokenDatabase::Create`` returns an empty database for
+which ``ok()`` returns false. If the token database is included in the source
+code, this check can be done at compile time.
+
+.. code-block:: cpp
+
+  // This line fails to compile with a static_assert if the database is invalid.
+  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();
+
+  Detokenizer OpenDatabase(std::string_view path) {
+    std::vector<uint8_t> data = ReadWholeFile(path);
+
+    TokenDatabase database = TokenDatabase::Create(data);
+
+    // This checks if the file contained a valid database. It is safe to use a
+    // TokenDatabase that failed to load (it will be empty), but it may be
+    // desirable to provide a default database or otherwise handle the error.
+    if (database.ok()) {
+      return Detokenizer(database);
+    }
+    return Detokenizer(kDefaultDatabase);
+  }
+
+Token generation: fixed length hashing at compile time
+======================================================
+String tokens are generated using a modified version of the x65599 hash used by
+the SDBM project. All hashing is done at compile time.
+
+In C code, strings are hashed with a preprocessor macro. For compatibility with
+macros, the hash must be limited to a fixed maximum number of characters. This
+value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.
+
+Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for
+C due to the complexity of the hashing macros. In C++, the tokenization macros
+use a constexpr hash function instead, so the impact on compilation time is
+minimal. Projects primarily in C++ should use a large value for
+``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
+``std::numeric_limits<size_t>::max()``).
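+
+For illustration, a plain (unmodified) x65599 hash over a fixed maximum number
+of characters might look like the sketch below. pw_tokenizer's actual hash is a
+modified variant, so this sketch demonstrates the approach rather than
+reproducing exact token values.
+
+.. code-block:: cpp
+
+  #include <cstddef>
+  #include <cstdint>
+
+  // SDBM-style x65599 hash, limited to max_length characters so that the C
+  // macro implementation can unroll to a fixed number of steps.
+  constexpr uint32_t HashString(const char* str, size_t max_length) {
+    uint32_t hash = 0;
+    for (size_t i = 0; i < max_length && str[i] != '\0'; ++i) {
+      hash = hash * 65599u + static_cast<uint8_t>(str[i]);
+    }
+    return hash;
+  }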
+
+Base64 format
+=============
+The tokenizer defaults to a compact binary representation of tokenized messages.
+Applications may desire a textual representation of tokenized strings. This
+makes it easy to use tokenized messages alongside plain text messages, but comes
+at an efficiency cost.
+
+The tokenizer module supports prefixed Base64-encoded messages: a single
+character ($) followed by the Base64-encoded message. For example, the token
+0xabcdef01 followed by the argument 0x05 would be encoded as ``01 ef cd ab 05``
+in binary and ``$Ae/NqwU=`` in Base64.
+
+Base64 decoding is supported in the Python detokenizer through the
+``detokenize_base64`` and related functions. Base64 encoding and decoding are
+not yet supported in C++, but it is straightforward to add Base64 encoding with
+any Base64 library.
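+
+For example, a minimal prefixed-Base64 encoder might look like the following
+sketch. Any Base64 library works; this hand-rolled encoder is for illustration
+only.
+
+.. code-block:: cpp
+
+  #include <cstddef>
+  #include <cstdint>
+  #include <string>
+
+  // Encodes an already-tokenized message as prefixed Base64: a "$" followed
+  // by standard Base64 with '=' padding.
+  std::string PrefixedBase64(const uint8_t* data, size_t size) {
+    static constexpr char kChars[] =
+        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+    std::string out = "$";
+    for (size_t i = 0; i < size; i += 3) {
+      uint32_t group = uint32_t{data[i]} << 16;
+      if (i + 1 < size) { group |= uint32_t{data[i + 1]} << 8; }
+      if (i + 2 < size) { group |= data[i + 2]; }
+      out += kChars[(group >> 18) & 0x3F];
+      out += kChars[(group >> 12) & 0x3F];
+      out += (i + 1 < size) ? kChars[(group >> 6) & 0x3F] : '=';
+      out += (i + 2 < size) ? kChars[group & 0x3F] : '=';
+    }
+    return out;
+  }
+
+Encoding the example message above (``01 ef cd ab 05``) with this function
+yields ``$Ae/NqwU=``.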
+
+.. tip::
+ The detokenization tools support recursive detokenization for prefixed Base64
+ text. Tokenized strings found in detokenized text are detokenized, so
+ prefixed Base64 messages can be passed as ``%s`` arguments.
+
+ For example, the message ``"$d4ffJaRn"`` might be passed as the argument to a
+ ``"Nested message: %s"`` string. The detokenizer would decode the message in
+ two steps:
+
+ ::
+
+ "$alRZyuk2J3v=" → "Nested message: $d4ffJaRn" → "Nested message: Wow!"
+
+War story: deploying tokenized logging to an existing product
+=============================================================
+The tokenizer module was developed to bring tokenized logging to an
+in-development product. The product is complex, with several interacting
+microcontrollers. It already had an established text-based logging system.
+Deploying tokenization was straightforward and had substantial benefits.
+
+**Results**
+
+ * Log contents shrank by over 50%, even with Base64 encoding.
+
+   * Significant size savings for encoded logs, even using the less-efficient
+     Base64 encoding required for compatibility with the existing log system.
+   * Freed valuable communication bandwidth.
+   * Allowed storing many more logs in crash dumps.
+
+ * Substantial flash savings.
+
+   * Reduced the size of 115 KB and 172 KB firmware images by over 20 KB each.
+   * Shaved over 100 KB from a large 2 MB image.
+
+ * Simpler logging code.
+
+   * Removed CPU-heavy ``snprintf`` calls.
+   * Removed complex code for forwarding log arguments to a low-priority task.
+
+This section describes the tokenizer deployment process and highlights key
+insights.
+
+Firmware deployment
+-------------------
+ * In the project's logging macro, calls to the underlying logging function
+   were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
+   invocation.
+ * The log level was passed as the payload argument to facilitate runtime log
+   level control.
+ * For this project, it was necessary to encode the log messages as text. In
+   ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
+   encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
+   messages.
+ * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.
+
+.. attention::
+ Do not encode line numbers in tokenized strings. This results in a huge
+ number of lines being added to the database, since every time code moves,
+ new strings are tokenized. If line numbers are desired in a tokenized
+ string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.
+
+Database management
+-------------------
+ * The token database was stored as a CSV file in the project's Git repo.
+ * The token database was automatically updated as part of the build, and
+   developers were expected to check in the database changes alongside their
+   code changes.
+ * A presubmit check verified that all strings added by a change were added to
+   the token database.
+ * The token database included logs and asserts for all firmware images in the
+   project.
+ * No strings were purged from the token database.
+
+.. tip::
+ Merge conflicts may be a frequent occurrence with an in-source database. If
+ the database is in-source, make sure there is a simple script to resolve any
+ merge conflicts. The script could either keep both sets of lines or discard
+ local changes and regenerate the database.
+
+Decoding tooling deployment
+---------------------------
+ * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:
+
+   * Product-specific Python command line tools, using
+     ``pw_tokenizer.Detokenizer``.
+   * A standalone script for decoding prefixed Base64 tokens in files or
+     live output (e.g. from ``adb``), using ``detokenize.py``'s command line
+     interface.
+
+ * The C++ detokenizer library was deployed to two Android apps with a Java
+   Native Interface (JNI) layer.
+
+   * The binary token database was included as a raw resource in the APK.
+   * In one app, the built-in token database could be overridden by copying a
+     file to the phone.
+
+.. tip:: Make the tokenized logging tools simple to use.
+
+  * Provide simple wrapper shell scripts that fill in arguments for the
+    project. For example, point ``detokenize.py`` to the project's token
+    databases.
+  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
+    continuously-running tools, so that users don't have to restart the tool
+    when the token database updates.
+  * Integrate detokenization everywhere it is needed. Integrating the tools
+    takes just a few lines of code, and token databases can be embedded in
+    APKs or binaries.
+
+Limitations and future work
+===========================
+
+GCC bug: tokenization in template functions
+-------------------------------------------
+GCC incorrectly ignores the section attribute for template
+`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
+`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
+bug, tokenized strings in template functions may be emitted into ``.rodata``
+instead of the special tokenized string section. This causes two problems:
+
+ 1. Tokenized strings will not be discovered by the token database tools.
+ 2. Tokenized strings may not be removed from the final binary.
+
+clang does **not** have this issue! Use clang if you can.
+
+It is possible to work around this bug in GCC. One approach would be to tag
+format strings so that the database tools can find them in ``.rodata``. Then, to
+remove the strings, compile two binaries: one metadata binary with all tokenized
+strings and a second, final binary that removes the strings. The strings could
+be removed by providing the appropriate linker flags or by removing the ``used``
+attribute from the tokenized string character array declaration.
+
+64-bit tokenization
+-------------------
+The Python and C++ detokenizing libraries currently assume that strings were
+tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
+``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
+device performed the tokenization.
+
+Supporting detokenization of strings tokenized on 64-bit targets would be
+simple. This could be done by adding an option to switch the 32-bit types to
+64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
+ELF section, so the sizes of these types can be verified by checking the ELF
+file, if necessary.
+
+Tokenization in headers
+-----------------------
+Tokenizing code in header files (inline functions or templates) may trigger
+warnings such as ``-Wlto-type-mismatch`` under certain conditions. This is
+because tokenization requires declaring a character array for each tokenized
+string. If the tokenized string includes macros that change value, the size of
+this character array changes, which means the same static variable is defined
+with different sizes. It should be safe to suppress these warnings, but, when
+possible, code that tokenizes strings with macros that can change value should
+be moved to source files rather than headers.
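+
+For example, a header like the following sketch could trigger the warning if
+``MODULE_NAME`` expands to a different string in different translation units.
+``MODULE_NAME`` and ``LogModuleStartup`` are hypothetical names used only for
+illustration.
+
+.. code-block:: cpp
+
+  // module_log.h -- hypothetical header, for illustration only.
+  #include "pw_tokenizer/tokenize.h"
+
+  // If MODULE_NAME expands to a different string in each translation unit,
+  // the character array declared for this tokenized string changes size,
+  // which may produce -Wlto-type-mismatch warnings at link time.
+  inline void LogModuleStartup() {
+    PW_TOKENIZE_TO_GLOBAL_HANDLER("Starting module: " MODULE_NAME);
+  }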
+
+Tokenized strings as ``%s`` arguments
+-------------------------------------
+Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
+encoded 1:1, with no tokenization. It would be better to send a tokenized string
+literal as an integer instead of a string argument, but this is not yet
+supported.
+
+A string token could be sent by marking an integer argument in a way that the
+detokenization tools recognize. The detokenizer would expand the argument to
+the string represented by the integer.
+
+.. code-block:: cpp
+
+ #define PW_TOKEN_ARG "TOKEN<([\\%" PRIx32 "/])>END_TOKEN"
+
+ constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");
+
+ PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: " PW_TOKEN_ARG "?", answer_token);
+
+Strings with arguments could be encoded to a buffer, but since printf strings
+are null-terminated, a binary encoding would not work. Such strings could
+instead be encoded as prefixed Base64 and sent as ``%s``. See `Base64 format`_.
+
+Another possibility: encode strings with arguments to a ``uint64_t`` and send
+them as an integer. This would be efficient and simple, but would only support
+a small number of arguments.
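+
+For illustration, such a packing scheme might look like the following sketch.
+The layout (the 32-bit token in the low bits, one 32-bit argument in the high
+bits) is purely hypothetical; a real scheme would need a defined layout shared
+with the detokenizer.
+
+.. code-block:: cpp
+
+  #include <cstdint>
+
+  // Hypothetical packing: 32-bit token in the low bits, one 32-bit argument
+  // in the high bits.
+  constexpr uint64_t PackTokenizedMessage(uint32_t token, uint32_t argument) {
+    return (uint64_t{argument} << 32) | token;
+  }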
+
+Compatibility
+=============
+ * C11
+ * C++17
+ * Python 3
+
+Dependencies
+============
+ * pw_varint module
+ * pw_preprocessor module
+ * pw_span module