.. _chapter-tokenizer:

.. default-domain:: cpp

.. highlight:: sh

------------
pw_tokenizer
------------
The tokenizer module provides facilities for converting strings to binary
tokens. String literals are replaced with integer tokens in the firmware image,
which can be decoded off-device to restore the original string. Strings may be
printf-style format strings and include arguments, such as ``"My name is %s"``.
Arguments are encoded into compact binary form at runtime.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The module is
  called tokenizer because it replaces a whole string literal with an integer
  token. It does not parse strings into separate tokens.

The most common application of the tokenizer module is binary logging, and it
is designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings.

**Why tokenize strings?**

* Dramatically reduce binary size by removing string literals from binaries.
* Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization
  code.
* Reduce I/O traffic, RAM, and flash usage by sending and storing compact
  tokens instead of strings.
* Remove potentially sensitive log, assert, and other strings from binaries.

Basic operation
===============
1. In C or C++ code, strings are hashed to generate a stable 32-bit token.
2. The tokenization macro removes the string literal by placing it in an ELF
   section that is excluded from the final binary.
3. Strings are extracted from an ELF to build a database of tokenized strings
   for use by the detokenizer. The ELF file may also be used directly.
4. During operation, the device encodes the string token and its arguments, if
   any.
5. The encoded tokenized strings are sent off-device or stored.
6. Off-device, the detokenizer tools use the token database or ELF files to
   detokenize the strings to human-readable form.
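
To make the encoding in steps 4 and 5 concrete, the sketch below models the
default encoded form in Python: the 32-bit token in little-endian byte order,
followed by the encoded argument bytes. This is an illustration, not the
encoder itself; on-device, the tokenization macros and ``pw_varint`` produce
these bytes.

.. code-block:: python

  import struct

  # Illustrative model of an encoded tokenized message: the 32-bit token in
  # little-endian byte order, followed by any already-encoded argument bytes.
  def encode_tokenized_message(token: int, encoded_args: bytes = b'') -> bytes:
      return struct.pack('<I', token) + encoded_args

  # Token 0xabcdef01 with the encoded argument byte 0x05; the same example
  # appears in the Base64 format section below.
  assert encode_tokenized_message(0xABCDEF01, b'\x05') == b'\x01\xef\xcd\xab\x05'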

Module usage
============
There are two sides to tokenization: tokenizing strings in the source code and
detokenizing these strings back to human-readable form.

Tokenization
------------
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off-device or stored in place of a full string.

Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke a ``PW_TOKENIZE_`` macro.

To tokenize a string literal, invoke ``PW_TOKENIZE_STRING``. This macro returns
a ``uint32_t`` token.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

Format strings are tokenized into a fixed-size buffer. The buffer contains the
``uint32_t`` token followed by the encoded form of the arguments, if any. The
most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes to
a caller-provided buffer.

.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, args...);

While ``PW_TOKENIZE_TO_BUFFER`` is flexible, its per-use code size overhead is
larger than its alternatives. ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer
on the stack and calls a ``void(const uint8_t* buffer, size_t buffer_size)``
callback that is provided at the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arg);

``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
since it takes the fewest arguments. Like the callback form, it encodes to a
buffer on the stack. It then calls the C-linkage function
``pw_TokenizerHandleEncodedMessage``, which must be defined by the project.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_TokenizerHandleEncodedMessage(const uint8_t encoded_message[],
                                        size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``void*`` argument to the global handler function. This can be used to pass a
log level or other metadata along with the tokenized string.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             args...);

  void pw_TokenizerHandleEncodedMessageWithPayload(
      void* payload, const uint8_t encoded_message[], size_t size_bytes);

.. tip::
  ``%s`` arguments are inefficient to encode and can quickly fill a
  tokenization buffer. Keep ``%s`` arguments short or avoid encoding them as
  strings if possible. See `Tokenized strings as %s arguments`_.

Example: binary logging
^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.

.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[256];  // Buffer size chosen for illustration.
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_TokenizerHandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (void*)LogLevel_INFO,                   \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          __VA_ARGS__)

  extern "C" void pw_TokenizerHandleEncodedMessageWithPayload(
      void* level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(reinterpret_cast<intptr_t>(level)) >=
        current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.

Database management
^^^^^^^^^^^^^^^^^^^
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Creating and maintaining a token database is simple. Token databases are managed
with the ``database.py`` script. The ``create`` command makes a new token
database from ELF files or other databases.

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary`` to
the ``create`` command to generate a binary database instead of the default CSV.
CSV databases are great for checking into source control and for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library currently only supports binary databases.

As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build, as sketched below.
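
The following is a hypothetical post-build step written in Python; the script
location, database path, and ELF path are placeholders to adapt to a real
project.

.. code-block:: python

  import subprocess

  # Hypothetical post-build step: add any new tokens from the freshly built
  # ELF to the checked-in CSV token database. All paths are placeholders.
  subprocess.run(
      ['./database.py', 'add',
       '--database', 'tools/tokenized_strings.csv',
       'out/firmware_image.elf'],
      check=True)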

Detokenization
--------------
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

Python
^^^^^^
To detokenize in Python, import the ``pw_tokenizer`` package and instantiate a
``Detokenizer`` with paths to token databases or ELF files.

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  def process_log_message(log_message):
      result = detokenizer.detokenize(log_message.payload)
      print(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
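
A minimal sketch of swapping in the auto-updating variant follows; it assumes
``AutoUpdatingDetokenizer`` accepts database paths the same way ``Detokenizer``
does.

.. code-block:: python

  import pw_tokenizer

  # Reloads the database if the file changes on disk; assumes the constructor
  # takes database paths like Detokenizer does.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer('path/to/database.csv')

  def process_log_message(log_message):
      print(str(detokenizer.detokenize(log_message.payload)))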

C++
^^^
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Android Java JNI implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``--type binary``). Read a binary-format database from a file or include it in
the source code. Pass the database array to ``TokenDatabase::Create``, and
construct a detokenizer.

.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_HASH_LENGTH``.
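
For illustration, a classic SDBM-style x65599 hash truncated to a fixed number
of characters is sketched below. pw_tokenizer uses a modified variant of this
scheme, so this is not the exact token calculation; it only shows the shape of
the computation that the C macro must unroll.

.. code-block:: python

  # Classic x65599 (SDBM) rolling hash over at most hash_length characters.
  # pw_tokenizer uses a modified variant, so this sketch is illustrative only.
  def x65599_hash(string: str, hash_length: int) -> int:
      hash_value = 0
      for ch in string[:hash_length]:
          hash_value = (hash_value * 65599 + ord(ch)) % 2**32
      return hash_value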

Increasing ``PW_TOKENIZER_CFG_HASH_LENGTH`` increases the compilation time for C
due to the complexity of the hashing macros. In C++, hashing is done with a
``constexpr`` function instead of a macro, so the compilation time impact is
minimal. Projects primarily in C++ should use a large value for
``PW_TOKENIZER_CFG_HASH_LENGTH`` (perhaps even
``std::numeric_limits<size_t>::max()``).

Base64 format
=============
The tokenizer defaults to a compact binary representation of tokenized messages.
Applications may desire a textual representation of tokenized strings. This
makes it easy to use tokenized messages alongside plain text messages, but comes
at an efficiency cost.

The tokenizer module supports prefixed Base64-encoded messages: a single
character (``$``) followed by the Base64-encoded message. For example, the token
0xabcdef01 followed by the argument 0x05 would be encoded as ``01 ef cd ab 05``
in binary and ``$Ae/NqwU=`` in Base64.

Base64 decoding is supported in the Python detokenizer through the
``detokenize_base64`` and related functions. Base64 encoding and decoding are
not yet supported in C++, but it is straightforward to add Base64 encoding with
any Base64 library, as the sketch below shows.
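
For example, producing the prefixed Base64 form is a one-liner in Python; the
sketch below reproduces the ``$Ae/NqwU=`` example from above.

.. code-block:: python

  import base64

  # Prefix a binary tokenized message with '$' and Base64-encode it so it can
  # travel alongside plain text.
  def prefixed_base64(encoded_message: bytes) -> str:
      return '$' + base64.b64encode(encoded_message).decode('ascii')

  # The token 0xabcdef01 followed by the encoded argument byte 0x05:
  assert prefixed_base64(b'\x01\xef\xcd\xab\x05') == '$Ae/NqwU='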

.. tip::
  The detokenization tools support recursive detokenization for prefixed Base64
  text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the message ``"$d4ffJaRn"`` might be passed as the argument to a
  ``"Nested message: %s"`` string. The detokenizer would decode the message in
  two steps:

  ::

    "$alRZyuk2J3v=" → "Nested message: $d4ffJaRn" → "Nested message: Wow!"

War story: deploying tokenized logging to an existing product
=============================================================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product is complex, with several interacting
microcontrollers. It already had an established text-based logging system.
Deploying tokenization was straightforward and had substantial benefits.

**Results**
  * Log contents shrank by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the less-efficient
      Base64 encoding required for compatibility with the existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of 115 KB and 172 KB firmware images by over 20 KB each.
    * Shaved over 100 KB from a large 2 MB image.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function
  were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
  invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  ``pw_TokenizerHandleEncodedMessageWithPayload``, the log messages were
  encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
  messages.
* Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of entries being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their
  code changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. If
  the database is in-source, make sure there is a simple script to resolve any
  merge conflicts. The script could either keep both sets of lines or discard
  local changes and regenerate the database.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip:: Make the tokenized logging tools simple to use.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
--------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

1. Tokenized strings will not be discovered by the token database tools.
2. Tokenized strings may not be removed from the final binary.

clang does **not** have this issue! Use clang if you can.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the ``.tokenizer_info``
ELF section, so the sizes of these types can be verified by checking the ELF
file, if necessary.
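
As a sketch of that check, the section's raw bytes can be read with a tool such
as pyelftools; the ELF path below is a placeholder, and the section's internal
layout is not parsed here.

.. code-block:: python

  from elftools.elf.elffile import ELFFile  # third-party pyelftools

  # Sketch: retrieve the raw .tokenizer_info section, which records the sizes
  # of long, size_t, intptr_t, and ptrdiff_t on the tokenizing target.
  with open('firmware_image.elf', 'rb') as fd:
      section = ELFFile(fd).get_section_by_name('.tokenizer_info')
      print(section.data() if section else 'No .tokenizer_info section found')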

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG "TOKEN<([%" PRIx32 "/])>END_TOKEN"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: " PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but only support a small
number of arguments.

Compatibility
=============
* C11
* C++11
* Python 3

Dependencies
============
* pw_varint module
* pw_preprocessor module
* pw_span module