| .. _module-pw_snapshot-setup: |
| |
| ============================== |
| Setting up a Snapshot Pipeline |
| ============================== |
| |
| .. contents:: Table of Contents |
| |
| ------------------- |
| Crash Handler Setup |
| ------------------- |
| The Snapshot proto was designed first and foremost as a crash reporting format. |
| This section covers how to set up a crash handler to capture Snapshots. |
| |
| .. image:: images/generic_crash_flow.svg |
|   :width: 600 |
|   :alt: Generic crash handler flow |
| |
| A typical crash handler has two entry points: |
| |
| 1. A software entry path through developer-written ASSERT() or CHECK() calls |
|    that take the device down for a crash if a condition is not met. |
| 2. A hardware-triggered exception handler path that is initiated when a CPU |
|    encounters a fault signal (invalid memory access, bad instruction, etc.). |
| |
| Before deferring to a common crash handler, these entry paths should disable |
| interrupts to force the system into a single-threaded execution mode. This |
| prevents other threads from operating on potentially bad data or clobbering |
| system state that could be useful for debugging. |
| |
| The first step in a crash handler should always be a check for nested crashes to |
| prevent infinitely recursive crashes. Once it's deemed safe to continue, |
| the crash handler can re-initialize logging, initialize storage for crash report |
| capture, and then build a snapshot to later be retrieved from the device. Once |
| the crash report collection process is complete, some post-crash callbacks can |
| be run on a best-effort basis to clean up the system before rebooting. For |
| devices with debug port access, it's helpful to optionally hold the device in |
| an infinite loop rather than resetting to allow developers to access the device |
| via a hardware debugger. |
| |
| Assert Handler Setup |
| ==================== |
| :ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software |
| crashes. Route any existing assert functions through pw_assert to centralize the |
| software crash path. You’ll need to create a :ref:`pw_assert backend |
| <module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler |
| <module-pw_assert_basic-custom_handler>` to pass collected information to a more |
| sophisticated crash handler. One way to do this is to collect the data into a |
| statically allocated struct that is passed to a common crash handler. It’s |
| important to immediately disable interrupts to prevent the system from doing |
| other things while in an impacted state. |
| |
| .. code-block:: cpp |
| |
|   // This can be directly accessed by a crash handler. |
|   static CrashData crash_data; |
|   extern "C" void pw_assert_basic_HandleFailure(const char* file_name, |
|                                                 int line_number, |
|                                                 const char* format, |
|                                                 ...) { |
|     // Always disable interrupts first! How this is done depends |
|     // on your platform. |
|     __disable_irq(); |
| |
|     va_list args; |
|     va_start(args, format); |
|     crash_data.file_name = file_name; |
|     crash_data.line_number = line_number; |
|     crash_data.reason_fmt = format; |
|     crash_data.reason_args = &args; |
|     crash_data.cpu_state = nullptr; |
| |
|     HandleCrash(crash_data); |
|     PW_UNREACHABLE; |
|   } |
| |
| Exception Handler Setup |
| ======================= |
| :ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry |
| point for CPU-triggered faults (divide by zero, invalid memory access, etc.). |
| You will need to provide a definition for pw_cpu_exception_DefaultHandler() that |
| passes the exception state produced by pw_cpu_exception to your common crash |
| handler. |
| |
| .. code-block:: cpp |
| |
|   static CrashData crash_data; |
| |
|   // This helper turns a format string into a va_list that can be used by the |
|   // common crash handling path. |
|   void HandleExceptionWithString(pw_cpu_exception_State& state, |
|                                  const char* fmt, |
|                                  ...) { |
|     va_list args; |
|     va_start(args, fmt); |
|     // CrashData stores a pointer, so take the address of the state. |
|     crash_data.cpu_state = &state; |
|     crash_data.file_name = nullptr; |
|     crash_data.line_number = 0; |
|     crash_data.reason_fmt = fmt; |
|     crash_data.reason_args = &args; |
| |
|     HandleCrash(crash_data); |
|     PW_UNREACHABLE; |
|   } |
| |
|   extern "C" void pw_cpu_exception_DefaultHandler( |
|       pw_cpu_exception_State* state) { |
|     // Always disable interrupts first! How this is done depends |
|     // on your platform. |
|     __disable_irq(); |
| |
|     // The CFSR is an extremely useful register for understanding ARMv7-M and |
|     // ARMv8-M CPU faults. Other architectures should put something else here. |
|     HandleExceptionWithString(*state, |
|                               "Exception encountered, cfsr=0x%08lx", |
|                               static_cast<unsigned long>(state->extended.cfsr)); |
|   } |
| |
| Common Crash Handler Setup |
| ========================== |
| To minimize duplication of crash handling logic, it's good practice to route the |
| pw_assert and pw_cpu_exception handlers to a common crash handling codepath. |
| Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert |
| information to the shared handler. |
| |
| .. code-block:: cpp |
| |
|   struct CrashData { |
|     pw_cpu_exception_State* cpu_state; |
|     const char* reason_fmt; |
|     const va_list* reason_args; |
|     const char* file_name; |
|     int line_number; |
|   }; |
| |
|   // This function assumes interrupts are properly disabled BEFORE it is called. |
|   [[noreturn]] void HandleCrash(CrashData& crash_info) { |
|     // Handle crash |
|   } |
| |
| In the crash handler, your project can re-initialize a minimal subset of the |
| system needed to safely capture a snapshot before rebooting the device. The |
| remainder of this section focuses on ways you can improve the reliability and |
| usability of your project's crash handler. |
| |
| Check for Nested Crashes |
| ------------------------ |
| It’s important to include crash handler checks that prevent infinitely recursive |
| nesting of crashes. Maintain a static variable that tracks the crash nesting |
| depth. After one or two nested crashes, abort crash handling entirely and reset |
| the device or sit in an infinite loop to wait for a hardware debugger to attach. |
| It’s simpler to put this logic at the beginning of the shared crash handler, but |
| if your assert/exception handlers are complex it might be safer to inject the |
| checks earlier in both codepaths. |
| |
| .. code-block:: cpp |
| |
|   [[noreturn]] void HandleCrash(CrashData& crash_info) { |
|     static size_t crash_depth = 0; |
|     if (crash_depth > kMaxCrashDepth) { |
|       Abort(/*run_callbacks=*/false); |
|     } |
|     crash_depth++; |
|     ... |
|   } |
| |
| Re-initialize Logging (Optional) |
| -------------------------------- |
| Logging can be helpful for debugging your crash handler, but depending on your |
| device/system design may be challenging to safely support at crash time. To |
| re-initialize logging, you’ll need to re-construct C++ objects and re-initialize |
| any systems/hardware in the logging codepath. You may even need an entirely |
| separate logging pipeline that is single-threaded and interrupt-safe. Depending |
| on your system’s design, this may be difficult to set up. |
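| |
As a rough illustration of a separate crash-time logging path, the sketch below appends messages to a fixed buffer in static storage without locks, heap allocation, or constructors. ``CrashLog``, ``crash_log``, and ``CrashLogWrite()`` are hypothetical names invented for this example, not Pigweed APIs, and the in-memory buffer is only a stand-in for whatever polled output (for example, a blocking UART write) your platform offers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical crash-time log sink. It is plain data with no constructor, so
// it is usable even if C++ static constructors never ran.
struct CrashLog {
  char buffer[128];
  size_t used;
};

// Static storage is zero-initialized, so `used` starts at 0 and the buffer
// starts out as an empty, NUL-terminated string.
static CrashLog crash_log;

// Appends a message to the crash log. This assumes interrupts are already
// disabled, so no locking is required. Text that does not fit is dropped, and
// the buffer always stays NUL-terminated.
void CrashLogWrite(const char* message) {
  while (*message != '\0' && crash_log.used + 1 < sizeof(crash_log.buffer)) {
    crash_log.buffer[crash_log.used++] = *message++;
  }
  crash_log.buffer[crash_log.used] = '\0';
}
```

A crash handler could call ``CrashLogWrite("crash: ")`` followed by a short reason string, then store or transmit the buffer contents once storage is available. Dropping overflow rather than blocking keeps the handler from stalling on a full buffer.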
| |
| Reinitialize Dependencies |
| ------------------------- |
| It's good practice to design a crash handler that can run before C++ static |
| constructors have run. This means any initialization (whether manual or through |
| constructors) that your crash handler depends on should be manually invoked at |
| crash time. If an initialization step might not be safe, evaluate if it's |
| possible to omit the dependency. |
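| |
A minimal sketch of this idea, assuming a hypothetical UART driver whose state is plain data: the crash handler calls an explicit initialization function instead of trusting that a constructor already ran. ``CrashUart``, ``InitCrashUart()``, and ``ReinitializeCrashDependencies()`` are illustrative names only, and the baud rate is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical driver state kept as plain data.
struct CrashUart {
  uint32_t baud_rate;
  uint32_t init_count;  // Only here to make the sketch observable.
};

// Guard against someone later adding a constructor that the crash path would
// silently depend on.
static_assert(std::is_trivially_default_constructible<CrashUart>::value,
              "CrashUart must not rely on a constructor");

// Objects with static storage duration are zero-initialized before any
// dynamic initialization takes place.
static CrashUart crash_uart;

// Explicit initialization that is safe to call more than once.
void InitCrashUart(CrashUart& uart) {
  uart.baud_rate = 115200;  // Placeholder value for illustration.
  uart.init_count++;
}

// Invoked manually near the top of the crash handler, regardless of whether
// normal startup already initialized these dependencies.
void ReinitializeCrashDependencies() {
  InitCrashUart(crash_uart);
}
```

Keeping every crash-time dependency behind one function such as ``ReinitializeCrashDependencies()`` makes it easier to audit which initialization steps the crash handler relies on, and to drop any that turn out to be unsafe.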
| |
| System Cleanup |
| -------------- |
| After collecting a snapshot, some parts of your system may benefit from some |
| cleanup before explicitly resetting a device. This might include flushing |
| buffers or safely shutting down attached hardware. The order of shutdown should |
| be deterministic, keeping in mind that any of these steps could cause a nested |
| crash that skips the remainder of the handlers and forces the device to reset |
| immediately. |
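| |
One possible way to keep the shutdown order deterministic is a fixed-size table of callbacks that always run in registration order. The sketch below is an assumption about how a project might structure this, not an existing Pigweed facility; ``CleanupHandler``, ``RegisterCleanupHandler()``, and ``RunCleanupHandlers()`` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical post-crash cleanup callback type.
using CleanupHandler = void (*)();

constexpr size_t kMaxCleanupHandlers = 4;

// Fixed-size table in static storage: no heap use and no constructors.
static CleanupHandler cleanup_handlers[kMaxCleanupHandlers];
static size_t cleanup_handler_count;

// Registers a handler during normal startup. Returns false if the handler is
// null or the table is full.
bool RegisterCleanupHandler(CleanupHandler handler) {
  if (handler == nullptr || cleanup_handler_count >= kMaxCleanupHandlers) {
    return false;
  }
  cleanup_handlers[cleanup_handler_count++] = handler;
  return true;
}

// Runs every registered handler once, in registration order, and returns how
// many handlers ran. If a handler crashes, the nested-crash check in the
// crash handler is expected to abort without running the rest.
size_t RunCleanupHandlers() {
  size_t handlers_run = 0;
  for (size_t i = 0; i < cleanup_handler_count; ++i) {
    cleanup_handlers[i]();
    ++handlers_run;
  }
  return handlers_run;
}
```

Registering handlers in a single, well-known place (for example, one startup function) keeps the resulting order easy to reason about when reading a crash report.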
| |
| ---------------------- |
| Snapshot Storage Setup |
| ---------------------- |
| Use a storage class with a ``pw::stream::Writer`` interface to simplify |
| capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore |
| <module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a |
| :ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that |
| lives in persistent memory. It's good practice to use lazy initialization for |
| storage objects used by your Snapshot capture codepath. |
| |
| .. code-block:: cpp |
| |
|   // Persistent RAM objects are highly available. They don't rely on |
|   // their constructor being run, and require no initialization. |
|   PW_KEEP_IN_SECTION(".noinit") |
|   pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot; |
| |
|   void CaptureSnapshot(CrashData& crash_info) { |
|     ... |
|     persistent_snapshot.clear(); |
|     PersistentBufferWriter& writer = persistent_snapshot.GetWriter(); |
|     ... |
|   } |
| |
| ---------------------- |
| Snapshot Capture Setup |
| ---------------------- |
| |
| .. note:: |
| |
|   These instructions do not yet use the ``pw::protobuf::StreamingEncoder``. |
| |
| Capturing a snapshot is as simple as encoding any other proto message. Some |
| modules provide helper functions that will populate parts of a Snapshot, which |
| eases the burden of custom work that must be set up uniquely for each project. |
| |
| Capture Reason |
| ============== |
| A snapshot's "reason" should be considered the single most important field in a |
| captured snapshot. If a snapshot capture was triggered by a crash, this should |
| be the assert string. Other entry paths should describe here why the snapshot |
| was captured ("Host communication buffer full!", "Exception encountered at |
| 0x00000004", etc.). |
| |
| .. code-block:: cpp |
| |
|   Status CaptureSnapshot(CrashData& crash_info) { |
|     // Temporary buffer for encoding "reason" to. |
|     static std::byte temp_buffer[500]; |
|     // Temporary buffer to encode serialized proto to before dumping to the |
|     // final ``pw::stream::Writer``. |
|     static std::byte proto_encode_buffer[512]; |
|     ... |
|     pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer); |
|     pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder); |
|     // reason_args points to a va_list, so format with vsnprintf() rather |
|     // than snprintf(). |
|     int length = vsnprintf(reinterpret_cast<char*>(temp_buffer), |
|                            sizeof(temp_buffer), |
|                            crash_info.reason_fmt, |
|                            *crash_info.reason_args); |
|     // vsnprintf() returns a negative value on error and the untruncated |
|     // length when the output does not fit, so clamp before writing. |
|     size_t reason_size = 0; |
|     if (length > 0) { |
|       reason_size = |
|           std::min(static_cast<size_t>(length), sizeof(temp_buffer) - 1); |
|     } |
|     snapshot_encoder.WriteReason(temp_buffer, reason_size); |
| |
|     // Final encode and write. |
|     Result<ConstByteSpan> encoded_proto = proto_encoder.Encode(); |
|     PW_TRY(encoded_proto.status()); |
|     PW_TRY(writer.Write(encoded_proto.value())); |
|     ... |
|   } |
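| |
The formatting step is easy to get subtly wrong, so it may help to isolate it in a small helper that can be tested off-device. The sketch below uses only the standard ``vsnprintf()``; ``FormatReason()`` and ``FormatReasonF()`` are hypothetical helpers written for this example and are not part of pw_snapshot.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: formats a crash reason into `buffer` and returns the
// number of characters stored, excluding the NUL terminator. Output that does
// not fit is truncated.
size_t FormatReason(char* buffer, size_t size, const char* fmt, va_list args) {
  if (size == 0) {
    return 0;
  }
  int result = vsnprintf(buffer, size, fmt, args);
  if (result < 0) {
    // Encoding error: report an empty reason rather than garbage.
    buffer[0] = '\0';
    return 0;
  }
  // On truncation vsnprintf() returns the length the full output would have
  // had, so clamp to what was actually stored.
  size_t length = static_cast<size_t>(result);
  return length < size ? length : size - 1;
}

// Variadic convenience wrapper, mainly useful for host-side tests.
size_t FormatReasonF(char* buffer, size_t size, const char* fmt, ...) {
  va_list args;
  va_start(args, fmt);
  size_t length = FormatReason(buffer, size, fmt, args);
  va_end(args);
  return length;
}
```

Returning the clamped length means the caller can pass it straight to the encoder without re-checking for truncation or negative return values.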
| |
| Capture CPU State |
| ================= |
| When using pw_cpu_exception, exceptions will automatically collect CPU state |
| that can be directly dumped into a snapshot. As it's not always easy to describe |
| a CPU exception in a single "reason" string, the captured state provides the |
| information needed to automatically generate a more descriptive reason at |
| analysis time, once the snapshot is retrieved from the device. |
| |
| .. code-block:: cpp |
| |
|   Status CaptureSnapshot(CrashData& crash_info) { |
|     ... |
| |
|     proto_encoder.clear(); |
| |
|     // Write CPU state. |
|     if (crash_info.cpu_state) { |
|       PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(), |
|                                *crash_info.cpu_state)); |
| |
|       // Final encode and write. |
|       Result<ConstByteSpan> encoded_proto = proto_encoder.Encode(); |
|       PW_TRY(encoded_proto.status()); |
|       PW_TRY(writer.Write(encoded_proto.value())); |
|     } |
|     return OkStatus(); |
|   } |
| |
| ----------------------- |
| Snapshot Transfer Setup |
| ----------------------- |
| Pigweed’s pw_rpc system is well suited for retrieving a snapshot from a device. |
| Pigweed does not yet provide a generalized transfer service for moving files |
| to/from a device. When this feature is added to Pigweed, this section will be |
| updated to include guidance for connecting a storage system to a transfer |
| service. |
| |
| ---------------------- |
| Snapshot Tooling Setup |
| ---------------------- |
| Pigweed will provide Python tooling to dump snapshot protos as human-readable |
| text dumps. This section will be updated as this functionality is introduced. |