pw_snapshot: Add snapshot proto

Initial commit for the pw_snapshot module that introduces the proto
format for device snapshots, and documentation that introduces the
pw_snapshot module and how to use it.

Change-Id: I63e12d245073e82de03be995a001a0ee0cc1f443
Reviewed-on: https://pigweed-review.googlesource.com/c/pigweed/pigweed/+/38103
Commit-Queue: Armando Montanez <amontanez@google.com>
Reviewed-by: Ewout van Bekkum <ewout@google.com>
Reviewed-by: Keir Mierle <keir@google.com>
Reviewed-by: David Rogers <davidrogers@google.com>
diff --git a/pw_snapshot/setup.rst b/pw_snapshot/setup.rst
new file mode 100644
index 0000000..f8968f1
--- /dev/null
+++ b/pw_snapshot/setup.rst
@@ -0,0 +1,300 @@
+.. _module-pw_snapshot-setup:
+
+==============================
+Setting up a Snapshot Pipeline
+==============================
+
+.. contents:: Table of Contents
+
+-------------------
+Crash Handler Setup
+-------------------
+The Snapshot proto was designed first and foremost as a crash reporting format.
+This section covers how to set up a crash handler to capture Snapshots.
+
+.. image:: images/generic_crash_flow.svg
+  :width: 600
+  :alt: Generic crash handler flow
+
+A typical crash handler has two entry points:
+
+1. A software entry path through developer-written ASSERT() or CHECK() calls
+   that indicate a device should go down for a crash if a condition is not met.
+2. A hardware-triggered exception handler path that is initiated when a CPU
+   encounters a fault signal (invalid memory access, bad instruction, etc.).
+
+Before deferring to a common crash handler, these entry paths should disable
+interrupts to force the system into a single-threaded execution mode. This
+prevents other threads from operating on potentially bad data or clobbering
+system state that could be useful for debugging.
+
+The first step in a crash handler should always be a check for nested crashes to
+prevent infinitely recursive crashes. Once it's deemed it's safe to continue,
+the crash handler can re-initialize logging, initialize storage for crash report
+capture, and then build a snapshot to later be retrieved from the device. Once
+the crash report collection process is complete, some post-crash callbacks can
+be run on a best-effort basis to clean up the system before rebooting. For
+devices with debug port access, it's helpful to optionally hold the device in
+an infinite loop rather than resetting to allow developers to access the device
+via a hardware debugger.
+
+Assert Handler Setup
+====================
+:ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software
+crashes. Route any existing assert functions through pw_assert to centralize the
+software crash path. You’ll need to create a :ref:`pw_assert backend
+<module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler
+<module-pw_assert_basic-custom_handler>` to pass collected information to a more
+sophisticated crash handler. One way to do this is to collect the data into a
+statically allocated struct that is passed to a common crash handler. It’s
+important to immediately disable interrupts to prevent the system from doing
+other things while in an impacted state.
+
+.. code-block:: cpp
+
+  // This can be be directly accessed by a crash handler
+  static CrashData crash_data;
+  extern "C" void pw_assert_basic_HandleFailure(const char* file_name,
+                                                int line_number,
+                                                const char* format,
+                                                ...) {
+    // Always disable interrupts first! How this is done depends
+    // on your platform.
+    __disable_irq();
+
+    va_list args;
+    va_start(args, format);
+    crash_data.file_name = file_name;
+    crash_data.line_number = line_number;
+    crash_data.reason_fmt = format;
+    crash_data.reason_args = &args;
+    crash_data.cpu_state = nullptr;
+
+    HandleCrash(crash_data);
+    PW_UNREACHABLE;
+  }
+
+Exception Handler Setup
+=======================
+:ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry
+point for CPU-triggered faults (divide by zero, invalid memory access, etc.).
+You will need to provide a definition for pw_cpu_exception_DefaultHandler() that
+passes the exception state produced by pw_cpu_exception to your common crash
+handler.
+
+.. code-block:: cpp
+
+  static CrashData crash_data;
+  // This helper turns a format string to a va_list that can be used by the
+  // common crash handling path.
+  void HandleExceptionWithString(pw_cpu_exception_State& state,
+                                 const char* fmt,
+                                 ...) {
+    va_list args;
+    va_start(args, fmt);
+    crash_data.cpu_state = state;
+    crash_data.file_name = nullptr;
+    crash_data.reason_fmt = fmt;
+    crash_data.reason_args = &args;
+
+    HandleCrash(crash_data);
+    PW_UNREACHABLE;
+  }
+
+  extern "C" void pw_cpu_exception_DefaultHandler(
+      pw_cpu_exception_State* state) {
+    // Always disable interrupts first! How this is done depends
+    // on your platform.
+    __disable_irq();
+
+    crash_data.state = cpu_state;
+    // The CFSR is an extremely useful register for understanding ARMv7-M and
+    // ARMv8-M CPU faults. Other architectures should put something else here.
+    HandleExceptionWithString(crash_data,
+                              "Exception encountered, cfsr=0x%",
+                              cpu_state->extended.cfsr);
+  }
+
+Common Crash Handler Setup
+==========================
+To minimize duplication of crash handling logic, it's good practice to route the
+pw_assert and pw_cpu_exception handlers to a common crash handling codepath.
+Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert
+information to the shared handler.
+
+.. code-block:: cpp
+
+  struct CrashData {
+    pw_cpu_exception_State *cpu_state;
+    const char *reason_fmt;
+    const va_list *reason_args;
+    const char *file_name;
+    int line_number;
+  };
+
+  // This function assumes interrupts are properly disabled BEFORE it is called.
+  [[noreturn]] void HandleCrash(CrashData& crash_info) {
+    // Handle crash
+  }
+
+In the crash handler your project can re-initialize a minimal subset of the
+system needed to safely capture a snapshot before rebooting the device. The
+remainder of this section focuses on ways you can improve the reliability and
+usability of your project's crash handler.
+
+Check for Nested Crashes
+------------------------
+It’s important to include crash handler checks that prevent infinite recursive
+nesting of crashes. Maintain a static variable that checks the crash nesting
+depth. After one or two nested crashes, abort crash handling entirely and reset
+the device or sit in an infinite loop to wait for a hardware debugger to attach.
+It’s simpler to put this logic at the beginning of the shared crash handler, but
+if your assert/exception handlers are complex it might be safer to inject the
+checks earlier in both codepaths.
+
+.. code-block:: cpp
+
+  [[noreturn]] void HandleCrash(CrashData &crash_info) {
+    static size_t crash_depth = 0;
+    if (crash_depth > kMaxCrashDepth) {
+      Abort(/*run_callbacks=*/false);
+    }
+    crash_depth++;
+    ...
+  }
+
+Re-initialize Logging (Optional)
+--------------------------------
+Logging can be helpful for debugging your crash handler, but depending on your
+device/system design may be challenging to safely support at crash time. To
+re-initialize logging, you’ll need to re-construct C++ objects and re-initialize
+any systems/hardware in the logging codepath. You may even need an entirely
+separate logging pipeline that is single-threaded and interrupt-safe. Depending
+on your system’s design, this may be difficult to set up.
+
+Reinitialize Dependencies
+-------------------------
+It's good practice to design a crash handler that can run before C++ static
+constructors have run. This means any initialization (whether manual or through
+constructors) that your crash handler depends on should be manually invoked at
+crash time. If an initialization step might not be safe, evaluate if it's
+possible to omit the dependency.
+
+System Cleanup
+--------------
+After collecting a snapshot, some parts of your system may benefit from some
+cleanup before explicitly resetting a device. This might include flushing
+buffers or safely shutting down attached hardware. The order of shutdown should
+be deterministic, keeping in mind that any of these steps may have the potential
+of causing a nested crash that skips the remainder of the handlers and forces
+the device to immediately reset.
+
+----------------------
+Snapshot Storage Setup
+----------------------
+Use a storage class with a ``pw::stream::Writer`` interface to simplify
+capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore
+<module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a
+:ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that
+lives in persistent memory. It's good practice to use lazy initialization for
+storage objects used by your Snapshot capture codepath.
+
+.. code-block:: cpp
+
+  // Persistent RAM objects are highly available. They don't rely on
+  // their constructor being run, and require no initialization.
+  PW_KEEP_IN_SECTION(".noinit")
+  pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot;
+
+  void CaptureSnapshot(CrashInfo& crash_info) {
+    ...
+    persistent_snapshot.clear();
+    PersistentBufferWriter& writer = persistent_snapshot.GetWriter();
+    ...
+  }
+
+----------------------
+Snapshot Capture Setup
+----------------------
+
+.. note::
+
+  These instructions do not yet use the ``pw::protobuf::StreamingEncoder``.
+
+Capturing a snapshot is as simple as encoding any other proto message. Some
+modules provide helper functions that will populate parts of a Snapshot, which
+eases the burden of custom work that must be set up uniquely for each project.
+
+Capture Reason
+==============
+A snapshot's "reason" should be considered the single most important field in a
+captured snapshot. If a snapshot capture was triggered by a crash, this should
+be the assert string. Other entry paths should describe here why the snapshot
+was captured ("Host communication buffer full!", "Exception encountered at
+0x00000004", etc.).
+
+.. code-block:: cpp
+
+  Status CaptureSnapshot(CrashData& crash_info) {
+    // Temporary buffer for encoding "reason" to.
+    static std::byte temp_buffer[500];
+    // Temporary buffer to encode serialized proto to before dumping to the
+    // final ``pw::stream::Writer``.
+    static std::byte proto_encode_buffer[512];
+    ...
+    pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer);
+    pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder);
+    size_t length = snprintf(temp_buffer,
+                             sizeof(temp_buffer,
+                             crash_info.reason_fmt),
+                             *crash_info.reason_args);
+    snapshot_encoder.WriteReason(temp_buffer, length));
+
+    // Final encode and write.
+    Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
+    PW_TRY(encoded_proto.status());
+    PW_TRY(writer.Write(encoded_proto.value()));
+    ...
+  }
+
+Capture CPU State
+=================
+When using pw_cpu_exception, exceptions will automatically collect CPU state
+that can be directly dumped into a snapshot. As it's not always easy to describe
+a CPU exception in a single "reason" string, this captures the information
+needed to more verbosely automatically generate a descriptive reason at analysis
+time once the snapshot is retrieved from the device.
+
+.. code-block:: cpp
+
+  Status CaptureSnapshot(CrashData& crash_info) {
+    ...
+
+    proto_encoder.clear();
+
+    // Write CPU state.
+    if (crash_info.cpu_state) {
+      PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(),
+                               *crash_info.cpu_state));
+
+      // Final encode and write.
+      Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
+      PW_TRY(encoded_proto.status());
+      PW_TRY(writer.Write(encoded_proto.value()));
+    }
+  }
+
+-----------------------
+Snapshot Transfer Setup
+-----------------------
+Pigweed’s pw_rpc system is well suited for retrieving a snapshot from a device.
+Pigweed does not yet provide a generalized transfer service for moving files
+to/from a device. When this feature is added to Pigweed, this section will be
+updated to include guidance for connecting a storage system to a transfer
+service.
+
+----------------------
+Snapshot Tooling Setup
+----------------------
+Pigweed will provide Python tooling to dump snapshot protos as human-readable
+text dumps. This section will be updated as this functionality is introduced.