.. _module-pw_snapshot-setup:

==============================
Setting up a Snapshot Pipeline
==============================

.. contents:: Table of Contents

-------------------
Crash Handler Setup
-------------------
The Snapshot proto was designed first and foremost as a crash reporting format.
This section covers how to set up a crash handler to capture Snapshots.

.. image:: images/generic_crash_flow.svg
  :width: 600
  :alt: Generic crash handler flow

A typical crash handler has two entry points:

1. A software entry path through developer-written ASSERT() or CHECK() calls
   that indicate a device should go down for a crash if a condition is not met.
2. A hardware-triggered exception handler path that is initiated when a CPU
   encounters a fault signal (invalid memory access, bad instruction, etc.).

Before deferring to a common crash handler, these entry paths should disable
interrupts to force the system into a single-threaded execution mode. This
prevents other threads from operating on potentially bad data or clobbering
system state that could be useful for debugging.

The first step in a crash handler should always be a check for nested crashes to
prevent infinitely recursive crashes. Once it's deemed safe to continue, the
crash handler can re-initialize logging, initialize storage for crash report
capture, and then build a snapshot to later be retrieved from the device. Once
the crash report collection process is complete, some post-crash callbacks can
be run on a best-effort basis to clean up the system before rebooting. For
devices with debug port access, it's helpful to optionally hold the device in
an infinite loop rather than resetting, so that developers can access the device
via a hardware debugger.

Assert Handler Setup
====================
:ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software
crashes. Route any existing assert functions through pw_assert to centralize the
software crash path. You’ll need to create a :ref:`pw_assert backend
<module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler
<module-pw_assert_basic-custom_handler>` to pass collected information to a more
sophisticated crash handler. One way to do this is to collect the data into a
statically allocated struct that is passed to a common crash handler. It’s
important to immediately disable interrupts to prevent the system from doing
other work while it is in a compromised state.

.. code-block:: cpp

   // This can be directly accessed by a crash handler.
   static CrashData crash_data;
   extern "C" void pw_assert_basic_HandleFailure(const char* file_name,
                                                 int line_number,
                                                 const char* format,
                                                 ...) {
     // Always disable interrupts first! How this is done depends
     // on your platform.
     __disable_irq();

     va_list args;
     va_start(args, format);
     crash_data.file_name = file_name;
     crash_data.line_number = line_number;
     crash_data.reason_fmt = format;
     crash_data.reason_args = &args;
     crash_data.cpu_state = nullptr;

     HandleCrash(crash_data);
     PW_UNREACHABLE;
   }

Exception Handler Setup
=======================
:ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry
point for CPU-triggered faults (divide by zero, invalid memory access, etc.).
You will need to provide a definition for pw_cpu_exception_DefaultHandler() that
passes the exception state produced by pw_cpu_exception to your common crash
handler.

.. code-block:: cpp

   static CrashData crash_data;

   // This helper turns a format string and its arguments into a va_list that
   // can be used by the common crash handling path.
   void HandleExceptionWithString(pw_cpu_exception_State& state,
                                  const char* fmt,
                                  ...) {
     va_list args;
     va_start(args, fmt);
     crash_data.cpu_state = &state;
     crash_data.file_name = nullptr;
     crash_data.line_number = 0;
     crash_data.reason_fmt = fmt;
     crash_data.reason_args = &args;

     HandleCrash(crash_data);
     PW_UNREACHABLE;
   }

   extern "C" void pw_cpu_exception_DefaultHandler(
       pw_cpu_exception_State* cpu_state) {
     // Always disable interrupts first! How this is done depends
     // on your platform.
     __disable_irq();

     // The CFSR is an extremely useful register for understanding ARMv7-M and
     // ARMv8-M CPU faults. Other architectures should put something else here.
     HandleExceptionWithString(
         *cpu_state,
         "Exception encountered, cfsr=0x%08x",
         static_cast<unsigned int>(cpu_state->extended.cfsr));
   }

Common Crash Handler Setup
==========================
To minimize duplication of crash handling logic, it's good practice to route the
pw_assert and pw_cpu_exception handlers to a common crash handling codepath.
Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert
information to the shared handler.

.. code-block:: cpp

   struct CrashData {
     pw_cpu_exception_State *cpu_state;
     const char *reason_fmt;
     const va_list *reason_args;
     const char *file_name;
     int line_number;
   };

   // This function assumes interrupts are properly disabled BEFORE it is called.
   [[noreturn]] void HandleCrash(CrashData& crash_info) {
     // Handle crash
   }

In the crash handler your project can re-initialize a minimal subset of the
system needed to safely capture a snapshot before rebooting the device. The
remainder of this section focuses on ways you can improve the reliability and
usability of your project's crash handler.

Check for Nested Crashes
------------------------
It’s important to include crash handler checks that prevent infinite recursive
nesting of crashes. Maintain a static variable that tracks the crash nesting
depth. After one or two nested crashes, abort crash handling entirely and reset
the device or sit in an infinite loop to wait for a hardware debugger to attach.
It’s simpler to put this logic at the beginning of the shared crash handler, but
if your assert/exception handlers are complex it might be safer to inject the
checks earlier in both codepaths.

.. code-block:: cpp

   [[noreturn]] void HandleCrash(CrashData &crash_info) {
     static size_t crash_depth = 0;
     if (crash_depth > kMaxCrashDepth) {
       Abort(/*run_callbacks=*/false);
     }
     crash_depth++;
     ...
   }

Re-initialize Logging (Optional)
--------------------------------
Logging can be helpful for debugging your crash handler, but depending on your
device/system design it may be challenging to support safely at crash time. To
re-initialize logging, you’ll need to re-construct C++ objects and re-initialize
any systems/hardware in the logging codepath. You may even need an entirely
separate logging pipeline that is single-threaded and interrupt-safe.
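
As a rough illustration of what a separate crash-time logging path could look
like, the sketch below formats each message into a static buffer and pushes it
out one byte at a time through a blocking output hook, so it needs no heap,
locks, or interrupts. ``CrashLog()``, ``SetCrashLogSink()``, and the sink
signature are hypothetical names chosen for this example and are not Pigweed
APIs. On real hardware the sink would typically poll a UART's transmit-ready
flag before writing each byte.

.. code-block:: cpp

   #include <cstdarg>
   #include <cstddef>
   #include <cstdio>

   // Hypothetical blocking output hook. On a device this might spin on a UART
   // "transmit ready" flag and then write one byte.
   using CrashLogSink = void (*)(char c);

   namespace {
   CrashLogSink crash_log_sink = nullptr;
   char crash_log_line[128];  // Static storage: no heap or large stack use.
   }  // namespace

   // Must be called at crash time before the first CrashLog() call.
   void SetCrashLogSink(CrashLogSink sink) { crash_log_sink = sink; }

   // Formats a message and writes it synchronously, followed by a newline.
   // Messages longer than the buffer are truncated.
   void CrashLog(const char* fmt, ...) {
     if (crash_log_sink == nullptr) {
       return;
     }
     va_list args;
     va_start(args, fmt);
     int length = vsnprintf(crash_log_line, sizeof(crash_log_line), fmt, args);
     va_end(args);
     if (length < 0) {
       return;  // Formatting failed; drop the message.
     }
     size_t to_write = static_cast<size_t>(length);
     if (to_write >= sizeof(crash_log_line)) {
       to_write = sizeof(crash_log_line) - 1;
     }
     for (size_t i = 0; i < to_write; ++i) {
       crash_log_sink(crash_log_line[i]);
     }
     crash_log_sink('\n');
   }

Because the sink is installed explicitly at crash time, this path does not
depend on any logging state that may have been corrupted before the crash.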

Reinitialize Dependencies
-------------------------
It's good practice to design a crash handler that can run before C++ static
constructors have run. This means any initialization (whether manual or through
constructors) that your crash handler depends on should be manually invoked at
crash time. If an initialization step might not be safe, evaluate if it's
possible to omit the dependency.
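
One possible way to invoke a constructor manually at crash time is to keep the
object in raw static storage and construct it with placement ``new``. The
sketch below assumes a made-up ``CrashUart`` dependency and a made-up
``ReinitCrashUart()`` helper; neither is part of Pigweed, and the baud rate is
an arbitrary example value.

.. code-block:: cpp

   #include <new>

   // Hypothetical dependency of the crash handler.
   class CrashUart {
    public:
     explicit CrashUart(int baud_rate) : baud_rate_(baud_rate), ready_(true) {}
     bool ready() const { return ready_; }
     int baud_rate() const { return baud_rate_; }

    private:
     int baud_rate_;
     bool ready_;
   };

   // Raw, suitably aligned storage. No static constructor is attached to it, so
   // it is usable even if the crash happens before static constructors run.
   alignas(CrashUart) static unsigned char crash_uart_storage[sizeof(CrashUart)];

   // Constructs a fresh CrashUart in the static storage, ignoring whatever
   // state the storage held before the crash.
   CrashUart& ReinitCrashUart() {
     return *new (crash_uart_storage) CrashUart(/*baud_rate=*/115200);
   }

Calling ``ReinitCrashUart()`` at the start of the crash handler yields an
object in a known state, regardless of whether normal startup ever reached the
point of constructing it.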

System Cleanup
--------------
After collecting a snapshot, some parts of your system may benefit from some
cleanup before explicitly resetting a device. This might include flushing
buffers or safely shutting down attached hardware. The order of shutdown should
be deterministic, keeping in mind that any of these steps can cause a nested
crash that skips the remainder of the handlers and forces the device to
immediately reset.
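
A minimal sketch of deterministic cleanup, assuming a fixed-size callback table:
callbacks run in registration order, and a guard flag ensures that a nested
crash which re-enters the cleanup step skips the remaining callbacks instead of
re-running them. ``RegisterCleanupCallback()`` and ``RunCleanupCallbacks()``
are hypothetical names for this example, not Pigweed APIs.

.. code-block:: cpp

   #include <cstddef>

   using CleanupCallback = void (*)();

   namespace {
   constexpr size_t kMaxCleanupCallbacks = 8;
   CleanupCallback cleanup_callbacks[kMaxCleanupCallbacks];
   size_t cleanup_callback_count = 0;
   bool cleanup_started = false;
   }  // namespace

   // Registers a callback; callbacks later run in registration order.
   // Returns false if the callback is null or the fixed-size table is full.
   bool RegisterCleanupCallback(CleanupCallback callback) {
     if (callback == nullptr ||
         cleanup_callback_count >= kMaxCleanupCallbacks) {
       return false;
     }
     cleanup_callbacks[cleanup_callback_count++] = callback;
     return true;
   }

   // Runs every registered callback once, in order. If a callback crashes and
   // the crash handler re-enters this function, the remaining callbacks are
   // skipped.
   void RunCleanupCallbacks() {
     if (cleanup_started) {
       return;
     }
     cleanup_started = true;
     for (size_t i = 0; i < cleanup_callback_count; ++i) {
       cleanup_callbacks[i]();
     }
   }

A fixed table avoids dynamic allocation, and registering callbacks in a single
place during startup keeps the shutdown order easy to audit.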

----------------------
Snapshot Storage Setup
----------------------
Use a storage class with a ``pw::stream::Writer`` interface to simplify
capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore
<module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a
:ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that
lives in persistent memory. It's good practice to use lazy initialization for
storage objects used by your Snapshot capture codepath.

.. code-block:: cpp

   // Persistent RAM objects are highly available. They don't rely on
   // their constructor being run, and require no initialization.
   PW_KEEP_IN_SECTION(".noinit")
   pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot;

   void CaptureSnapshot(CrashData& crash_info) {
     ...
     persistent_snapshot.clear();
     auto writer = persistent_snapshot.GetWriter();
     ...
   }

----------------------
Snapshot Capture Setup
----------------------

.. note::

  These instructions do not yet use the ``pw::protobuf::StreamingEncoder``.

Capturing a snapshot is as simple as encoding any other proto message. Some
modules provide helper functions that will populate parts of a Snapshot, which
eases the burden of custom work that must be set up uniquely for each project.

Capture Reason
==============
A snapshot's "reason" should be considered the single most important field in a
captured snapshot. If a snapshot capture was triggered by a crash, this should
be the assert string. Other entry paths should describe here why the snapshot
was captured ("Host communication buffer full!", "Exception encountered at
0x00000004", etc.).

.. code-block:: cpp

   Status CaptureSnapshot(CrashData& crash_info) {
     // Temporary buffer for encoding "reason" to.
     static std::byte temp_buffer[500];
     // Temporary buffer to encode serialized proto to before dumping to the
     // final pw::stream::Writer.
     static std::byte proto_encode_buffer[512];
     ...
     pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer);
     pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder);
     // The reason arguments arrive as a va_list, so use vsnprintf(). It returns
     // the length the full string would have had, so clamp the result to what
     // actually fits in the buffer.
     int length = vsnprintf(reinterpret_cast<char*>(temp_buffer),
                            sizeof(temp_buffer),
                            crash_info.reason_fmt,
                            *crash_info.reason_args);
     length = std::clamp(length, 0, static_cast<int>(sizeof(temp_buffer)) - 1);
     snapshot_encoder.WriteReason(temp_buffer, length);

     // Final encode and write.
     Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
     PW_TRY(encoded_proto.status());
     PW_TRY(writer.Write(encoded_proto.value()));
     ...
   }

Capture CPU State
=================
When using pw_cpu_exception, exceptions will automatically collect CPU state
that can be directly dumped into a snapshot. As it's not always easy to describe
a CPU exception in a single "reason" string, the captured state provides the
information needed to automatically generate a more detailed reason at analysis
time, once the snapshot is retrieved from the device.

.. code-block:: cpp

   Status CaptureSnapshot(CrashData& crash_info) {
     ...

     proto_encoder.clear();

     // Write CPU state.
     if (crash_info.cpu_state) {
       PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(),
                                *crash_info.cpu_state));

       // Final encode and write.
       Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
       PW_TRY(encoded_proto.status());
       PW_TRY(writer.Write(encoded_proto.value()));
     }
     return OkStatus();
   }

-----------------------
Snapshot Transfer Setup
-----------------------
Pigweed’s pw_rpc system is well suited for retrieving a snapshot from a device.
Pigweed does not yet provide a generalized transfer service for moving files
to/from a device. When this feature is added to Pigweed, this section will be
updated to include guidance for connecting a storage system to a transfer
service.

----------------------
Snapshot Tooling Setup
----------------------
When using the upstream ``Snapshot`` proto, you can directly use
``pw_snapshot.process`` to process snapshots into human-readable dumps. If
you've opted to extend Pigweed's snapshot proto, you'll likely want to extend
the processing tooling to handle custom project data as well. This can be done
by creating a light wrapper around
``pw_snapshot.processor.process_snapshots()``.

.. code-block:: python

   import pw_snapshot.processor

   # wheel_state_pb2 stands in for your project's own generated proto module.
   import wheel_state_pb2


   def _process_hw_failures(serialized_snapshot: bytes) -> str:
       """Custom handler that checks wheel state."""
       wheel_state = wheel_state_pb2.WheelStateSnapshot()
       output = []
       wheel_state.ParseFromString(serialized_snapshot)

       if len(wheel_state.wheels) != 2:
           output.append(f'Expected 2 wheels, found {len(wheel_state.wheels)}')

       if len(wheel_state.wheels) < 2:
           output.append('Wheels fell off!')

       # And more...

       return '\n'.join(output)


   def process_my_snapshots(serialized_snapshot: bytes) -> str:
       """Runs the snapshot processor with a custom callback."""
       return pw_snapshot.processor.process_snapshots(
           serialized_snapshot, user_processing_callback=_process_hw_failures)