.. _module-pw_snapshot-setup:

==============================
Setting up a Snapshot Pipeline
==============================

.. contents:: Table of Contents

-------------------
Crash Handler Setup
-------------------
The Snapshot proto was designed first and foremost as a crash reporting format.
This section covers how to set up a crash handler to capture Snapshots.

.. image:: images/generic_crash_flow.svg
  :width: 600
  :alt: Generic crash handler flow

A typical crash handler has two entry points:

1. A software entry path through developer-written ASSERT() or CHECK() calls
   that indicate a device should go down for a crash if a condition is not met.
2. A hardware-triggered exception handler path that is initiated when a CPU
   encounters a fault signal (invalid memory access, bad instruction, etc.).

Before deferring to a common crash handler, these entry paths should disable
interrupts to force the system into a single-threaded execution mode. This
prevents other threads from operating on potentially bad data or clobbering
system state that could be useful for debugging.

The first step in a crash handler should always be a check for nested crashes to
prevent infinitely recursive crashes. Once it's deemed safe to continue, the
crash handler can re-initialize logging, initialize storage for crash report
capture, and then build a snapshot to later be retrieved from the device. Once
the crash report collection process is complete, some post-crash callbacks can
be run on a best-effort basis to clean up the system before rebooting. For
devices with debug port access, it's helpful to optionally hold the device in
an infinite loop rather than resetting to allow developers to access the device
via a hardware debugger.

Assert Handler Setup
====================
:ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software
crashes. Route any existing assert functions through pw_assert to centralize the
software crash path. You'll need to create a :ref:`pw_assert backend
<module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler
<module-pw_assert_basic-custom_handler>` to pass collected information to a more
sophisticated crash handler. One way to do this is to collect the data into a
statically allocated struct that is passed to a common crash handler. It's
important to immediately disable interrupts to prevent the system from doing
other things while in an impacted state.

.. code-block:: cpp

  // This can be directly accessed by a crash handler.
  static CrashData crash_data;
  extern "C" void pw_assert_basic_HandleFailure(const char* file_name,
                                                int line_number,
                                                const char* format,
                                                ...) {
    // Always disable interrupts first! How this is done depends
    // on your platform.
    __disable_irq();

    va_list args;
    va_start(args, format);
    crash_data.file_name = file_name;
    crash_data.line_number = line_number;
    crash_data.reason_fmt = format;
    // Storing a pointer to a local is only safe here because HandleCrash()
    // never returns, so this stack frame stays alive.
    crash_data.reason_args = &args;
    crash_data.cpu_state = nullptr;

    HandleCrash(crash_data);
    PW_UNREACHABLE;
  }

Exception Handler Setup
=======================
:ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry
point for CPU-triggered faults (divide by zero, invalid memory access, etc.).
You will need to provide a definition for pw_cpu_exception_DefaultHandler() that
passes the exception state produced by pw_cpu_exception to your common crash
handler.

.. code-block:: cpp

  static CrashData crash_data;
  // This helper turns a format string to a va_list that can be used by the
  // common crash handling path.
  void HandleExceptionWithString(pw_cpu_exception_State& state,
                                 const char* fmt,
                                 ...) {
    va_list args;
    va_start(args, fmt);
    crash_data.cpu_state = &state;
    crash_data.file_name = nullptr;
    crash_data.reason_fmt = fmt;
    crash_data.reason_args = &args;

    HandleCrash(crash_data);
    PW_UNREACHABLE;
  }

  extern "C" void pw_cpu_exception_DefaultHandler(
      pw_cpu_exception_State* cpu_state) {
    // Always disable interrupts first! How this is done depends
    // on your platform.
    __disable_irq();

    // The CFSR is an extremely useful register for understanding ARMv7-M and
    // ARMv8-M CPU faults. Other architectures should put something else here.
    // The PRIx32 format macro comes from <cinttypes>.
    HandleExceptionWithString(*cpu_state,
                              "Exception encountered, cfsr=0x%08" PRIx32,
                              cpu_state->extended.cfsr);
  }
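
Because the handler above reports the raw CFSR value, it can help to decode
that value during analysis. The following is a minimal sketch, not part of
pw_cpu_exception: ``SplitCfsr``, ``ClassifyCfsr``, ``FaultKind``, and the
``kCfsr*`` constants are hypothetical names chosen for illustration. The bit
positions follow the ARMv7-M layout, where the CFSR packs the MemManage,
BusFault, and UsageFault status registers into one 32-bit word. Only a handful
of fault bits are handled here.

```cpp
#include <cstdint>

// Sub-register views of the Cortex-M Configurable Fault Status Register.
struct CfsrFields {
  std::uint8_t mmfsr;   // MemManage Fault Status, CFSR bits [7:0].
  std::uint8_t bfsr;    // BusFault Status, CFSR bits [15:8].
  std::uint16_t ufsr;   // UsageFault Status, CFSR bits [31:16].
};

// A few commonly inspected fault bits, expressed as full CFSR masks.
constexpr std::uint32_t kCfsrDaccViol = 1u << 1;     // MMFSR.DACCVIOL
constexpr std::uint32_t kCfsrPreciseErr = 1u << 9;   // BFSR.PRECISERR
constexpr std::uint32_t kCfsrUndefInstr = 1u << 16;  // UFSR.UNDEFINSTR
constexpr std::uint32_t kCfsrDivByZero = 1u << 25;   // UFSR.DIVBYZERO

enum class FaultKind {
  kNone,
  kDataAccessViolation,
  kPreciseBusError,
  kUndefinedInstruction,
  kDivideByZero,
  kOther,
};

// Splits the combined register into its three status sub-registers.
constexpr CfsrFields SplitCfsr(std::uint32_t cfsr) {
  return CfsrFields{static_cast<std::uint8_t>(cfsr & 0xFFu),
                    static_cast<std::uint8_t>((cfsr >> 8) & 0xFFu),
                    static_cast<std::uint16_t>((cfsr >> 16) & 0xFFFFu)};
}

// Maps a CFSR value to the first fault kind this sketch recognizes.
constexpr FaultKind ClassifyCfsr(std::uint32_t cfsr) {
  if (cfsr == 0) return FaultKind::kNone;
  if (cfsr & kCfsrDivByZero) return FaultKind::kDivideByZero;
  if (cfsr & kCfsrUndefInstr) return FaultKind::kUndefinedInstruction;
  if (cfsr & kCfsrPreciseErr) return FaultKind::kPreciseBusError;
  if (cfsr & kCfsrDaccViol) return FaultKind::kDataAccessViolation;
  return FaultKind::kOther;
}
```

More than one fault bit can be set at once, so a fuller decoder would likely
report every set bit rather than only the first match.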

Common Crash Handler Setup
==========================
To minimize duplication of crash handling logic, it's good practice to route the
pw_assert and pw_cpu_exception handlers to a common crash handling codepath.
Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert
information to the shared handler.

.. code-block:: cpp

  struct CrashData {
    pw_cpu_exception_State *cpu_state;
    const char *reason_fmt;
    const va_list *reason_args;
    const char *file_name;
    int line_number;
  };

  // This function assumes interrupts are properly disabled BEFORE it is called.
  [[noreturn]] void HandleCrash(CrashData& crash_info) {
    // Handle crash
  }

In the crash handler your project can re-initialize a minimal subset of the
system needed to safely capture a snapshot before rebooting the device. The
remainder of this section focuses on ways you can improve the reliability and
usability of your project's crash handler.

Check for Nested Crashes
------------------------
It’s important to include crash handler checks that prevent infinite recursive
nesting of crashes. Maintain a static variable that tracks the crash nesting
depth. After one or two nested crashes, abort crash handling entirely and reset
the device or sit in an infinite loop to wait for a hardware debugger to attach.
It’s simpler to put this logic at the beginning of the shared crash handler, but
if your assert/exception handlers are complex it might be safer to inject the
checks earlier in both codepaths.

.. code-block:: cpp

  [[noreturn]] void HandleCrash(CrashData &crash_info) {
    static size_t crash_depth = 0;
    if (crash_depth > kMaxCrashDepth) {
      Abort(/*run_callbacks=*/false);
    }
    crash_depth++;
    ...
  }

Re-initialize Logging (Optional)
--------------------------------
Logging can be helpful for debugging your crash handler, but depending on your
device/system design it may be challenging to safely support at crash time. To
re-initialize logging, you’ll need to re-construct C++ objects and re-initialize
any systems/hardware in the logging codepath. You may even need an entirely
separate logging pipeline that is single-threaded and interrupt-safe. Depending
on your system’s design, this may be difficult to set up.

Reinitialize Dependencies
-------------------------
It's good practice to design a crash handler that can run before C++ static
constructors have run. This means any initialization (whether manual or through
constructors) that your crash handler depends on should be manually invoked at
crash time. If an initialization step might not be safe, evaluate if it's
possible to omit the dependency.
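
One way to invoke a constructor manually at crash time is to keep the object in
raw static storage and build it with placement new. The sketch below assumes a
hypothetical ``CrashUart`` dependency and a hypothetical
``ReinitializeCrashUart()`` helper; neither is a Pigweed API. It also assumes
the type is trivially destructible, so constructing over an earlier instance is
harmless.

```cpp
#include <new>

// Hypothetical dependency of the crash handler; stands in for a real driver.
class CrashUart {
 public:
  explicit CrashUart(unsigned baud_rate)
      : baud_rate_(baud_rate), ready_(true) {}

  bool ready() const { return ready_; }
  unsigned baud_rate() const { return baud_rate_; }

 private:
  unsigned baud_rate_;
  bool ready_;
};

// Raw storage with static storage duration. It needs no static constructor,
// so it is usable even if a crash happens before C++ runtime init completes.
alignas(CrashUart) static unsigned char crash_uart_storage[sizeof(CrashUart)];

// Constructs the object in place on every call, so crash-time state never
// depends on whatever boot-time initialization did or did not happen.
CrashUart& ReinitializeCrashUart() {
  return *new (crash_uart_storage) CrashUart(115200);
}
```

The crash handler would call ``ReinitializeCrashUart()`` once, early on, and
use the returned reference instead of any globally constructed instance.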

System Cleanup
--------------
After collecting a snapshot, some parts of your system may benefit from some
cleanup before explicitly resetting a device. This might include flushing
buffers or safely shutting down attached hardware. The order of shutdown should
be deterministic, keeping in mind that any of these steps may cause a nested
crash that skips the remainder of the handlers and forces the device to
immediately reset.
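
A fixed table of callbacks is one simple way to get a deterministic shutdown
order. The following sketch is only an illustration under stated assumptions:
``FlushBuffers``, ``ShutDownPeripherals``, ``RunCleanupCallbacks``, and the
``cleanup_log`` bookkeeping are hypothetical. A static flag makes a re-entered
cleanup pass (for example, after a nested crash) skip the remaining callbacks.

```cpp
#include <cstddef>

using CleanupCallback = void (*)();

// Records the order in which the hypothetical cleanup steps ran.
static char cleanup_log[4] = {};
static std::size_t cleanup_log_size = 0;

// Hypothetical cleanup steps. Real ones might flush a log buffer or power
// down a radio.
void FlushBuffers() { cleanup_log[cleanup_log_size++] = 'F'; }
void ShutDownPeripherals() { cleanup_log[cleanup_log_size++] = 'S'; }

// A fixed array gives a deterministic order with no dynamic registration.
constexpr CleanupCallback kCleanupCallbacks[] = {FlushBuffers,
                                                 ShutDownPeripherals};

// Runs every callback once, in array order, and returns how many ran.
// If cleanup was already started, nothing runs and 0 is returned.
std::size_t RunCleanupCallbacks() {
  static bool cleanup_started = false;
  if (cleanup_started) {
    return 0;
  }
  cleanup_started = true;

  std::size_t ran = 0;
  for (CleanupCallback callback : kCleanupCallbacks) {
    callback();
    ++ran;
  }
  return ran;
}
```

Setting the flag before running any callback is the design choice that matters
here: a callback that crashes is never retried.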

----------------------
Snapshot Storage Setup
----------------------
Use a storage class with a ``pw::stream::Writer`` interface to simplify
capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore
<module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a
:ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that
lives in persistent memory. It's good practice to use lazy initialization for
storage objects used by your Snapshot capture codepath.

.. code-block:: cpp

  // Persistent RAM objects are highly available. They don't rely on
  // their constructor being run, and require no initialization.
  PW_KEEP_IN_SECTION(".noinit")
  pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot;

  void CaptureSnapshot(CrashData& crash_info) {
    ...
    persistent_snapshot.clear();
    PersistentBufferWriter& writer = persistent_snapshot.GetWriter();
    ...
  }

----------------------
Snapshot Capture Setup
----------------------

.. note::

  These instructions do not yet use the ``pw::protobuf::StreamingEncoder``.

Capturing a snapshot is as simple as encoding any other proto message. Some
modules provide helper functions that will populate parts of a Snapshot, which
eases the burden of custom work that must be set up uniquely for each project.

Capture Reason
==============
A snapshot's "reason" should be considered the single most important field in a
captured snapshot. If a snapshot capture was triggered by a crash, this should
be the assert string. Other entry paths should describe here why the snapshot
was captured ("Host communication buffer full!", "Exception encountered at
0x00000004", etc.).

.. code-block:: cpp

  Status CaptureSnapshot(CrashData& crash_info) {
    // Temporary buffer for encoding "reason" to.
    static std::byte temp_buffer[500];
    // Temporary buffer to encode serialized proto to before dumping to the
    // final ``pw::stream::Writer``.
    static std::byte proto_encode_buffer[512];
    ...
    pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer);
    pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder);
    // vsnprintf() (not snprintf()) is required to format from a va_list. It
    // returns the untruncated length, or a negative value on error, so clamp
    // the result to what actually fits in the buffer.
    int result = vsnprintf(reinterpret_cast<char*>(temp_buffer),
                           sizeof(temp_buffer),
                           crash_info.reason_fmt,
                           *crash_info.reason_args);
    size_t length = 0;
    if (result > 0) {
      length = std::min(static_cast<size_t>(result), sizeof(temp_buffer) - 1);
    }
    snapshot_encoder.WriteReason(temp_buffer, length);

    // Final encode and write.
    Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
    PW_TRY(encoded_proto.status());
    PW_TRY(writer.Write(encoded_proto.value()));
    ...
  }
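
The reason-formatting step can be exercised on a host without any Pigweed
dependencies. The sketch below isolates it into two hypothetical helpers,
``VFormatReason`` (takes a ``va_list``) and ``FormatReason`` (variadic
wrapper); both names are assumptions made for illustration. It relies only on
the standard behavior of ``vsnprintf``, which returns the length the full
string would have had, so the stored length has to be clamped when the output
is truncated.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>

// Formats fmt/args into buffer and returns the number of characters actually
// stored, excluding the null terminator. Returns 0 on error or empty buffer.
std::size_t VFormatReason(char* buffer,
                          std::size_t buffer_size,
                          const char* fmt,
                          va_list args) {
  if (buffer == nullptr || buffer_size == 0) {
    return 0;
  }
  const int result = std::vsnprintf(buffer, buffer_size, fmt, args);
  if (result < 0) {
    buffer[0] = '\0';
    return 0;
  }
  const std::size_t wanted = static_cast<std::size_t>(result);
  // On truncation, only buffer_size - 1 characters were stored.
  return wanted < buffer_size ? wanted : buffer_size - 1;
}

// Variadic convenience wrapper around VFormatReason().
std::size_t FormatReason(char* buffer,
                         std::size_t buffer_size,
                         const char* fmt,
                         ...) {
  va_list args;
  va_start(args, fmt);
  const std::size_t length = VFormatReason(buffer, buffer_size, fmt, args);
  va_end(args);
  return length;
}
```

Passing the clamped length to the encoder avoids writing bytes past the end of
the formatted string when a long assert message is truncated.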

Capture CPU State
=================
When using pw_cpu_exception, exceptions will automatically collect CPU state
that can be directly dumped into a snapshot. As it's not always easy to describe
a CPU exception in a single "reason" string, this captures the information
needed to automatically generate a more detailed reason at analysis time, once
the snapshot is retrieved from the device.

.. code-block:: cpp

  Status CaptureSnapshot(CrashData& crash_info) {
    ...

    proto_encoder.clear();

    // Write CPU state.
    if (crash_info.cpu_state) {
      PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(),
                               *crash_info.cpu_state));

      // Final encode and write.
      Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
      PW_TRY(encoded_proto.status());
      PW_TRY(writer.Write(encoded_proto.value()));
    }
    return OkStatus();
  }

-----------------------
Snapshot Transfer Setup
-----------------------
Pigweed's pw_rpc system is well suited for retrieving a snapshot from a device.
Pigweed does not yet provide a generalized transfer service for moving files
to/from a device. When this feature is added to Pigweed, this section will be
updated to include guidance for connecting a storage system to a transfer
service.

----------------------
Snapshot Tooling Setup
----------------------
When using the upstream ``Snapshot`` proto, you can directly use
``pw_snapshot.process`` to process snapshots into human-readable dumps. If
you've opted to extend Pigweed's snapshot proto, you'll likely want to extend
the processing tooling to handle custom project data as well. This can be done
by creating a light wrapper around
``pw_snapshot.processor.process_snapshots()``.

.. code-block:: python

  import pw_snapshot.processor

  # wheel_state_pb2 is assumed to be this example project's generated proto
  # module; import it from wherever your build places it.


  def _process_hw_failures(serialized_snapshot: bytes) -> str:
      """Custom handler that checks wheel state."""
      wheel_state = wheel_state_pb2.WheelStateSnapshot()
      output = []
      wheel_state.ParseFromString(serialized_snapshot)

      if len(wheel_state.wheels) != 2:
          output.append(f'Expected 2 wheels, found {len(wheel_state.wheels)}')

      if len(wheel_state.wheels) < 2:
          output.append('Wheels fell off!')

      # And more...

      return '\n'.join(output)


  def process_my_snapshots(serialized_snapshot: bytes) -> str:
      """Runs the snapshot processor with a custom callback."""
      return pw_snapshot.processor.process_snapshots(
          serialized_snapshot, user_processing_callback=_process_hw_failures)