.. _module-pw_snapshot-setup:

==============================
Setting up a Snapshot Pipeline
==============================

.. contents:: Table of Contents

-------------------
Crash Handler Setup
-------------------
The Snapshot proto was designed first and foremost as a crash reporting format.
This section covers how to set up a crash handler to capture Snapshots.

.. image:: images/generic_crash_flow.svg
  :width: 600
  :alt: Generic crash handler flow

A typical crash handler has two entry points:

1. A software entry path through developer-written ASSERT() or CHECK() calls
   that indicate a device should go down for a crash if a condition is not met.
2. A hardware-triggered exception handler path that is initiated when a CPU
   encounters a fault signal (invalid memory access, bad instruction, etc.).

Before deferring to a common crash handler, these entry paths should disable
interrupts to force the system into a single-threaded execution mode. This
prevents other threads from operating on potentially bad data or clobbering
system state that could be useful for debugging.

The first step in a crash handler should always be a check for nested crashes to
prevent infinitely recursive crashes. Once it's deemed safe to continue, the
crash handler can re-initialize logging, initialize storage for crash report
capture, and then build a snapshot to later be retrieved from the device. Once
the crash report collection process is complete, some post-crash callbacks can
be run on a best-effort basis to clean up the system before rebooting. For
devices with debug port access, it's helpful to optionally hold the device in
an infinite loop rather than resetting to allow developers to access the device
via a hardware debugger.

Assert Handler Setup
====================
:ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software
crashes. Route any existing assert functions through pw_assert to centralize the
software crash path. You’ll need to create a :ref:`pw_assert backend
<module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler
<module-pw_assert_basic-custom_handler>` to pass collected information to a more
sophisticated crash handler. One way to do this is to collect the data into a
statically allocated struct that is passed to a common crash handler. It’s
important to immediately disable interrupts to prevent the system from doing
other things while in an impacted state.

.. code-block:: cpp

  // This can be directly accessed by a crash handler.
  static CrashData crash_data;

  extern "C" void pw_assert_basic_HandleFailure(const char* file_name,
                                                int line_number,
                                                const char* format,
                                                ...) {
    // Always disable interrupts first! How this is done depends
    // on your platform.
    __disable_irq();

    va_list args;
    va_start(args, format);
    crash_data.file_name = file_name;
    crash_data.line_number = line_number;
    crash_data.reason_fmt = format;
    crash_data.reason_args = &args;
    crash_data.cpu_state = nullptr;

    HandleCrash(crash_data);
    PW_UNREACHABLE;
  }

Exception Handler Setup
=======================
:ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry
point for CPU-triggered faults (divide by zero, invalid memory access, etc.).
You will need to provide a definition for pw_cpu_exception_DefaultHandler() that
passes the exception state produced by pw_cpu_exception to your common crash
handler.

.. code-block:: cpp

  static CrashData crash_data;

  // This helper turns a format string to a va_list that can be used by the
  // common crash handling path.
  void HandleExceptionWithString(pw_cpu_exception_State& state,
                                 const char* fmt,
                                 ...) {
    va_list args;
    va_start(args, fmt);
    crash_data.cpu_state = &state;
    crash_data.file_name = nullptr;
    crash_data.reason_fmt = fmt;
    crash_data.reason_args = &args;

    HandleCrash(crash_data);
    PW_UNREACHABLE;
  }

  extern "C" void pw_cpu_exception_DefaultHandler(
      pw_cpu_exception_State* state) {
    // Always disable interrupts first! How this is done depends
    // on your platform.
    __disable_irq();

    // The CFSR is an extremely useful register for understanding ARMv7-M and
    // ARMv8-M CPU faults. Other architectures should put something else here.
    // PRIx32 (from <cinttypes>) assumes cfsr is a uint32_t.
    HandleExceptionWithString(*state,
                              "Exception encountered, cfsr=0x%08" PRIx32,
                              state->extended.cfsr);
  }

Common Crash Handler Setup
==========================
To minimize duplication of crash handling logic, it's good practice to route the
pw_assert and pw_cpu_exception handlers to a common crash handling codepath.
Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert
information to the shared handler.

.. code-block:: cpp

  struct CrashData {
    pw_cpu_exception_State* cpu_state;  // nullptr on the assert path.
    const char* reason_fmt;
    const va_list* reason_args;
    const char* file_name;  // nullptr on the exception path.
    int line_number;
  };

  // This function assumes interrupts are properly disabled BEFORE it is called.
  [[noreturn]] void HandleCrash(CrashData& crash_info) {
    // Handle crash
  }

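The handlers shown earlier mark the entry path implicitly: the assert path sets
``cpu_state`` to ``nullptr``, and the exception path sets ``file_name`` to
``nullptr``. As a minimal sketch of how a shared handler might branch on that,
the example below uses stand-in types so it is self-contained.
``ExampleCpuState``, ``ExampleCrashData``, ``CrashSource``, and
``ClassifyCrash()`` are hypothetical names for illustration, not Pigweed APIs.

.. code-block:: cpp

  #include <cstdint>

  // Stand-in for pw_cpu_exception_State so the sketch builds on its own.
  struct ExampleCpuState {
    std::uint32_t cfsr;
  };

  // Mirrors the CrashData fields that identify the entry path (hypothetical).
  struct ExampleCrashData {
    const ExampleCpuState* cpu_state;  // nullptr on the assert path.
    const char* file_name;             // nullptr on the exception path.
    int line_number;
  };

  enum class CrashSource { kAssert, kCpuException };

  // A CPU state pointer is only present when a hardware fault was captured.
  CrashSource ClassifyCrash(const ExampleCrashData& crash) {
    return crash.cpu_state != nullptr ? CrashSource::kCpuException
                                      : CrashSource::kAssert;
  }

A shared handler could use a check like this to decide whether there is CPU
state to dump or a file name and line number to report.
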
In the crash handler, your project can re-initialize the minimal subset of the
system needed to safely capture a snapshot before rebooting the device. The
remainder of this section focuses on ways you can improve the reliability and
usability of your project's crash handler.

Check for Nested Crashes
------------------------
It’s important to include crash handler checks that prevent infinitely recursive
nesting of crashes. Maintain a static variable that tracks the crash nesting
depth. After one or two nested crashes, abort crash handling entirely and reset
the device or sit in an infinite loop to wait for a hardware debugger to attach.
It’s simpler to put this logic at the beginning of the shared crash handler, but
if your assert/exception handlers are complex it might be safer to inject the
checks earlier in both codepaths.

.. code-block:: cpp

  [[noreturn]] void HandleCrash(CrashData& crash_info) {
    static size_t crash_depth = 0;
    if (crash_depth > kMaxCrashDepth) {
      Abort(/*run_callbacks=*/false);
    }
    crash_depth++;
    ...
  }

Re-initialize Logging (Optional)
--------------------------------
Logging can be helpful for debugging your crash handler, but depending on your
device/system design, it may be challenging to support safely at crash time. To
re-initialize logging, you’ll need to re-construct C++ objects and re-initialize
any systems/hardware in the logging codepath. You may even need an entirely
separate logging pipeline that is single-threaded and interrupt-safe. Depending
on your system’s design, this may be difficult to set up.

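As a rough illustration of a separate crash-time logging path, the sketch below
formats messages into a statically allocated buffer without locks, heap
allocation, or constructors. ``CrashLog``, ``kCrashLogSize``,
``CrashLogReset()``, and ``CrashLogPrintf()`` are hypothetical names, and a
real pipeline would still need a way to get the buffer off the device.

.. code-block:: cpp

  #include <cstdarg>
  #include <cstddef>
  #include <cstdio>

  // Hypothetical crash-time log: fixed storage, no locks, no heap.
  constexpr std::size_t kCrashLogSize = 256;

  struct CrashLog {
    char data[kCrashLogSize];
    std::size_t used;  // Characters stored, excluding the null terminator.
  };

  static CrashLog crash_log;

  // Explicitly resets the log; call this at the start of crash handling.
  void CrashLogReset() {
    crash_log.used = 0;
    crash_log.data[0] = '\0';
  }

  // Appends a formatted message. Returns false if formatting failed or the
  // message had to be truncated to fit the remaining space.
  bool CrashLogPrintf(const char* fmt, ...) {
    const std::size_t remaining = kCrashLogSize - crash_log.used;
    va_list args;
    va_start(args, fmt);
    const int written =
        std::vsnprintf(crash_log.data + crash_log.used, remaining, fmt, args);
    va_end(args);
    if (written < 0) {
      return false;
    }
    if (static_cast<std::size_t>(written) >= remaining) {
      // The buffer is full; the last byte holds the null terminator.
      crash_log.used = kCrashLogSize - 1;
      return false;
    }
    crash_log.used += static_cast<std::size_t>(written);
    return true;
  }

Because interrupts are assumed to be disabled before the crash handler runs,
this sketch does no locking of its own.
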
Reinitialize Dependencies
-------------------------
It's good practice to design a crash handler that can run before C++ static
constructors have run. This means any initialization (whether manual or through
constructors) that your crash handler depends on should be manually invoked at
crash time. If an initialization step might not be safe, evaluate if it's
possible to omit the dependency.

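One way to follow this advice, sketched below under the assumption that the
dependency can be written without a constructor, is to give it an explicit
``Init()`` function that the crash handler always calls. ``CrashUart`` and
``ReinitializeCrashDependencies()`` are hypothetical names, and the baud rate
is an arbitrary example value.

.. code-block:: cpp

  #include <cstdint>
  #include <type_traits>

  // Hypothetical driver with no constructor: all state is set by Init().
  struct CrashUart {
    std::uint32_t baud_rate;
    bool ready;

    void Init(std::uint32_t baud) {
      baud_rate = baud;
      ready = true;
    }
  };

  // A trivially constructible type needs no static constructor to run.
  static_assert(std::is_trivially_default_constructible_v<CrashUart>);

  static CrashUart crash_uart;

  // Called early in the crash handler, whether or not normal startup
  // already initialized the UART.
  void ReinitializeCrashDependencies() {
    crash_uart.Init(115200);
  }

The ``static_assert`` turns an accidental constructor dependency into a
compile-time error instead of a crash-time surprise.
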
System Cleanup
--------------
After collecting a snapshot, some parts of your system may benefit from
cleanup before explicitly resetting a device. This might include flushing
buffers or safely shutting down attached hardware. The order of shutdown should
be deterministic, keeping in mind that any of these steps may cause a nested
crash that skips the remainder of the handlers and forces the device to
immediately reset.

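A fixed table of function pointers is one simple way to keep the shutdown order
deterministic. The sketch below is illustrative only: ``FlushLogBuffers()``,
``ShutDownRadio()``, and ``RunCleanupCallbacks()`` are hypothetical, and the
callbacks merely record the order in which they ran.

.. code-block:: cpp

  #include <cstddef>

  using CleanupCallback = void (*)();

  // Records the order in which the hypothetical cleanup steps ran.
  static int cleanup_order[2];
  static std::size_t cleanup_count = 0;

  void FlushLogBuffers() { cleanup_order[cleanup_count++] = 1; }
  void ShutDownRadio() { cleanup_order[cleanup_count++] = 2; }

  // The table order is the shutdown order.
  constexpr CleanupCallback kCleanupCallbacks[] = {FlushLogBuffers,
                                                   ShutDownRadio};

  // Runs every callback in table order and returns how many were run.
  // Passing false skips cleanup, e.g. when aborting after a nested crash.
  std::size_t RunCleanupCallbacks(bool run_callbacks) {
    if (!run_callbacks) {
      return 0;
    }
    std::size_t ran = 0;
    for (CleanupCallback callback : kCleanupCallbacks) {
      callback();
      ++ran;
    }
    return ran;
  }

If a callback triggers a nested crash, the nesting check at the top of the
crash handler can skip the remaining callbacks by aborting with cleanup
disabled.
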
----------------------
Snapshot Storage Setup
----------------------
Use a storage class with a ``pw::stream::Writer`` interface to simplify
capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore
<module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a
:ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that
lives in persistent memory. It's good practice to use lazy initialization for
storage objects used by your Snapshot capture codepath.

.. code-block:: cpp

  // Persistent RAM objects are highly available. They don't rely on
  // their constructor being run, and require no initialization.
  PW_KEEP_IN_SECTION(".noinit")
  pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot;

  void CaptureSnapshot(CrashData& crash_info) {
    ...
    persistent_snapshot.clear();
    PersistentBufferWriter& writer = persistent_snapshot.GetWriter();
    ...
  }

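For storage that does need setup, lazy initialization can be sketched as
follows. ``SnapshotStorage`` is a hypothetical in-memory stand-in, not a
Pigweed class, and the sketch assumes the object lives in zero-initialized
static storage (so ``initialized_`` starts out false); that assumption would
not hold for an object placed in a ``.noinit`` section.

.. code-block:: cpp

  #include <cstddef>
  #include <cstring>

  // Hypothetical storage that sets itself up on first use instead of in a
  // constructor.
  class SnapshotStorage {
   public:
    // Appends bytes, initializing the storage first if needed. Returns false
    // if the data does not fit.
    bool Write(const void* data, std::size_t size) {
      EnsureInitialized();
      if (size > sizeof(buffer_) - size_) {
        return false;
      }
      std::memcpy(buffer_ + size_, data, size);
      size_ += size;
      return true;
    }

    std::size_t size() const { return initialized_ ? size_ : 0; }

   private:
    void EnsureInitialized() {
      if (!initialized_) {
        size_ = 0;
        initialized_ = true;
      }
    }

    unsigned char buffer_[64];
    std::size_t size_;
    bool initialized_;
  };

  static SnapshotStorage snapshot_storage;

The first ``Write()`` performs the setup, so the capture codepath works whether
or not normal startup ever touched the storage object.
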
----------------------
Snapshot Capture Setup
----------------------

.. note::

  These instructions do not yet use the ``pw::protobuf::StreamingEncoder``.

Capturing a snapshot is as simple as encoding any other proto message. Some
modules provide helper functions that will populate parts of a Snapshot, which
eases the burden of custom work that must be set up uniquely for each project.

Capture Reason
==============
A snapshot's "reason" should be considered the single most important field in a
captured snapshot. If a snapshot capture was triggered by a crash, this should
be the assert string. Other entry paths should describe here why the snapshot
was captured ("Host communication buffer full!", "Exception encountered at
0x00000004", etc.).

.. code-block:: cpp

  Status CaptureSnapshot(CrashData& crash_info) {
    // Temporary buffer for encoding "reason" to.
    static std::byte temp_buffer[500];
    // Temporary buffer to encode serialized proto to before dumping to the
    // final ``pw::stream::Writer``.
    static std::byte proto_encode_buffer[512];
    ...
    pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer);
    pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder);
    // The arguments arrive as a va_list, so use vsnprintf() rather than
    // snprintf(). The result is negative on error and can exceed the buffer
    // size when the output is truncated, so clamp it before use.
    int result = vsnprintf(reinterpret_cast<char*>(temp_buffer),
                           sizeof(temp_buffer),
                           crash_info.reason_fmt,
                           *crash_info.reason_args);
    size_t length = 0;
    if (result > 0) {
      length = std::min(static_cast<size_t>(result), sizeof(temp_buffer) - 1);
    }
    snapshot_encoder.WriteReason(temp_buffer, length);

    // Final encode and write.
    Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
    PW_TRY(encoded_proto.status());
    PW_TRY(writer.Write(encoded_proto.value()));
    ...
  }

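The reason-formatting step can be isolated and tested on its own. The
following is a minimal sketch using only the C++ standard library;
``FormatReason()`` and ``FormatReasonF()`` are hypothetical helpers, not part
of pw_snapshot. It shows how to turn the return value of ``vsnprintf()`` into
the number of characters actually stored.

.. code-block:: cpp

  #include <cstdarg>
  #include <cstddef>
  #include <cstdio>

  // Formats a reason string into `buffer` and returns the number of
  // characters stored, excluding the null terminator. Output that does not
  // fit is truncated.
  std::size_t FormatReason(char* buffer,
                           std::size_t buffer_size,
                           const char* fmt,
                           va_list args) {
    if (buffer_size == 0) {
      return 0;
    }
    const int result = std::vsnprintf(buffer, buffer_size, fmt, args);
    if (result < 0) {
      buffer[0] = '\0';  // Formatting failed; store an empty reason.
      return 0;
    }
    const std::size_t wanted = static_cast<std::size_t>(result);
    return wanted < buffer_size ? wanted : buffer_size - 1;
  }

  // Variadic wrapper so the helper can be called directly.
  std::size_t FormatReasonF(char* buffer,
                            std::size_t buffer_size,
                            const char* fmt,
                            ...) {
    va_list args;
    va_start(args, fmt);
    const std::size_t length = FormatReason(buffer, buffer_size, fmt, args);
    va_end(args);
    return length;
  }

Returning the stored length rather than the raw ``vsnprintf()`` result keeps a
truncated reason from causing a read past the end of the buffer when the
string is written to the snapshot.
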
Capture CPU State
=================
When using pw_cpu_exception, exceptions will automatically collect CPU state
that can be directly dumped into a snapshot. As it's not always easy to describe
a CPU exception in a single "reason" string, the captured CPU state provides the
information needed to automatically generate a more detailed reason at analysis
time, once the snapshot is retrieved from the device.

.. code-block:: cpp

  Status CaptureSnapshot(CrashData& crash_info) {
    ...

    proto_encoder.clear();

    // Write CPU state.
    if (crash_info.cpu_state) {
      PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(),
                               *crash_info.cpu_state));

      // Final encode and write.
      Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
      PW_TRY(encoded_proto.status());
      PW_TRY(writer.Write(encoded_proto.value()));
    }

    return OkStatus();
  }

-----------------------
Snapshot Transfer Setup
-----------------------
Pigweed’s pw_rpc system is well suited for retrieving a snapshot from a device.
Pigweed does not yet provide a generalized transfer service for moving files
to/from a device. When this feature is added to Pigweed, this section will be
updated to include guidance for connecting a storage system to a transfer
service.

----------------------
Snapshot Tooling Setup
----------------------
When using the upstream ``Snapshot`` proto, you can directly use
``pw_snapshot.process`` to process snapshots into human-readable dumps. If
you've opted to extend Pigweed's snapshot proto, you'll likely want to extend
the processing tooling to handle custom project data as well. This can be done
by creating a light wrapper around
``pw_snapshot.processor.process_snapshots()``.

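.. code-block:: python

  import pw_snapshot.processor

  # wheel_state_pb2 is this example's project-specific generated proto module;
  # import it from wherever your project generates it.


  def _process_hw_failures(serialized_snapshot: bytes) -> str:
      """Custom handler that checks wheel state."""
      wheel_state = wheel_state_pb2.WheelStateSnapshot()
      output = []
      wheel_state.ParseFromString(serialized_snapshot)

      if len(wheel_state.wheels) != 2:
          output.append(f'Expected 2 wheels, found {len(wheel_state.wheels)}')

      if len(wheel_state.wheels) < 2:
          output.append('Wheels fell off!')

      # And more...

      return '\n'.join(output)


  def process_my_snapshots(serialized_snapshot: bytes) -> str:
      """Runs the snapshot processor with a custom callback."""
      return pw_snapshot.processor.process_snapshots(
          serialized_snapshot, user_processing_callback=_process_hw_failures)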
				325	def process_my_snapshots(serialized_snapshot: bytes) -> str:
				326	"""Runs the snapshot processor with a custom callback."""
				327	return pw_snaphsot.processor.process_snapshots(
				328	serialized_snapshot, user_processing_callback=_process_hw_failures)