Blame - Documentation/arm/msm/system_health_monitor.txt - kernel/msm-4.9

blob: 4e2e4c6c6281f4cc1d9b07def81e29e1248d8185

Karthikeyan Ramasubramanian	04e3f90	2016-09-19 09:24:36 -0600	[diff] [blame]	1	Introduction
				2	============
				3
				4	System Health Monitor (SHM) passively monitors the health of the
				5	peripherals connected to the application processor. Software components
				6	on the application processor that experience a communication failure can
				7	request SHM to perform a system-wide health check. If any failure is
				8	detected during the health check, a subsystem restart is triggered for
				9	the failed subsystem.
				10
				11	Hardware description
				12	====================
				13
				14	SHM is solely a software component and it interfaces with peripherals
				15	through QMI communication. SHM does not control any hardware blocks and
				16	it uses subsystem_restart to restart any peripheral.
				17
				18	Software description
				19	====================
				20
				21	SHM hosts a QMI service in the kernel that is connected to the Health
				22	Monitor Agents (HMAs) hosted in the peripherals. The HMAs are initialized
				23	along with other critical services in the peripherals, and hence the
				24	connections between SHM and the HMAs are established during the early
				25	stages of the peripheral boot-up procedure. Software components within the
				26	application processor, in either user-space or kernel-space, identify a
				27	communication failure with a peripheral by a lack of response and report
				28	that failure to SHM. SHM then checks the health of the entire system
				29	through the HMAs that are connected to it. If all the HMAs respond in time,
				30	the failure report from the software component is ignored. If any HMA does
				31	not respond in time, SHM restarts the concerned peripheral. Figure 1
				32	shows a high-level design diagram and Figure 2 shows a flow diagram of the
				33	design.
				34
				35	Figure 1 - System Health Monitor Overview:
				36
				37	+------------------------------------+        +----------------------+
				38	|   Application Processor            |        |     Peripheral 1     |
				39	|                   +--------------+ |        |  +----------------+  |
				40	|                   | Applications | |        |  | Health Monitor |  |
				41	|                   +------+-------+ |   +------>|    Agent 1     |  |
				42	|  User-space              |         |   |    |  +----------------+  |
				43	+--------------------------|---------+   |    +----------------------+
				44	|  Kernel-space            v         |  QMI              .
				45	| +---------+      +---------------+ |   |               .
				46	| | Kernel  |----->| System Health |<----+               .
				47	| | Drivers |      |    Monitor    | |   |
				48	| +---------+      +---------------+ |  QMI   +----------------------+
				49	|                                    |   |    |     Peripheral N     |
				50	|                                    |   |    |  +----------------+  |
				51	|                                    |   |    |  | Health Monitor |  |
				52	|                                    |   +------>|    Agent N     |  |
				53	|                                    |        |  +----------------+  |
				54	+------------------------------------+        +----------------------+
				55
				56
				57	Figure 2 - System Health Monitor Message Flow with 2 peripherals:
				58
				59	+-----------+         +-------+         +-------+         +-------+
				60	|Application|         |  SHM  |         | HMA 1 |         | HMA 2 |
				61	+-----+-----+         +-------+         +---+---+         +---+---+
				62	      |                   |                 |                 |
				63	      |                   |                 |                 |
				64	      |   check_system    |                 |                 |
				65	      |------------------>|                 |                 |
				66	      |   _health()       |     Report_     |                 |
				67	      |                   |---------------->|                 |
				68	      |                   |  health_req(1)  |                 |
				69	      |                   |                 |                 |
				70	      |                   |     Report_     |                 |
				71	      |                   |---------------------------------->|
				72	      |                  +-+ health_req(2)  |                 |
				73	      |                  |T|                |                 |
				74	      |                  |i|                |                 |
				75	      |                  |m|                |                 |
				76	      |                  |e|    Report_     |                 |
				77	      |                  |o|<---------------|                 |
				78	      |                  |u| health_resp(1) |                 |
				79	      |                  |t|                |                 |
				80	      |                  +-+                |                 |
				81	      |                   |   subsystem_    |                 |
				82	      |                   |---------------------------------->|
				83	      |                   |   restart(2)    |                 |
				84	      +                   +                 +                 +
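
The decision that SHM takes at the end of the flow in Figure 2 can be
sketched in plain C. This is a user-space illustration only and not the
driver's code: struct hma, its fields and shm_evaluate_health_check() are
hypothetical names, and the QMI exchange is reduced to a per-HMA response
delay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one Health Monitor Agent as seen by SHM. */
struct hma {
	int subsys_id;         /* peripheral that hosts this HMA */
	int response_ms;       /* time taken to respond, -1 if no response */
	int restart_requested; /* set when SHM decides to restart the peripheral */
};

/*
 * Evaluate one system-wide health check. Every connected HMA has been
 * asked to report its health; any HMA that did not respond within
 * timeout_ms gets its peripheral marked for restart.
 *
 * Returns the number of peripherals marked for restart. A return value
 * of 0 means all HMAs responded in time and the failure report is ignored.
 */
static size_t shm_evaluate_health_check(struct hma *hmas, size_t n,
					int timeout_ms)
{
	size_t restarts = 0;
	size_t i;

	assert(hmas != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int timed_out = hmas[i].response_ms < 0 ||
				hmas[i].response_ms > timeout_ms;

		hmas[i].restart_requested = timed_out;
		if (timed_out)
			restarts++;
	}
	return restarts;
}
```

With two HMAs where the first responds after 150 ms and the second never
responds, a 2000 ms timeout marks only the second peripheral for restart,
which matches the subsystem_restart(2) step in Figure 2.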
				85
				86	HMAs can be extended to monitor the health of individual software services
				87	executing in their concerned peripherals. HMAs can restore the services
				88	that are not responding to a responsive state.
				89
				90	Design
				91	======
				92
				93	The design goals of SHM are to:
				94	* Restore the unresponsive peripheral to a responsive state.
				95	* Restore the unresponsive software services in a peripheral to a
				96	responsive state.
				97	* Perform power-efficient monitoring of the system health.
				98
				99	An alternative design that was considered sends keepalive messages in the
				100	IPC protocols at the transport layer. That approach requires rolling out
				101	the protocol update in all the peripherals together and hence introduces
				102	considerable coupling unless a suitable feature negotiation algorithm is
				103	implemented. It also requires every IPC protocol at the transport layer to
				104	be updated, which replicates effort. There are multiple link-layer
				105	protocols, and adding keepalive to the link-layer protocols does not solve
				106	issues at the client layer, which SHM does solve. Restoring a peripheral
				107	or a remote software service by an IPC protocol has not been an industry
				108	standard practice. Industry standard IPC protocols only terminate the
				109	connection if there is any communication failure and rely upon other
				110	mechanisms to restore the system to full operation.
				111
				112	Power Management
				113	================
				114
				115	This driver sends health monitor messages only upon request and hence
				116	does not wake up the application processor or any peripheral
				117	unnecessarily.
				118
				119	SMP/multi-core
				120	==============
				121
				122	This driver uses standard kernel mutexes and wait queues to achieve any
				123	required synchronization.
				124
				125	Security
				126	========
				127
				128	A Denial of Service (DoS) attack by an application that keeps requesting
				129	health checks at a high rate can be throttled by SHM to minimize the
				130	impact of the misbehaving application.
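
One possible way to throttle such requests is a fixed-window counter per
requesting application. The sketch below is an assumption made for
illustration: this document does not state which throttling scheme SHM
uses, and struct shm_throttle and shm_throttle_allow() are hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-application throttle state; not taken from the driver. */
struct shm_throttle {
	uint64_t window_start_ms; /* start of the current window */
	uint32_t window_len_ms;   /* length of one window */
	uint32_t max_requests;    /* requests allowed per window */
	uint32_t count;           /* requests accepted in the current window */
};

/*
 * Returns true if a health check request arriving at now_ms may proceed,
 * false if it should be rejected. now_ms is assumed to come from a
 * monotonic clock, so it never goes backwards.
 */
static bool shm_throttle_allow(struct shm_throttle *t, uint64_t now_ms)
{
	assert(t != NULL && t->window_len_ms > 0);

	/* Start a new window once the current one has elapsed. */
	if (now_ms - t->window_start_ms >= t->window_len_ms) {
		t->window_start_ms = now_ms;
		t->count = 0;
	}

	if (t->count >= t->max_requests)
		return false;

	t->count++;
	return true;
}
```

With a window of 1000 ms and a limit of 2 requests, the third request
inside the same window is rejected and requests are accepted again once
the next window starts.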
				131
				132	Interface
				133	=========
				134
				135	Kernel-space APIs:
				136	------------------
				137	/**
				138	 * kern_check_system_health() - Check the system health
				139	 *
				140	 * @return: 0 on success, standard Linux error codes on failure.
				141	 *
				142	 * This function is used by kernel drivers to initiate the system
				143	 * health check. It in turn triggers SHM to send a QMI message to
				144	 * all the HMAs connected to it.
				145	 */
				146	int kern_check_system_health(void);
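
A kernel driver would typically call this function from the path that
detects a missing response from a peripheral. The sketch below shows
that shape under stated assumptions: my_drv_handle_reply() is a
hypothetical driver helper, and kern_check_system_health() is replaced by
a stand-in stub so that the example builds outside the kernel.

```c
#include <assert.h>
#include <errno.h>

/*
 * Stand-in for the SHM API so this sketch builds outside the kernel.
 * In a real driver the function is provided by SHM and this stub is
 * dropped.
 */
static int stub_calls;

int kern_check_system_health(void)
{
	stub_calls++;
	return 0;
}

/*
 * Hypothetical driver helper, called once the wait for a peripheral's
 * reply has ended. reply_received is non-zero if the reply arrived.
 *
 * On a missing reply the failure is reported to SHM. The request itself
 * has still failed, so the caller gets -ETIMEDOUT, or the error from SHM
 * if the health check could not be initiated.
 */
static int my_drv_handle_reply(int reply_received)
{
	int rc;

	if (reply_received)
		return 0;

	rc = kern_check_system_health();
	if (rc)
		return rc;

	return -ETIMEDOUT;
}
```

The driver does not restart the peripheral itself. It only reports the
failure, and SHM decides whether a subsystem restart is needed.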
				147
				148	User-space Interface:
				149	---------------------
				150	This driver provides a device node (/dev/system_health_monitor) to
				151	user-space. A wrapper API library will be provided to user-space
				152	applications in order to initiate the system health check. The API in turn
				153	interfaces with the driver through the device node provided by the
				154	driver.
				155
				156	/**
				157	 * check_system_health() - Check the system health
				158	 *
				159	 * @return: 0 on success, -1 on failure.
				160	 *
				161	 * This function is used by user-space applications to initiate the
				162	 * system health check. It in turn triggers SHM to send a QMI message
				163	 * to all the HMAs connected to it.
				164	 */
				165	int check_system_health(void);
				166
				167	The above-mentioned interface function works by opening the device node
				168	provided by SHM, performing an ioctl operation and then closing the
				169	device node. The concerned ioctl command (CHECK_SYS_HEALTH_IOCTL) does
				170	not take any argument. The health check, including the handling of the
				171	response and the timeout, is performed asynchronously.
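
The open, ioctl and close sequence described above could look roughly like
the following sketch. Several details are assumptions: the request number
is a placeholder because this document does not give the value of
CHECK_SYS_HEALTH_IOCTL, the open mode is a guess, and
check_system_health_at() is a hypothetical helper that takes the device
path as a parameter so that it can be exercised without the driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define SHM_DEV_PATH "/dev/system_health_monitor"

/*
 * PLACEHOLDER value only. The real request number must come from the
 * header that defines CHECK_SYS_HEALTH_IOCTL for the driver.
 */
#define CHECK_SYS_HEALTH_IOCTL_PLACEHOLDER 0UL

/*
 * Open the device node at path, issue the argument-less health check
 * ioctl and close the node again.
 *
 * Returns 0 on success and -1 on failure, matching the contract
 * documented for check_system_health().
 */
static int check_system_health_at(const char *path)
{
	int fd;
	int rc;

	assert(path != NULL);

	fd = open(path, O_RDWR); /* open mode is an assumption */
	if (fd < 0)
		return -1;

	rc = ioctl(fd, CHECK_SYS_HEALTH_IOCTL_PLACEHOLDER);
	close(fd);

	return rc < 0 ? -1 : 0;
}
```

A wrapper library would then call check_system_health_at(SHM_DEV_PATH).
On a system where the device node does not exist, the open fails and the
function returns -1.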
				172
				173	Driver parameters
				174	=================
				175
				176	The duration for which SHM waits for a response from the HMAs can be
				177	configured using a module parameter. This parameter is intended only
				178	for debugging purposes. The default SHM health check timeout is 2s,
				179	which can be overridden by the timeout provided by an HMA during
				180	connection establishment.
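
The timeout selection described above can be sketched as a small helper.
The names are hypothetical, and treating a value of 0 as "the HMA did not
provide a timeout" is an assumption made for this example.

```c
#include <assert.h>

/* Default health check timeout of 2s, as stated in the text above. */
#define SHM_DEFAULT_TIMEOUT_MS 2000U

/*
 * Pick the timeout to use for one HMA. hma_timeout_ms is the timeout the
 * HMA provided during connection establishment, or 0 if it provided none
 * (assumption).
 */
static unsigned int shm_effective_timeout_ms(unsigned int hma_timeout_ms)
{
	unsigned int timeout_ms = hma_timeout_ms ? hma_timeout_ms
						 : SHM_DEFAULT_TIMEOUT_MS;

	assert(timeout_ms > 0);
	return timeout_ms;
}
```

An HMA that provides 500 ms is therefore checked against 500 ms, while an
HMA that provides nothing is checked against the 2s default.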
				181
				182	Config options
				183	==============
				184
				185	This driver is enabled through the kernel config option
				186	CONFIG_SYSTEM_HEALTH_MONITOR.
				187
				188	Dependencies
				189	============
				190
				191	This driver depends on the following kernel modules for its complete
				192	functionality:
				193	* Kernel QMI interface
				194	* Subsystem Restart support
				195
				196	User space utilities
				197	====================
				198
				199	Any user-space or kernel-space module that experiences a communication
				200	failure with a peripheral will interface with this driver. Some of these
				201	modules include:
				202	* RIL
				203	* Location Manager
				204	* Data Services
				205
				206	Other
				207	=====
				208
				209	SHM provides a debug interface to enumerate some information regarding the
				210	recent health checks. The debug information includes, but is not limited to:
				211	* application name that triggered the health check.
				212	* time of the health check.
				213	* status of the health check.