Introduction
============

System Health Monitor (SHM) passively monitors the health of the
peripherals connected to the application processor. Software components
in the application processor that experience a communication failure can
request SHM to perform a system-wide health check. If any failures are
detected during the health check, a subsystem restart is triggered for
the failed subsystem.

Hardware description
====================

SHM is purely a software component and interfaces with the peripherals
through QMI communication. SHM does not control any hardware blocks;
it uses subsystem_restart to restart any peripheral.

Software description
====================

SHM hosts a QMI service in the kernel that is connected to the Health
Monitor Agents (HMA) hosted in the peripherals. HMAs in the peripherals
are initialized along with other critical services in the peripherals,
and hence the connection between SHM and the HMAs is established during
the early stages of the peripheral boot-up procedure. Software components
within the application processor, either user-space or kernel-space,
identify a communication failure with a peripheral by a lack of response
and report that failure to SHM. SHM then checks the health of the entire
system through the HMAs connected to it. If all the HMAs respond in time,
the failure reported by the software component is ignored. If any HMA does
not respond in time, SHM restarts the concerned peripheral. Figure 1
shows a high-level design diagram and Figure 2 shows a flow diagram of
the design.

Figure 1 - System Health Monitor Overview:

  +------------------------------------+         +----------------------+
  |       Application Processor        |         |     Peripheral 1     |
  |  +--------------+                  |         |  +----------------+  |
  |  | Applications |                  |         |  | Health Monitor |  |
  |  +------+-------+                  |  +----->|  |    Agent 1     |  |
  |         |             User-space   |  |      |  +----------------+  |
  +---------|--------------------------+  |      +----------------------+
  |         v             Kernel-space |  | QMI             .
  |  +---------+   +---------------+   |  |                 .
  |  | Kernel  |-->| System Health |<-----+                 .
  |  | Drivers |   |    Monitor    |<-----+
  |  +---------+   +---------------+   |  | QMI  +----------------------+
  |                                    |  |      |     Peripheral N     |
  |                                    |  |      |  +----------------+  |
  |                                    |  |      |  | Health Monitor |  |
  |                                    |  +----->|  |    Agent N     |  |
  |                                    |         |  +----------------+  |
  +------------------------------------+         +----------------------+


Figure 2 - System Health Monitor Message Flow with 2 peripherals:

  +-----------+       +-------+        +-------+       +-------+
  |Application|       |  SHM  |        | HMA 1 |       | HMA 2 |
  +-----+-----+       +---+---+        +---+---+       +---+---+
        |                 |                |               |
        |                 |                |               |
        |  check_system   |                |               |
        |---------------->|                |               |
        |   _health()     |    Report_     |               |
        |                 |--------------->|               |
        |                 | health_req(1)  |               |
        |                 |                |               |
        |                 |    Report_     |               |
        |                 |------------------------------->|
        |                +-+ health_req(2) |               |
        |                |T|               |               |
        |                |i|               |               |
        |                |m|               |               |
        |                |e|   Report_     |               |
        |                |o|<--------------|               |
        |                |u| health_resp(1)|               |
        |                |t|               |               |
        |                +-+               |               |
        |                 |   subsystem_   |               |
        |                 |------------------------------->|
        |                 |   restart(2)   |               |
        +                 +                +               +

HMAs can be extended to monitor the health of individual software services
executing in their respective peripherals. HMAs can then restore
unresponsive services to a responsive state.

Design
======

The design goals of SHM are to:
 * Restore an unresponsive peripheral to a responsive state.
 * Restore unresponsive software services in a peripheral to a
   responsive state.
 * Perform power-efficient monitoring of the system health.

An alternate design that was considered involves sending keep-alive
messages in the IPC protocols at the transport layer. This approach
requires rolling out the protocol update to all the peripherals together
and hence introduces considerable coupling, unless a suitable feature
negotiation algorithm is implemented. It also requires updating all the
IPC protocols at the transport layer, and hence a replication of effort.
Further, there are multiple link-layer protocols, and adding keep-alive
support at the link layer does not solve failures at the client layer,
which SHM does address. Restoring a peripheral or a remote software
service from within an IPC protocol has not been an industry-standard
practice. Industry-standard IPC protocols only terminate the connection
upon a communication failure and rely upon other mechanisms to restore
the system to full operation.

Power Management
================

This driver ensures that the health monitor messages are sent only upon
request, and hence it does not wake up the application processor or any
peripheral unnecessarily.

SMP/multi-core
==============

This driver uses standard kernel mutexes and wait queues to achieve the
required synchronization.

Security
========

A Denial of Service (DoS) attack by an application that keeps requesting
health checks at a high rate can be throttled by SHM to minimize the
impact of the misbehaving application.

Interface
=========

Kernel-space APIs:
------------------
/**
 * kern_check_system_health() - Check the system health
 *
 * @return: 0 on success, standard Linux error codes on failure.
 *
 * This function is used by kernel drivers to initiate a system health
 * check. It in turn triggers SHM to send a QMI message to all the HMAs
 * connected to it.
 */
int kern_check_system_health(void);

User-space Interface:
---------------------
This driver provides a device interface (/dev/system_health_monitor) to
user-space. A wrapper API library will be provided to user-space
applications in order to initiate the system health check. The API in
turn interfaces with the driver through this device node.

/**
 * check_system_health() - Check the system health
 *
 * @return: 0 on success, -1 on failure.
 *
 * This function is used by user-space applications to initiate a system
 * health check. It in turn triggers SHM to send a QMI message to all
 * the HMAs connected to it.
 */
int check_system_health(void);

The above interface function works by opening the device node provided
by SHM, performing an ioctl operation and then closing the device node.
The concerned ioctl command (CHECK_SYS_HEALTH_IOCTL) does not take any
argument. SHM performs the health check and handles the response and
timeout asynchronously.

Driver parameters
=================

The time duration for which SHM waits for a response from the HMAs can
be configured using a module parameter. This parameter is intended to be
used only for debugging purposes. The default SHM health-check timeout
is 2 s, which can be overridden by the timeout provided by an HMA during
connection establishment.

Config options
==============

This driver is enabled through the kernel config option
CONFIG_SYSTEM_HEALTH_MONITOR.

Dependencies
============

This driver depends on the following kernel modules for its complete
functionality:
 * Kernel QMI interface
 * Subsystem Restart support

User space utilities
====================

Any user-space or kernel-space modules that experience communication
failure with the peripherals will interface with this driver. Some such
modules include:
 * RIL
 * Location Manager
 * Data Services

Other
=====

SHM provides a debug interface to enumerate information regarding the
recent health checks. The debug information includes, but is not limited
to:
* the name of the application that triggered the health check.
* the time of the health check.
* the status of the health check.