Introduction
============

System Health Monitor (SHM) passively monitors the health of the
peripherals connected to the application processor. Software components
in the application processor that experience a communication failure can
request SHM to perform a system-wide health check. If any failures are
detected during the health check, a subsystem restart is triggered for
the failed subsystem.

Hardware description
====================

SHM is purely a software component and it interfaces with peripherals
through QMI communication. SHM does not control any hardware blocks; it
uses subsystem_restart to restart a peripheral.

Software description
====================

SHM hosts a QMI service in the kernel that is connected to the Health
Monitor Agents (HMAs) hosted in the peripherals. HMAs are initialized
along with other critical services in the peripherals, and hence the
connections between SHM and the HMAs are established during the early
stages of the peripheral boot-up procedure. Software components within the
application processor, either user-space or kernel-space, identify a
communication failure with a peripheral by a lack of response and report
that failure to SHM. SHM then checks the health of the entire system
through the HMAs that are connected to it. If all the HMAs respond in
time, the failure report from the software component is ignored. If any
HMA does not respond in time, SHM restarts the concerned peripheral.
Figure 1 shows a high-level design diagram and Figure 2 shows a flow
diagram of the design.

Figure 1 - System Health Monitor Overview:

+------------------------------------+         +----------------------+
|   Application Processor            |         |     Peripheral 1     |
|                  +--------------+  |         |  +----------------+  |
|                  | Applications |  |         |  | Health Monitor |  |
|                  +------+-------+  |   +------->|    Agent 1     |  |
|  User-space             |          |   |     |  +----------------+  |
+-------------------------|----------+   |     +----------------------+
|  Kernel-space           v          |  QMI               .
| +---------+      +---------------+ |   |                .
| | Kernel  |----->| System Health |<----+                .
| | Drivers |      |    Monitor    | |   |
| +---------+      +---------------+ |  QMI    +----------------------+
|                                    |   |     |     Peripheral N     |
|                                    |   |     |  +----------------+  |
|                                    |   |     |  | Health Monitor |  |
|                                    |   +------->|    Agent N     |  |
|                                    |         |  +----------------+  |
+------------------------------------+         +----------------------+


Figure 2 - System Health Monitor Message Flow with 2 peripherals:

+-----------+          +-------+         +-------+         +-------+
|Application|          |  SHM  |         | HMA 1 |         | HMA 2 |
+-----+-----+          +-------+         +---+---+         +---+---+
      |                    |                 |                 |
      |                    |                 |                 |
      |    check_system    |                 |                 |
      |------------------->|                 |                 |
      |     _health()      |     Report_     |                 |
      |                    |---------------->|                 |
      |                    |  health_req(1)  |                 |
      |                    |                 |                 |
      |                    |     Report_     |                 |
      |                    |---------------------------------->|
      |                   +-+ health_req(2)  |                 |
      |                   |T|                |                 |
      |                   |i|                |                 |
      |                   |m|                |                 |
      |                   |e|    Report_     |                 |
      |                   |o|<---------------|                 |
      |                   |u| health_resp(1) |                 |
      |                   |t|                |                 |
      |                   +-+                |                 |
      |                    |   subsystem_    |                 |
      |                    |---------------------------------->|
      |                    |   restart(2)    |                 |
      +                    +                 +                 +

HMAs can be extended to monitor the health of individual software services
executing in their respective peripherals. HMAs can then restore services
that are not responding to a responsive state.
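
The decision step of the message flow in Figure 2 can be sketched in plain
C as follows. This is only an illustrative user-space model, not the driver
code: struct hma, shm_check_all() and the responded/restarted flags are
hypothetical names, and the QMI exchange and the timeout wait are reduced to
a flag that records whether a response arrived in time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one Health Monitor Agent connection. */
struct hma {
	int  subsys_id;  /* peripheral that hosts this agent */
	bool responded;  /* true if a health response arrived before the timeout */
	bool restarted;  /* set when the peripheral is marked for restart */
};

/*
 * Model of one system-wide health check: every connected HMA is examined,
 * and each peripheral whose HMA stayed silent is marked for restart.
 * Returns the number of peripherals marked for restart. A return value
 * of 0 means the failure report from the client is ignored.
 */
static int shm_check_all(struct hma *hmas, size_t count)
{
	int restarts = 0;

	for (size_t i = 0; i < count; i++) {
		if (!hmas[i].responded) {
			/* Stands in for the subsystem_restart call. */
			hmas[i].restarted = true;
			restarts++;
		}
	}
	return restarts;
}
```

With two agents where only HMA 1 responds, as in Figure 2, the model marks
only peripheral 2 for restart and leaves peripheral 1 untouched.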

Design
======

The design goals of SHM are to:
* Restore an unresponsive peripheral to a responsive state.
* Restore unresponsive software services in a peripheral to a
  responsive state.
* Perform power-efficient monitoring of the system health.

An alternative design that was considered is to send keepalive messages
in the IPC protocols at the transport layer. That approach requires
rolling out the protocol update in all the peripherals together, and hence
has considerable coupling unless a suitable feature negotiation algorithm
is implemented. It also requires every transport-layer IPC protocol to be
updated, which replicates effort. There are multiple link-layer protocols
as well, and adding keepalives to the link-layer protocols does not solve
issues at the client layer, which SHM does. Restoring a peripheral or a
remote software service from within an IPC protocol has not been an
industry standard practice. Industry standard IPC protocols only terminate
the connection when there is a communication failure and rely upon other
mechanisms to restore the system to full operation.

Power Management
================

This driver ensures that the health monitor messages are sent only upon
request and hence does not wake up the application processor or any
peripheral unnecessarily.

SMP/multi-core
==============

This driver uses standard kernel mutexes and wait queues to achieve any
required synchronization.

Security
========

A Denial of Service (DoS) attack by an application that keeps requesting
health checks at a high rate can be throttled by SHM to minimize the
impact of the misbehaving application.
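
This document does not say how the throttling is done. One simple
possibility, shown here purely as an assumption, is to accept at most one
health check request per fixed interval. The names struct shm_throttle and
shm_throttle_allow(), and the idea of a minimum interval, are hypothetical
and are not taken from the driver. The caller is assumed to pass a
monotonic timestamp in milliseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state: at most one health check per interval. */
struct shm_throttle {
	uint64_t min_interval_ms; /* assumed policy, not taken from the driver */
	uint64_t last_accept_ms;  /* time of the last accepted request */
	bool     has_accepted;    /* false until the first request is accepted */
};

/* Returns true if a request arriving at now_ms may start a health check. */
static bool shm_throttle_allow(struct shm_throttle *t, uint64_t now_ms)
{
	if (t->has_accepted &&
	    now_ms - t->last_accept_ms < t->min_interval_ms)
		return false; /* too soon after the previous check: drop it */

	t->last_accept_ms = now_ms;
	t->has_accepted = true;
	return true;
}
```

A rejected request does not update the timestamp, so an application that
keeps retrying cannot push the next accepted check further into the future.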

Interface
=========

Kernel-space APIs:
------------------
/**
 * kern_check_system_health() - Check the system health
 *
 * @return: 0 on success, standard Linux error codes on failure.
 *
 * This function is used by the kernel drivers to initiate the
 * system health check. This function in turn triggers SHM to send
 * a QMI message to all the HMAs connected to it.
 */
int kern_check_system_health(void);

User-space Interface:
---------------------
This driver provides a devfs interface (/dev/system_health_monitor) to
user-space. A wrapper API library will be provided to the user-space
applications in order to initiate the system health check. The API in turn
interfaces with the driver through the devfs interface provided by the
driver.

/**
 * check_system_health() - Check the system health
 *
 * @return: 0 on success, -1 on failure.
 *
 * This function is used by the user-space applications to initiate the
 * system health check. This function in turn triggers SHM to send a QMI
 * message to all the HMAs connected to it.
 */
int check_system_health(void);

The above-mentioned interface function works by opening the devfs
interface provided by SHM, performing an ioctl operation and then closing
the devfs interface. The concerned ioctl command (CHECK_SYS_HEALTH_IOCTL)
does not take any argument. This function performs the health check and
handles the response and timeout in an asynchronous manner.
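
The open/ioctl/close sequence described above could look roughly like the
following sketch. It is not the wrapper library itself. The numeric value
given to CHECK_SYS_HEALTH_IOCTL here is a placeholder, since the real
definition lives in the driver's header, and the helper
check_system_health_at() and the O_RDWR open mode are assumptions made for
illustration.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * ASSUMPTION: placeholder request code. The real value of
 * CHECK_SYS_HEALTH_IOCTL comes from the driver's header, which is
 * not part of this document.
 */
#ifndef CHECK_SYS_HEALTH_IOCTL
#define CHECK_SYS_HEALTH_IOCTL _IO(0xC3, 0)
#endif

/* Hypothetical helper: same steps as the wrapper, with the path injectable. */
static int check_system_health_at(const char *dev_path)
{
	int fd = open(dev_path, O_RDWR); /* open mode is an assumption */
	int rc;

	if (fd < 0)
		return -1;

	/* The command takes no argument. */
	rc = ioctl(fd, CHECK_SYS_HEALTH_IOCTL);
	close(fd);

	return rc < 0 ? -1 : 0;
}

int check_system_health(void)
{
	return check_system_health_at("/dev/system_health_monitor");
}
```

On a system where the device node does not exist, the open fails and the
function returns -1, which matches the documented failure value.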

Driver parameters
=================

The time duration for which SHM waits for a response from the HMAs can be
configured using a module parameter. This parameter is intended only for
debugging purposes. The default SHM health check timeout is 2s, which can
be overridden by the timeout provided by an HMA during connection
establishment.
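
The precedence between the two timeout sources can be illustrated with a
small helper. This is a sketch under assumptions: the function name
shm_effective_timeout_ms() is hypothetical, and treating a value of 0 as
"the HMA did not provide a timeout" is a convention chosen here, not one
stated by this document.

```c
#define SHM_DEFAULT_TIMEOUT_MS 2000U /* the 2s default stated above */

/*
 * Pick the timeout used for one HMA.
 * param_ms: value of the module parameter (SHM_DEFAULT_TIMEOUT_MS unless
 *           changed for debugging).
 * hma_ms:   timeout the HMA supplied at connection time; 0 is assumed to
 *           mean that none was supplied.
 */
static unsigned int shm_effective_timeout_ms(unsigned int param_ms,
					     unsigned int hma_ms)
{
	return hma_ms ? hma_ms : param_ms;
}
```

For example, an HMA that supplies 500 ms is waited on for 500 ms, while an
HMA that supplies nothing falls back to the module parameter value.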

Config options
==============

This driver is enabled through the kernel config option
CONFIG_SYSTEM_HEALTH_MONITOR.

Dependencies
============

This driver depends on the following kernel modules for its complete
functionality:
* Kernel QMI interface
* Subsystem Restart support

User space utilities
====================

Any user-space or kernel-space module that experiences a communication
failure with a peripheral will interface with this driver. Some of these
modules include:
* RIL
* Location Manager
* Data Services

Other
=====

SHM provides a debug interface to enumerate information regarding the
recent health checks. The debug information includes, but is not limited
to:
* application name that triggered the health check.
* time of the health check.
* status of the health check.