=======================
INTEL POWERCLAMP DRIVER
=======================
By: Arjan van de Ven <arjan@linux.intel.com>
    Jacob Pan <jacob.jun.pan@linux.intel.com>

Contents:
        (*) Introduction
            - Goals and Objectives

        (*) Theory of Operation
            - Idle Injection
            - Calibration

        (*) Performance Analysis
            - Effectiveness and Limitations
            - Power vs Performance
            - Scalability
            - Calibration
            - Comparison with Alternative Techniques

        (*) Usage and Interfaces
            - Generic Thermal Layer (sysfs)
            - Kernel APIs (TBD)

============
INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve
forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


===================
THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are:
      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state residency. The intel_powerclamp driver is conceived as
such a control system, where the target set point is a user-selected
idle ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.
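
The feedback quantity described above can be sketched in a few lines of
C. This is only an illustration, not the driver's code: the function
names are hypothetical, and the sketch assumes the residency counters
advance at the same rate as the TSC, so that the ratio of the two deltas
over a sampling window gives the fraction of time spent in package
C-states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Package C-state residency over one sampling window, in percent.
 * Assumption of this sketch: the residency counter and the TSC tick at
 * the same rate, so dividing their deltas yields the idle fraction.
 */
unsigned int pkg_cstate_ratio(uint64_t res_prev, uint64_t res_now,
                              uint64_t tsc_prev, uint64_t tsc_now)
{
	uint64_t res_delta = res_now - res_prev;
	uint64_t tsc_delta = tsc_now - tsc_prev;

	assert(tsc_delta > 0);
	return (unsigned int)(100 * res_delta / tsc_delta);
}

/*
 * Control error as defined in the text: actual residency ratio minus
 * the target idle ratio. Negative means the package idled less than
 * requested.
 */
int idle_ratio_error(unsigned int actual_pct, unsigned int target_pct)
{
	return (int)actual_pct - (int)target_pct;
}
```

With several residency MSRs available, a caller would presumably sum
their deltas before calling pkg_cstate_ratio(); that choice is left
open here.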

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so accumulated errors can be prevented to avoid a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows the
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a perf timechart.
The diagram below shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

CPU0
                  ____________          ____________
kidle_inject/0   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                               duration
CPU1
                  ____________          ____________
kidle_inject/1   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                              ^
                              |
                              |
                              roundup(jiffies, interval)
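
The alignment marked roundup(jiffies, interval) in the diagram can be
illustrated with the following sketch. The function names are
hypothetical, and deriving the interval from the idle duration and the
target ratio is an assumption made for this example. The point is that
threads waking at slightly different jiffies values still compute the
same next injection time, so no drift accumulates between CPUs.

```c
#include <assert.h>

/*
 * Length of one full cycle, in jiffies, such that "duration" jiffies
 * of idle make up ratio_pct percent of the cycle (assumed relation).
 */
unsigned long injection_interval(unsigned long duration_jiffies,
                                 unsigned int ratio_pct)
{
	assert(ratio_pct > 0 && ratio_pct <= 100);
	return duration_jiffies * 100 / ratio_pct;
}

/* Round "now" up to the next multiple of "interval". */
unsigned long next_injection(unsigned long now, unsigned long interval)
{
	assert(interval > 0);
	return (now + interval - 1) / interval * interval;
}
```

For example, with a 6-jiffy idle duration and a 25% target, the interval
is 24 jiffies; threads that read jiffies as 1001 and 1007 both schedule
their next injection at 1008.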

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
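
A minimal sketch of such a PID-based controller is shown below. It is
not part of the driver: the structure, the function name, and the gain
values a caller would pick are all hypothetical. The output is an idle
percentage clamped to the 0..50 range accepted by powerclamp, which the
caller would then write to the cooling device.

```c
#include <assert.h>

#define MAX_IDLE_PCT 50	/* powerclamp caps idle injection at 50 percent */

struct pid_state {
	double kp, ki, kd;	/* gains, to be tuned per system (hypothetical) */
	double integral;	/* accumulated error over past samples */
	double prev_err;	/* error seen at the previous sample */
};

/*
 * One controller step. Returns the idle percentage to request, given
 * the measured and target temperatures and the time since the last
 * sample. No anti-windup is implemented in this sketch.
 */
int pid_idle_pct(struct pid_state *s, double temp_c, double target_c,
                 double dt_s)
{
	double err = temp_c - target_c;	/* positive when too hot */
	double deriv, out;

	assert(dt_s > 0.0);
	s->integral += err * dt_s;
	deriv = (err - s->prev_err) / dt_s;
	s->prev_err = err;

	out = s->kp * err + s->ki * s->integral + s->kd * deriv;
	if (out < 0.0)
		out = 0.0;
	if (out > MAX_IDLE_PCT)
		out = MAX_IDLE_PCT;
	return (int)out;
}
```

A real controller would also need to bound the integral term, since the
output saturates at 50 percent.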


Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

        a) steady state error compensation
        This is to offset the error occurring when the system can
        enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation
        When an excessive amount of wakeups occurs during idle, an
        additional idle ratio can be added to quiet interrupts, by
        slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results. The following output is from a Westmere system:
[jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
controlling cpu: 0
pct     confidence  steady  dynamic (compensation)
0       0           0       0
1       1           0       0
2       1           1       0
3       3           1       0
4       3           1       0
5       3           1       0
6       3           1       0
7       3           1       0
8       3           1       0
...
30      3           2       0
31      3           2       0
32      3           1       0
33      3           2       0
34      3           1       0
35      3           2       0
36      3           1       0
37      3           2       0
38      3           1       0
39      3           2       0
40      3           3       0
41      3           1       0
42      3           2       0
43      3           1       0
44      3           1       0
45      3           2       0
46      3           3       0
47      3           0       0
48      3           2       0
49      3           3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, a simple algorithm doubles the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.
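
The two kinds of compensation can be combined as in the sketch below.
This is an illustration under stated assumptions, not the driver's
implementation: the function name and parameters are hypothetical, the
"confident" flag stands in for the confidence check on adjacent ratios,
and the choice to let the doubling penalty replace the steady state
offset (rather than add to it) is made only for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IDLE_PCT 50	/* injection ratio is capped at 50 percent */

/*
 * Ratio to actually inject for a requested target ratio.
 *  - excessive_wakeups: dynamic compensation, double the injection ratio
 *  - confident: steady state compensation may be trusted and added
 * The result never exceeds MAX_IDLE_PCT.
 */
unsigned int compensated_ratio(unsigned int target_pct,
                               unsigned int steady_comp,
                               bool confident, bool excessive_wakeups)
{
	unsigned int ratio = target_pct;

	assert(target_pct <= MAX_IDLE_PCT);
	if (excessive_wakeups)
		ratio = 2 * target_pct;
	else if (confident)
		ratio = target_pct + steady_comp;

	if (ratio > MAX_IDLE_PCT)
		ratio = MAX_IDLE_PCT;
	return ratio;
}
```

For a 10% target with a steady state offset of 2, this yields 12% once
the calibration data is trusted, and 20% while excessive wakeups are
being detected.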


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


====================
Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As mentioned
earlier, since interrupts are allowed during forced idle time,
excessive interrupts could result in less effectiveness. The extreme
case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such as
using scp to copy a large file, applications can be throttled by the
powerclamp driver, since slowing down the CPU also slows down network
protocol processing, which in turn reduces interrupts.

When the controlling CPU changes control parameters at runtime, it
may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and thus the package is not able to enter C-states at the expected
ratio. But this effect is minor, in that in most cases the target
ratio is updated much less frequently than the idle injection
frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
This is the reason the calibration code is needed.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

====================
Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones.

jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
cur_state:0
max_state:50
type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.

Example usage:
- To inject 25% idle time
        $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"
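
The same write can be issued from a program. The sketch below uses only
standard C file I/O on the cur_state attribute; the helper names are
hypothetical, and the path is passed in by the caller because the
cooling device number differs between systems (look for the device
whose type reads intel_powerclamp).

```c
#include <assert.h>
#include <stdio.h>

/* Write the requested idle percentage. Returns 0 on success, -1 on error. */
int set_idle_pct(const char *cur_state_path, int pct)
{
	FILE *f = fopen(cur_state_path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", pct) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Read cur_state into *pct. Returns 0 on success, -1 on error.
 * A stored value of -1 means idle injection is disabled.
 */
int get_idle_pct(const char *cur_state_path, int *pct)
{
	FILE *f;
	int n;

	assert(pct != NULL);
	f = fopen(cur_state_path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%d", pct);
	fclose(f);
	return n == 1 ? 0 : -1;
}
```

Writing to the sysfs attribute normally requires root privileges, and
the value read back may differ from the value written, for the reasons
given above.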

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. In that
case, top will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same code path is
taken as for the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of a slowdown when the powerclamp driver is
in action.


Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an Ultrabook user can compile the kernel while
staying under a certain temperature (below most active trip points).