Morten Rasmussen | 577daca | 2015-01-13 13:43:28 +0000 | [diff] [blame] | 1 | Energy cost model for energy-aware scheduling (EXPERIMENTAL) |
| 2 | |
| 3 | Introduction |
| 4 | ============= |
| 5 | |
| 6 | The basic energy model uses platform energy data stored in sched_group_energy |
| 7 | data structures attached to the sched_groups in the sched_domain hierarchy. The |
| 8 | energy cost model offers two functions that can be used to guide scheduling |
| 9 | decisions: |
| 10 | |
| 11 | 1. static unsigned int sched_group_energy(struct energy_env *eenv) |
| 12 | 2. static int energy_diff(struct energy_env *eenv) |
| 13 | |
| 14 | sched_group_energy() estimates the energy consumed by all cpus in a specific |
| 15 | sched_group including any shared resources owned exclusively by this group of |
| 16 | cpus. Resources shared with other cpus are excluded (e.g. later level caches). |
| 17 | |
| 18 | energy_diff() estimates the total energy impact of a utilization change. That |
| 19 | is, adding, removing, or migrating utilization (tasks). |
| 20 | |
| 21 | Both functions use a struct energy_env to specify the scenario to be evaluated: |
| 22 | |
| 23 | struct energy_env { |
| 24 | struct sched_group *sg_top; |
| 25 | struct sched_group *sg_cap; |
| 26 | int cap_idx; |
| 27 | int util_delta; |
| 28 | int src_cpu; |
| 29 | int dst_cpu; |
| 30 | int energy; |
| 31 | }; |
| 32 | |
| 33 | sg_top: sched_group to be evaluated. Not used by energy_diff(). |
| 34 | |
| 35 | sg_cap: sched_group covering the cpus in the same frequency domain. Set by |
| 36 | sched_group_energy(). |
| 37 | |
| 38 | cap_idx: Capacity state to be used for energy calculations. Set by |
| 39 | find_new_capacity(). |
| 40 | |
| 41 | util_delta: Amount of utilization to be added, removed, or migrated. |
| 42 | |
| 43 | src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be |
| 44 | -1 if no source (e.g. task wake-up). |
| 45 | |
| 46 | dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1 |
| 47 | if utilization is removed (e.g. terminating tasks). |
| 48 | |
| 49 | energy: Result of sched_group_energy(). |
| 50 | |
| 51 | The metric used to represent utilization is the actual per-entity running time |
| 52 | averaged over time using a geometric series. Very similar to the existing |
| 53 | per-entity load-tracking, but _not_ scaled by task priority and capped by the |
| 54 | capacity of the cpu. The latter property does mean that utilization may |
| 55 | underestimate the compute requirements for task on fully/over utilized cpus. |
| 56 | The greatest potential for energy savings without affecting performance too much |
| 57 | is scenarios where the system isn't fully utilized. If the system is deemed |
| 58 | fully utilized load-balancing should be done with task load (includes task |
| 59 | priority) instead in the interest of fairness and performance. |
| 60 | |
| 61 | |
| 62 | Background and Terminology |
| 63 | =========================== |
| 64 | |
| 65 | To make it clear from the start: |
| 66 | |
| 67 | energy = [joule] (resource like a battery on powered devices) |
| 68 | power = energy/time = [joule/second] = [watt] |
| 69 | |
| 70 | The goal of energy-aware scheduling is to minimize energy, while still getting |
| 71 | the job done. That is, we want to maximize: |
| 72 | |
| 73 | performance [inst/s] |
| 74 | -------------------- |
| 75 | power [W] |
| 76 | |
| 77 | which is equivalent to minimizing: |
| 78 | |
| 79 | energy [J] |
| 80 | ----------- |
| 81 | instruction |
| 82 | |
| 83 | while still getting 'good' performance. It is essentially an alternative |
| 84 | optimization objective to the current performance-only objective for the |
| 85 | scheduler. This alternative considers two objectives: energy-efficiency and |
| 86 | performance. Hence, there needs to be a user controllable knob to switch the |
| 87 | objective. Since it is early days, this is currently a sched_feature |
| 88 | (ENERGY_AWARE). |
| 89 | |
| 90 | The idea behind introducing an energy cost model is to allow the scheduler to |
| 91 | evaluate the implications of its decisions rather than applying energy-saving |
| 92 | techniques blindly that may only have positive effects on some platforms. At |
| 93 | the same time, the energy cost model must be as simple as possible to minimize |
| 94 | the scheduler latency impact. |
| 95 | |
| 96 | Platform topology |
| 97 | ------------------ |
| 98 | |
| 99 | The system topology (cpus, caches, and NUMA information, not peripherals) is |
| 100 | represented in the scheduler by the sched_domain hierarchy which has |
| 101 | sched_groups attached at each level that covers one or more cpus (see |
| 102 | sched-domains.txt for more details). To add energy awareness to the scheduler |
| 103 | we need to consider power and frequency domains. |
| 104 | |
| 105 | Power domain: |
| 106 | |
| 107 | A power domain is a part of the system that can be powered on/off |
| 108 | independently. Power domains are typically organized in a hierarchy where you |
| 109 | may be able to power down just a cpu or a group of cpus along with any |
| 110 | associated resources (e.g. shared caches). Powering up a cpu means that all |
| 111 | power domains it is a part of in the hierarchy must be powered up. Hence, it is |
| 112 | more expensive to power up the first cpu that belongs to a higher level power |
| 113 | domain than powering up additional cpus in the same high level domain. Two |
| 114 | level power domain hierarchy example: |
| 115 | |
| 116 | Power source |
| 117 | +-------------------------------+----... |
| 118 | per group PD G G |
| 119 | | +----------+ | |
| 120 | +--------+-------| Shared | (other groups) |
| 121 | per-cpu PD G G | resource | |
| 122 | | | +----------+ |
| 123 | +-------+ +-------+ |
| 124 | | CPU 0 | | CPU 1 | |
| 125 | +-------+ +-------+ |
| 126 | |
| 127 | Frequency domain: |
| 128 | |
| 129 | Frequency domains (P-states) typically cover the same group of cpus as one of |
| 130 | the power domain levels. That is, there might be several smaller power domains |
| 131 | sharing the same frequency (P-state) or there might be a power domain spanning |
| 132 | multiple frequency domains. |
| 133 | |
| 134 | From a scheduling point of view there is no need to know the actual frequencies |
| 135 | [Hz]. All the scheduler cares about is the compute capacity available at the |
| 136 | current state (P-state) the cpu is in and any other available states. For that |
| 137 | reason, and to also factor in any cpu micro-architecture differences, compute |
| 138 | capacity scaling states are called 'capacity states' in this document. For SMP |
| 139 | systems this is equivalent to P-states. For mixed micro-architecture systems |
| 140 | (like ARM big.LITTLE) it is P-states scaled according to the micro-architecture |
| 141 | performance relative to the other cpus in the system. |
| 142 | |
| 143 | Energy modelling: |
| 144 | ------------------ |
| 145 | |
| 146 | Due to the hierarchical nature of the power domains, the most obvious way to |
| 147 | model energy costs is therefore to associate power and energy costs with |
| 148 | domains (groups of cpus). Energy costs of shared resources are associated with |
| 149 | the group of cpus that share the resources, only the cost of powering the |
| 150 | cpu itself and any private resources (e.g. private L1 caches) is associated |
| 151 | with the per-cpu groups (lowest level). |
| 152 | |
| 153 | For example, for an SMP system with per-cpu power domains and a cluster level |
| 154 | (group of cpus) power domain we get the overall energy costs to be: |
| 155 | |
| 156 | energy = energy_cluster + n * energy_cpu |
| 157 | |
| 158 | where 'n' is the number of cpus powered up and energy_cluster is the cost paid |
| 159 | as soon as any cpu in the cluster is powered up. |
| 160 | |
| 161 | The power and frequency domains can naturally be mapped onto the existing |
| 162 | sched_domain hierarchy and sched_groups by adding the necessary data to the |
| 163 | existing data structures. |
| 164 | |
| 165 | The energy model considers energy consumption from two contributors (shown in |
| 166 | the illustration below): |
| 167 | |
| 168 | 1. Busy energy: Energy consumed while a cpu and the higher level groups that it |
| 169 | belongs to are busy running tasks. Busy energy is associated with the state of |
| 170 | the cpu, not an event. The time the cpu spends in this state varies. Thus, the |
| 171 | most obvious platform parameter for this contribution is busy power |
| 172 | (energy/time). |
| 173 | |
| 174 | 2. Idle energy: Energy consumed while a cpu and higher level groups that it |
| 175 | belongs to are idle (in a C-state). Like busy energy, idle energy is associated |
| 176 | with the state of the cpu. Thus, the platform parameter for this contribution |
| 177 | is idle power (energy/time). |
| 178 | |
| 179 | Energy consumed during transitions from an idle-state (C-state) to a busy state |
| 180 | (P-state) or going the other way is ignored by the model to simplify the energy |
| 181 | model calculations. |
| 182 | |
| 183 | |
| 184 | Power |
| 185 | ^ |
| 186 | | busy->idle idle->busy |
| 187 | | transition transition |
| 188 | | |
| 189 | | _ __ |
| 190 | | / \ / \__________________ |
| 191 | |______________/ \ / |
| 192 | | \ / |
| 193 | | Busy \ Idle / Busy |
| 194 | | low P-state \____________/ high P-state |
| 195 | | |
| 196 | +------------------------------------------------------------> time |
| 197 | |
| 198 | Busy |--------------| |-----------------| |
| 199 | |
| 200 | Wakeup |------| |------| |
| 201 | |
| 202 | Idle |------------| |
| 203 | |
| 204 | |
| 205 | The basic algorithm |
| 206 | ==================== |
| 207 | |
| 208 | The basic idea is to determine the total energy impact when utilization is |
| 209 | added or removed by estimating the impact at each level in the sched_domain |
| 210 | hierarchy starting from the bottom (sched_group contains just a single cpu). |
| 211 | The energy cost comes from busy time (sched_group is awake because one or more |
| 212 | cpus are busy) and idle time (in an idle-state). Energy model numbers account |
| 213 | for energy costs associated with all cpus in the sched_group as a group. |
| 214 | |
| 215 | for_each_domain(cpu, sd) { |
| 216 | sg = sched_group_of(cpu) |
| 217 | energy_before = curr_util(sg) * busy_power(sg) |
| 218 | + (1-curr_util(sg)) * idle_power(sg) |
| 219 | energy_after = new_util(sg) * busy_power(sg) |
| 220 | + (1-new_util(sg)) * idle_power(sg) |
| 221 | energy_diff += energy_before - energy_after |
| 222 | |
| 223 | } |
| 224 | |
| 225 | return energy_diff |
| 226 | |
| 227 | {curr, new}_util: The cpu utilization at the lowest level and the overall |
| 228 | non-idle time for the entire group for higher levels. Utilization is in the |
| 229 | range 0.0 to 1.0 in the pseudo-code. |
| 230 | |
| 231 | busy_power: The power consumption of the sched_group. |
| 232 | |
| 233 | idle_power: The power consumption of the sched_group when idle. |
| 234 | |
| 235 | Note: It is a fundamental assumption that the utilization is (roughly) scale |
| 236 | invariant. Task utilization tracking factors in any frequency scaling and |
| 237 | performance scaling differences due to difference cpu microarchitectures such |
| 238 | that task utilization can be used across the entire system. |
| 239 | |
| 240 | |
| 241 | Platform energy data |
| 242 | ===================== |
| 243 | |
| 244 | struct sched_group_energy can be attached to sched_groups in the sched_domain |
| 245 | hierarchy and has the following members: |
| 246 | |
| 247 | cap_states: |
| 248 | List of struct capacity_state representing the supported capacity states |
| 249 | (P-states). struct capacity_state has two members: cap and power, which |
| 250 | represents the compute capacity and the busy_power of the state. The |
| 251 | list must be ordered by capacity low->high. |
| 252 | |
| 253 | nr_cap_states: |
| 254 | Number of capacity states in cap_states list. |
| 255 | |
| 256 | idle_states: |
| 257 | List of struct idle_state containing idle_state power cost for each |
| 258 | idle-state supported by the system orderd by shallowest state first. |
| 259 | All states must be included at all level in the hierarchy, i.e. a |
| 260 | sched_group spanning just a single cpu must also include coupled |
| 261 | idle-states (cluster states). In addition to the cpuidle idle-states, |
| 262 | the list must also contain an entry for the idling using the arch |
| 263 | default idle (arch_idle_cpu()). Despite this state may not be a true |
| 264 | hardware idle-state it is considered the shallowest idle-state in the |
| 265 | energy model and must be the first entry. cpus may enter this state |
| 266 | (possibly 'active idling') if cpuidle decides not enter a cpuidle |
| 267 | idle-state. Default idle may not be used when cpuidle is enabled. |
| 268 | In this case, it should just be a copy of the first cpuidle idle-state. |
| 269 | |
| 270 | nr_idle_states: |
| 271 | Number of idle states in idle_states list. |
| 272 | |
| 273 | There are no unit requirements for the energy cost data. Data can be normalized |
| 274 | with any reference, however, the normalization must be consistent across all |
| 275 | energy cost data. That is, one bogo-joule/watt must be the same quantity for |
| 276 | data, but we don't care what it is. |
| 277 | |
| 278 | A recipe for platform characterization |
| 279 | ======================================= |
| 280 | |
| 281 | Obtaining the actual model data for a particular platform requires some way of |
| 282 | measuring power/energy. There isn't a tool to help with this (yet). This |
| 283 | section provides a recipe for use as reference. It covers the steps used to |
| 284 | characterize the ARM TC2 development platform. This sort of measurements is |
| 285 | expected to be done anyway when tuning cpuidle and cpufreq for a given |
| 286 | platform. |
| 287 | |
| 288 | The energy model needs two types of data (struct sched_group_energy holds |
| 289 | these) for each sched_group where energy costs should be taken into account: |
| 290 | |
| 291 | 1. Capacity state information |
| 292 | |
| 293 | A list containing the compute capacity and power consumption when fully |
| 294 | utilized attributed to the group as a whole for each available capacity state. |
| 295 | At the lowest level (group contains just a single cpu) this is the power of the |
| 296 | cpu alone without including power consumed by resources shared with other cpus. |
| 297 | It basically needs to fit the basic modelling approach described in "Background |
| 298 | and Terminology" section: |
| 299 | |
| 300 | energy_system = energy_shared + n * energy_cpu |
| 301 | |
| 302 | for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at |
| 303 | the lowest level. 'energy_shared' is included at the next level which |
| 304 | represents the group of cpus among which the resources are shared. |
| 305 | |
| 306 | This model is, of course, a simplification of reality. Thus, power/energy |
| 307 | attributions might not always exactly represent how the hardware is designed. |
| 308 | Also, busy power is likely to depend on the workload. It is therefore |
| 309 | recommended to use a representative mix of workloads when characterizing the |
| 310 | capacity states. |
| 311 | |
| 312 | If the group has no capacity scaling support, the list will contain a single |
| 313 | state where power is the busy power attributed to the group. The capacity |
| 314 | should be set to a default value (1024). |
| 315 | |
| 316 | When frequency domains include multiple power domains, the group representing |
| 317 | the frequency domain and all child groups share capacity states. This must be |
| 318 | indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at |
| 319 | all levels that share the capacity state must have the list of capacity states |
| 320 | with the power set to the contribution of the individual group. |
| 321 | |
| 322 | 2. Idle power information |
| 323 | |
| 324 | Stored in the idle_states list. The power number is the group idle power |
| 325 | consumption in each idle state as well when the group is idle but has not |
| 326 | entered an idle-state ('active idle' as mentioned earlier). Due to the way the |
| 327 | energy model is defined, the idle power of the deepest group idle state can |
| 328 | alternatively be accounted for in the parent group busy power. In that case the |
| 329 | group idle state power values are offset such that the idle power of the |
| 330 | deepest state is zero. It is less intuitive, but it is easier to measure as |
| 331 | idle power consumed by the group and the busy/idle power of the parent group |
| 332 | cannot be distinguished without per group measurement points. |
| 333 | |
| 334 | Measuring capacity states and idle power: |
| 335 | |
| 336 | The capacity states' capacity and power can be estimated by running a benchmark |
| 337 | workload at each available capacity state. By restricting the benchmark to run |
| 338 | on subsets of cpus it is possible to extrapolate the power consumption of |
| 339 | shared resources. |
| 340 | |
| 341 | ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a |
| 342 | shared L2 cache. TC2 has on-chip energy counters per cluster. Running a |
| 343 | benchmark workload on just one cpu in a cluster means that power is consumed in |
| 344 | the cluster (higher level group) and a single cpu (lowest level group). Adding |
| 345 | another benchmark task to another cpu increases the power consumption by the |
| 346 | amount consumed by the additional cpu. Hence, it is possible to extrapolate the |
| 347 | cluster busy power. |
| 348 | |
| 349 | For platforms that don't have energy counters or equivalent instrumentation |
| 350 | built-in, it may be possible to use an external DAQ to acquire similar data. |
| 351 | |
| 352 | If the benchmark includes some performance score (for example sysbench cpu |
| 353 | benchmark), this can be used to record the compute capacity. |
| 354 | |
| 355 | Measuring idle power requires insight into the idle state implementation on the |
| 356 | particular platform. Specifically, if the platform has coupled idle-states (or |
| 357 | package states). To measure non-coupled per-cpu idle-states it is necessary to |
| 358 | keep one cpu busy to keep any shared resources alive to isolate the idle power |
| 359 | of the cpu from idle/busy power of the shared resources. The cpu can be tricked |
| 360 | into different per-cpu idle states by disabling the other states. Based on |
| 361 | various combinations of measurements with specific cpus busy and disabling |
| 362 | idle-states it is possible to extrapolate the idle-state power. |