blob: dab2f9088b336f4a703933c86b46a2c103a5a20a [file] [log] [blame]
Morten Rasmussen577daca2015-01-13 13:43:28 +00001Energy cost model for energy-aware scheduling (EXPERIMENTAL)
2
3Introduction
4=============
5
6The basic energy model uses platform energy data stored in sched_group_energy
7data structures attached to the sched_groups in the sched_domain hierarchy. The
8energy cost model offers two functions that can be used to guide scheduling
9decisions:
10
111. static unsigned int sched_group_energy(struct energy_env *eenv)
122. static int energy_diff(struct energy_env *eenv)
13
14sched_group_energy() estimates the energy consumed by all cpus in a specific
15sched_group including any shared resources owned exclusively by this group of
16cpus. Resources shared with other cpus are excluded (e.g. later level caches).
17
18energy_diff() estimates the total energy impact of a utilization change. That
19is, adding, removing, or migrating utilization (tasks).
20
21Both functions use a struct energy_env to specify the scenario to be evaluated:
22
23 struct energy_env {
24 struct sched_group *sg_top;
25 struct sched_group *sg_cap;
26 int cap_idx;
27 int util_delta;
28 int src_cpu;
29 int dst_cpu;
30 int energy;
31 };
32
33sg_top: sched_group to be evaluated. Not used by energy_diff().
34
35sg_cap: sched_group covering the cpus in the same frequency domain. Set by
36sched_group_energy().
37
38cap_idx: Capacity state to be used for energy calculations. Set by
39find_new_capacity().
40
41util_delta: Amount of utilization to be added, removed, or migrated.
42
43src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
44-1 if no source (e.g. task wake-up).
45
46dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
47if utilization is removed (e.g. terminating tasks).
48
49energy: Result of sched_group_energy().
50
51The metric used to represent utilization is the actual per-entity running time
52averaged over time using a geometric series. Very similar to the existing
53per-entity load-tracking, but _not_ scaled by task priority and capped by the
54capacity of the cpu. The latter property does mean that utilization may
55underestimate the compute requirements for task on fully/over utilized cpus.
56The greatest potential for energy savings without affecting performance too much
57is scenarios where the system isn't fully utilized. If the system is deemed
58fully utilized load-balancing should be done with task load (includes task
59priority) instead in the interest of fairness and performance.
60
61
62Background and Terminology
63===========================
64
65To make it clear from the start:
66
67energy = [joule] (resource like a battery on powered devices)
68power = energy/time = [joule/second] = [watt]
69
70The goal of energy-aware scheduling is to minimize energy, while still getting
71the job done. That is, we want to maximize:
72
73 performance [inst/s]
74 --------------------
75 power [W]
76
77which is equivalent to minimizing:
78
79 energy [J]
80 -----------
81 instruction
82
83while still getting 'good' performance. It is essentially an alternative
84optimization objective to the current performance-only objective for the
85scheduler. This alternative considers two objectives: energy-efficiency and
86performance. Hence, there needs to be a user controllable knob to switch the
87objective. Since it is early days, this is currently a sched_feature
88(ENERGY_AWARE).
89
90The idea behind introducing an energy cost model is to allow the scheduler to
91evaluate the implications of its decisions rather than applying energy-saving
92techniques blindly that may only have positive effects on some platforms. At
93the same time, the energy cost model must be as simple as possible to minimize
94the scheduler latency impact.
95
96Platform topology
97------------------
98
99The system topology (cpus, caches, and NUMA information, not peripherals) is
100represented in the scheduler by the sched_domain hierarchy which has
101sched_groups attached at each level that covers one or more cpus (see
102sched-domains.txt for more details). To add energy awareness to the scheduler
103we need to consider power and frequency domains.
104
105Power domain:
106
107A power domain is a part of the system that can be powered on/off
108independently. Power domains are typically organized in a hierarchy where you
109may be able to power down just a cpu or a group of cpus along with any
110associated resources (e.g. shared caches). Powering up a cpu means that all
111power domains it is a part of in the hierarchy must be powered up. Hence, it is
112more expensive to power up the first cpu that belongs to a higher level power
113domain than powering up additional cpus in the same high level domain. Two
114level power domain hierarchy example:
115
116 Power source
117 +-------------------------------+----...
118per group PD G G
119 | +----------+ |
120 +--------+-------| Shared | (other groups)
121per-cpu PD G G | resource |
122 | | +----------+
123 +-------+ +-------+
124 | CPU 0 | | CPU 1 |
125 +-------+ +-------+
126
127Frequency domain:
128
129Frequency domains (P-states) typically cover the same group of cpus as one of
130the power domain levels. That is, there might be several smaller power domains
131sharing the same frequency (P-state) or there might be a power domain spanning
132multiple frequency domains.
133
134From a scheduling point of view there is no need to know the actual frequencies
135[Hz]. All the scheduler cares about is the compute capacity available at the
136current state (P-state) the cpu is in and any other available states. For that
137reason, and to also factor in any cpu micro-architecture differences, compute
138capacity scaling states are called 'capacity states' in this document. For SMP
139systems this is equivalent to P-states. For mixed micro-architecture systems
140(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
141performance relative to the other cpus in the system.
142
143Energy modelling:
144------------------
145
146Due to the hierarchical nature of the power domains, the most obvious way to
147model energy costs is therefore to associate power and energy costs with
148domains (groups of cpus). Energy costs of shared resources are associated with
149the group of cpus that share the resources, only the cost of powering the
150cpu itself and any private resources (e.g. private L1 caches) is associated
151with the per-cpu groups (lowest level).
152
153For example, for an SMP system with per-cpu power domains and a cluster level
154(group of cpus) power domain we get the overall energy costs to be:
155
156 energy = energy_cluster + n * energy_cpu
157
158where 'n' is the number of cpus powered up and energy_cluster is the cost paid
159as soon as any cpu in the cluster is powered up.
160
161The power and frequency domains can naturally be mapped onto the existing
162sched_domain hierarchy and sched_groups by adding the necessary data to the
163existing data structures.
164
165The energy model considers energy consumption from two contributors (shown in
166the illustration below):
167
1681. Busy energy: Energy consumed while a cpu and the higher level groups that it
169belongs to are busy running tasks. Busy energy is associated with the state of
170the cpu, not an event. The time the cpu spends in this state varies. Thus, the
171most obvious platform parameter for this contribution is busy power
172(energy/time).
173
1742. Idle energy: Energy consumed while a cpu and higher level groups that it
175belongs to are idle (in a C-state). Like busy energy, idle energy is associated
176with the state of the cpu. Thus, the platform parameter for this contribution
177is idle power (energy/time).
178
179Energy consumed during transitions from an idle-state (C-state) to a busy state
180(P-state) or going the other way is ignored by the model to simplify the energy
181model calculations.
182
183
184 Power
185 ^
186 | busy->idle idle->busy
187 | transition transition
188 |
189 | _ __
190 | / \ / \__________________
191 |______________/ \ /
192 | \ /
193 | Busy \ Idle / Busy
194 | low P-state \____________/ high P-state
195 |
196 +------------------------------------------------------------> time
197
198Busy |--------------| |-----------------|
199
200Wakeup |------| |------|
201
202Idle |------------|
203
204
205The basic algorithm
206====================
207
208The basic idea is to determine the total energy impact when utilization is
209added or removed by estimating the impact at each level in the sched_domain
210hierarchy starting from the bottom (sched_group contains just a single cpu).
211The energy cost comes from busy time (sched_group is awake because one or more
212cpus are busy) and idle time (in an idle-state). Energy model numbers account
213for energy costs associated with all cpus in the sched_group as a group.
214
215 for_each_domain(cpu, sd) {
216 sg = sched_group_of(cpu)
217 energy_before = curr_util(sg) * busy_power(sg)
218 + (1-curr_util(sg)) * idle_power(sg)
219 energy_after = new_util(sg) * busy_power(sg)
220 + (1-new_util(sg)) * idle_power(sg)
221 energy_diff += energy_before - energy_after
222
223 }
224
225 return energy_diff
226
227{curr, new}_util: The cpu utilization at the lowest level and the overall
228non-idle time for the entire group for higher levels. Utilization is in the
229range 0.0 to 1.0 in the pseudo-code.
230
231busy_power: The power consumption of the sched_group.
232
233idle_power: The power consumption of the sched_group when idle.
234
235Note: It is a fundamental assumption that the utilization is (roughly) scale
236invariant. Task utilization tracking factors in any frequency scaling and
237performance scaling differences due to difference cpu microarchitectures such
238that task utilization can be used across the entire system.
239
240
241Platform energy data
242=====================
243
244struct sched_group_energy can be attached to sched_groups in the sched_domain
245hierarchy and has the following members:
246
247cap_states:
248 List of struct capacity_state representing the supported capacity states
249 (P-states). struct capacity_state has two members: cap and power, which
250 represents the compute capacity and the busy_power of the state. The
251 list must be ordered by capacity low->high.
252
253nr_cap_states:
254 Number of capacity states in cap_states list.
255
256idle_states:
257 List of struct idle_state containing idle_state power cost for each
258 idle-state supported by the system orderd by shallowest state first.
259 All states must be included at all level in the hierarchy, i.e. a
260 sched_group spanning just a single cpu must also include coupled
261 idle-states (cluster states). In addition to the cpuidle idle-states,
262 the list must also contain an entry for the idling using the arch
263 default idle (arch_idle_cpu()). Despite this state may not be a true
264 hardware idle-state it is considered the shallowest idle-state in the
265 energy model and must be the first entry. cpus may enter this state
266 (possibly 'active idling') if cpuidle decides not enter a cpuidle
267 idle-state. Default idle may not be used when cpuidle is enabled.
268 In this case, it should just be a copy of the first cpuidle idle-state.
269
270nr_idle_states:
271 Number of idle states in idle_states list.
272
273There are no unit requirements for the energy cost data. Data can be normalized
274with any reference, however, the normalization must be consistent across all
275energy cost data. That is, one bogo-joule/watt must be the same quantity for
276data, but we don't care what it is.
277
278A recipe for platform characterization
279=======================================
280
281Obtaining the actual model data for a particular platform requires some way of
282measuring power/energy. There isn't a tool to help with this (yet). This
283section provides a recipe for use as reference. It covers the steps used to
284characterize the ARM TC2 development platform. This sort of measurements is
285expected to be done anyway when tuning cpuidle and cpufreq for a given
286platform.
287
288The energy model needs two types of data (struct sched_group_energy holds
289these) for each sched_group where energy costs should be taken into account:
290
2911. Capacity state information
292
293A list containing the compute capacity and power consumption when fully
294utilized attributed to the group as a whole for each available capacity state.
295At the lowest level (group contains just a single cpu) this is the power of the
296cpu alone without including power consumed by resources shared with other cpus.
297It basically needs to fit the basic modelling approach described in "Background
298and Terminology" section:
299
300 energy_system = energy_shared + n * energy_cpu
301
302for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
303the lowest level. 'energy_shared' is included at the next level which
304represents the group of cpus among which the resources are shared.
305
306This model is, of course, a simplification of reality. Thus, power/energy
307attributions might not always exactly represent how the hardware is designed.
308Also, busy power is likely to depend on the workload. It is therefore
309recommended to use a representative mix of workloads when characterizing the
310capacity states.
311
312If the group has no capacity scaling support, the list will contain a single
313state where power is the busy power attributed to the group. The capacity
314should be set to a default value (1024).
315
316When frequency domains include multiple power domains, the group representing
317the frequency domain and all child groups share capacity states. This must be
318indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
319all levels that share the capacity state must have the list of capacity states
320with the power set to the contribution of the individual group.
321
3222. Idle power information
323
324Stored in the idle_states list. The power number is the group idle power
325consumption in each idle state as well when the group is idle but has not
326entered an idle-state ('active idle' as mentioned earlier). Due to the way the
327energy model is defined, the idle power of the deepest group idle state can
328alternatively be accounted for in the parent group busy power. In that case the
329group idle state power values are offset such that the idle power of the
330deepest state is zero. It is less intuitive, but it is easier to measure as
331idle power consumed by the group and the busy/idle power of the parent group
332cannot be distinguished without per group measurement points.
333
334Measuring capacity states and idle power:
335
336The capacity states' capacity and power can be estimated by running a benchmark
337workload at each available capacity state. By restricting the benchmark to run
338on subsets of cpus it is possible to extrapolate the power consumption of
339shared resources.
340
341ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
342shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
343benchmark workload on just one cpu in a cluster means that power is consumed in
344the cluster (higher level group) and a single cpu (lowest level group). Adding
345another benchmark task to another cpu increases the power consumption by the
346amount consumed by the additional cpu. Hence, it is possible to extrapolate the
347cluster busy power.
348
349For platforms that don't have energy counters or equivalent instrumentation
350built-in, it may be possible to use an external DAQ to acquire similar data.
351
352If the benchmark includes some performance score (for example sysbench cpu
353benchmark), this can be used to record the compute capacity.
354
355Measuring idle power requires insight into the idle state implementation on the
356particular platform. Specifically, if the platform has coupled idle-states (or
357package states). To measure non-coupled per-cpu idle-states it is necessary to
358keep one cpu busy to keep any shared resources alive to isolate the idle power
359of the cpu from idle/busy power of the shared resources. The cpu can be tricked
360into different per-cpu idle states by disabling the other states. Based on
361various combinations of measurements with specific cpus busy and disabling
362idle-states it is possible to extrapolate the idle-state power.