Blame - Documentation/scheduler/sched-energy.txt - kernel/msm-4.9

Morten Rasmussen	577daca	2015-01-13 13:43:28 +0000	1	Energy cost model for energy-aware scheduling (EXPERIMENTAL)
				2
				3	Introduction
				4	=============
				5
				6	The basic energy model uses platform energy data stored in sched_group_energy
				7	data structures attached to the sched_groups in the sched_domain hierarchy. The
				8	energy cost model offers two functions that can be used to guide scheduling
				9	decisions:
				10
				11	1. static unsigned int sched_group_energy(struct energy_env *eenv)
				12	2. static int energy_diff(struct energy_env *eenv)
				13
				14	sched_group_energy() estimates the energy consumed by all cpus in a specific
				15	sched_group including any shared resources owned exclusively by this group of
				16	cpus. Resources shared with other cpus are excluded (e.g. higher level caches).
				17
				18	energy_diff() estimates the total energy impact of a utilization change. That
				19	is, adding, removing, or migrating utilization (tasks).
				20
				21	Both functions use a struct energy_env to specify the scenario to be evaluated:
				22
				23	struct energy_env {
				24		struct sched_group *sg_top;
				25		struct sched_group *sg_cap;
				26		int cap_idx;
				27		int util_delta;
				28		int src_cpu;
				29		int dst_cpu;
				30		int energy;
				31	};
				32
				33	sg_top: sched_group to be evaluated. Not used by energy_diff().
				34
				35	sg_cap: sched_group covering the cpus in the same frequency domain. Set by
				36	sched_group_energy().
				37
				38	cap_idx: Capacity state to be used for energy calculations. Set by
				39	find_new_capacity().
				40
				41	util_delta: Amount of utilization to be added, removed, or migrated.
				42
				43	src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
				44	-1 if no source (e.g. task wake-up).
				45
				46	dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
				47	if utilization is removed (e.g. terminating tasks).
				48
				49	energy: Result of sched_group_energy().
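
As an illustration of how src_cpu, dst_cpu and util_delta describe a scenario,
the sketch below fills in a struct energy_env. make_energy_env() is a
hypothetical helper, not a kernel function, and the utilization value in the
usage note is made up. The remaining members are left zeroed for simplicity.

```c
#include <assert.h>

struct sched_group;	/* opaque in this sketch */

/* Same members as the struct energy_env shown above. */
struct energy_env {
	struct sched_group *sg_top;
	struct sched_group *sg_cap;
	int cap_idx;
	int util_delta;
	int src_cpu;
	int dst_cpu;
	int energy;
};

/*
 * Hypothetical helper: describe 'util_delta' moving from src_cpu to dst_cpu.
 * Pass src_cpu = -1 for a wake-up and dst_cpu = -1 for a terminating task.
 */
struct energy_env make_energy_env(int util_delta, int src_cpu, int dst_cpu)
{
	struct energy_env eenv = {
		.util_delta = util_delta,
		.src_cpu = src_cpu,
		.dst_cpu = dst_cpu,
	};

	/* At least one side of the change must name a real cpu. */
	assert(src_cpu >= 0 || dst_cpu >= 0);
	return eenv;
}
```

With this helper, make_energy_env(200, -1, 2) would describe a task with
utilization 200 waking up on cpu 2, make_energy_env(200, 1, 3) a migration from
cpu 1 to cpu 3, and make_energy_env(200, 1, -1) the same task terminating on
cpu 1.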
				50
				51	The metric used to represent utilization is the actual per-entity running time
				52	averaged over time using a geometric series. It is very similar to the existing
				53	per-entity load-tracking, but _not_ scaled by task priority, and it is capped by
				54	the capacity of the cpu. The latter property does mean that utilization may
				55	underestimate the compute requirements of tasks on fully/over utilized cpus.
				56	The greatest potential for energy savings without affecting performance too much
				57	is in scenarios where the system isn't fully utilized. If the system is deemed
				58	fully utilized, load-balancing should be done with task load (which includes
				59	task priority) instead, in the interest of fairness and performance.
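
A minimal sketch of such a metric, assuming a fixed tracking period and a decay
factor of 1/2 per period, could look as follows. The decay factor, UTIL_SCALE
and both function names are assumptions chosen for readability; they are not
the values or helpers used by the kernel's load-tracking code.

```c
#include <assert.h>

#define UTIL_SCALE 1024UL	/* assumed value for "running all the time" */

/*
 * One period of a geometric-series average: the history is halved and the
 * most recent period contributes half of the scale if the entity was running.
 * Task priority does not enter the calculation.
 */
unsigned long util_step(unsigned long util, int was_running)
{
	assert(util <= UTIL_SCALE);
	return util / 2 + (was_running ? UTIL_SCALE / 2 : 0);
}

/* Utilization is capped by the capacity of the cpu. */
unsigned long cap_util(unsigned long util, unsigned long cpu_capacity)
{
	return util < cpu_capacity ? util : cpu_capacity;
}
```

Starting from zero, three busy periods give 512, 768 and 896, and one idle
period after that halves the value again. On a cpu with capacity 800,
cap_util() would report 800 for a task tracked at 896, which is the
underestimation mentioned above.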
				60
				61
				62	Background and Terminology
				63	===========================
				64
				65	To make it clear from the start:
				66
				67	energy = [joule] (resource like a battery on powered devices)
				68	power = energy/time = [joule/second] = [watt]
				69
				70	The goal of energy-aware scheduling is to minimize energy, while still getting
				71	the job done. That is, we want to maximize:
				72
				73	performance [inst/s]
				74	--------------------
				75	power [W]
				76
				77	which is equivalent to minimizing:
				78
				79	energy [J]
				80	-----------
				81	instruction
				82
				83	while still getting 'good' performance. It is essentially an alternative
				84	optimization objective to the current performance-only objective for the
				85	scheduler. This alternative considers two objectives: energy-efficiency and
				86	performance. Hence, there needs to be a user controllable knob to switch the
				87	objective. Since it is early days, this is currently a sched_feature
				88	(ENERGY_AWARE).
				89
				90	The idea behind introducing an energy cost model is to allow the scheduler to
				91	evaluate the implications of its decisions rather than applying energy-saving
				92	techniques blindly that may only have positive effects on some platforms. At
				93	the same time, the energy cost model must be as simple as possible to minimize
				94	the scheduler latency impact.
				95
				96	Platform topology
				97	------------------
				98
				99	The system topology (cpus, caches, and NUMA information, not peripherals) is
				100	represented in the scheduler by the sched_domain hierarchy which has
				101	sched_groups attached at each level that covers one or more cpus (see
				102	sched-domains.txt for more details). To add energy awareness to the scheduler
				103	we need to consider power and frequency domains.
				104
				105	Power domain:
				106
				107	A power domain is a part of the system that can be powered on/off
				108	independently. Power domains are typically organized in a hierarchy where you
				109	may be able to power down just a cpu or a group of cpus along with any
				110	associated resources (e.g. shared caches). Powering up a cpu means that all
				111	power domains it is a part of in the hierarchy must be powered up. Hence, it is
				112	more expensive to power up the first cpu that belongs to a higher level power
				113	domain than powering up additional cpus in the same high level domain. Two
				114	level power domain hierarchy example:
				115
				116	                Power source
				117	                         +-------------------------------+----...
				118	per group PD             G                               G
				119	                         |           +----------+        |
				120	                    +--------+-------| Shared   |  (other groups)
				121	per-cpu PD          G        G       | resource |
				122	                    |        |       +----------+
				123	                +-------+ +-------+
				124	                | CPU 0 | | CPU 1 |
				125	                +-------+ +-------+
				126
				127	Frequency domain:
				128
				129	Frequency domains (P-states) typically cover the same group of cpus as one of
				130	the power domain levels. That is, there might be several smaller power domains
				131	sharing the same frequency (P-state) or there might be a power domain spanning
				132	multiple frequency domains.
				133
				134	From a scheduling point of view there is no need to know the actual frequencies
				135	[Hz]. All the scheduler cares about is the compute capacity available at the
				136	current state (P-state) the cpu is in and any other available states. For that
				137	reason, and to also factor in any cpu micro-architecture differences, compute
				138	capacity scaling states are called 'capacity states' in this document. For SMP
				139	systems this is equivalent to P-states. For mixed micro-architecture systems
				140	(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
				141	performance relative to the other cpus in the system.
				142
				143	Energy modelling:
				144	------------------
				145
				146	Due to the hierarchical nature of the power domains, the most obvious way to
				147	model energy costs is to associate power and energy costs with domains (groups
				148	of cpus). Energy costs of shared resources are associated with the group of
				149	cpus that share the resources; only the cost of powering the cpu itself and
				150	any private resources (e.g. private L1 caches) is associated with the per-cpu
				151	groups (lowest level).
				152
				153	For example, for an SMP system with per-cpu power domains and a cluster level
				154	(group of cpus) power domain we get the overall energy costs to be:
				155
				156	energy = energy_cluster + n * energy_cpu
				157
				158	where 'n' is the number of cpus powered up and energy_cluster is the cost paid
				159	as soon as any cpu in the cluster is powered up.
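
The formula can be sketched in C as below. The function name, the nr_cpus
parameter and the numbers in the usage note are made up for illustration. The
case n == 0 returning zero is an assumption that follows from energy_cluster
only being paid once a cpu in the cluster is powered up.

```c
#include <assert.h>

/*
 * Energy cost of an SMP cluster with 'n' of its 'nr_cpus' cpus powered up.
 * All values are in arbitrary but consistent units.
 */
unsigned long smp_cluster_energy(unsigned long energy_cluster,
				 unsigned long energy_cpu,
				 unsigned int n, unsigned int nr_cpus)
{
	assert(n <= nr_cpus);

	if (n == 0)
		return 0;	/* cluster power domain stays off */

	return energy_cluster + n * energy_cpu;
}
```

With energy_cluster = 100 and energy_cpu = 30, powering up the first cpu costs
130 while the second cpu only adds another 30, which is why the first cpu in a
higher level power domain is the expensive one.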
				160
				161	The power and frequency domains can naturally be mapped onto the existing
				162	sched_domain hierarchy and sched_groups by adding the necessary data to the
				163	existing data structures.
				164
				165	The energy model considers energy consumption from two contributors (shown in
				166	the illustration below):
				167
				168	1. Busy energy: Energy consumed while a cpu and the higher level groups that it
				169	belongs to are busy running tasks. Busy energy is associated with the state of
				170	the cpu, not an event. The time the cpu spends in this state varies. Thus, the
				171	most obvious platform parameter for this contribution is busy power
				172	(energy/time).
				173
				174	2. Idle energy: Energy consumed while a cpu and higher level groups that it
				175	belongs to are idle (in a C-state). Like busy energy, idle energy is associated
				176	with the state of the cpu. Thus, the platform parameter for this contribution
				177	is idle power (energy/time).
				178
				179	Energy consumed during transitions from an idle-state (C-state) to a busy state
				180	(P-state) or going the other way is ignored by the model to simplify the energy
				181	model calculations.
				182
				183
				184	      Power
				185	      ^
				186	      |            busy->idle             idle->busy
				187	      |            transition             transition
				188	      |
				189	      |                _                      __
				190	      |               / \                    /  \__________________
				191	      |______________/   \                  /
				192	      |                   \                /
				193	      |  Busy              \    Idle      /        Busy
				194	      |  low P-state        \____________/         high P-state
				195	      |
				196	      +------------------------------------------------------------> time
				197
				198	Busy  |--------------|                          |-----------------|
				199
				200	Wakeup                 |------|        |------|
				201
				202	Idle                          |------------|
				203
				204
				205	The basic algorithm
				206	====================
				207
				208	The basic idea is to determine the total energy impact when utilization is
				209	added or removed by estimating the impact at each level in the sched_domain
				210	hierarchy starting from the bottom (sched_group contains just a single cpu).
				211	The energy cost comes from busy time (sched_group is awake because one or more
				212	cpus are busy) and idle time (in an idle-state). Energy model numbers account
				213	for energy costs associated with all cpus in the sched_group as a group.
				214
				215	for_each_domain(cpu, sd) {
				216		sg = sched_group_of(cpu)
				217		energy_before = curr_util(sg) * busy_power(sg)
				218				+ (1-curr_util(sg)) * idle_power(sg)
				219		energy_after = new_util(sg) * busy_power(sg)
				220				+ (1-new_util(sg)) * idle_power(sg)
				221		energy_diff += energy_before - energy_after
				222
				223	}
				224
				225	return energy_diff
				226
				227	{curr, new}_util: The cpu utilization at the lowest level and the overall
				228	non-idle time for the entire group for higher levels. Utilization is in the
				229	range 0.0 to 1.0 in the pseudo-code.
				230
				231	busy_power: The power consumption of the sched_group.
				232
				233	idle_power: The power consumption of the sched_group when idle.
				234
				235	Note: It is a fundamental assumption that the utilization is (roughly) scale
				236	invariant. Task utilization tracking factors in any frequency scaling and
				237	performance scaling differences due to different cpu microarchitectures such
				238	that task utilization can be used across the entire system.
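
The pseudo-code can be turned into a small stand-alone C sketch. It is not the
kernel implementation: struct group_level, energy_diff_sketch() and the example
numbers are hypothetical, utilization is expressed as 0..1024 instead of
0.0..1.0 to stay in integer arithmetic, and the result is therefore scaled by
1024. The sign convention (before minus after) is kept from the pseudo-code.

```c
#include <assert.h>
#include <stddef.h>

#define UTIL_SCALE 1024	/* corresponds to utilization 1.0 in the pseudo-code */

/* One sched_group on the path from the cpu up through the hierarchy. */
struct group_level {
	long curr_util;		/* 0..UTIL_SCALE, before the change */
	long new_util;		/* 0..UTIL_SCALE, after the change */
	long busy_power;
	long idle_power;
};

/* util * busy_power + (1 - util) * idle_power, scaled by UTIL_SCALE. */
long group_energy(long util, long busy_power, long idle_power)
{
	assert(util >= 0 && util <= UTIL_SCALE);
	return util * busy_power + (UTIL_SCALE - util) * idle_power;
}

/* Sum energy_before - energy_after over all levels, bottom to top. */
long energy_diff_sketch(const struct group_level *levels, size_t nr_levels)
{
	long diff = 0;

	for (size_t i = 0; i < nr_levels; i++) {
		const struct group_level *l = &levels[i];

		diff += group_energy(l->curr_util, l->busy_power, l->idle_power)
		      - group_energy(l->new_util, l->busy_power, l->idle_power);
	}
	return diff;
}

/* Made-up example: an idle cpu becomes 50% busy, its cluster goes 50% -> 75%. */
const struct group_level example_levels[] = {
	{ .curr_util = 0,   .new_util = 512, .busy_power = 100, .idle_power = 10 },
	{ .curr_util = 512, .new_util = 768, .busy_power = 50,  .idle_power = 5 },
};
```

For example_levels the cpu level contributes -46080 and the cluster level
-11520, so energy_diff_sketch(example_levels, 2) returns -57600. With this
sign convention a negative value means the change increases energy
consumption.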
				239
				240
				241	Platform energy data
				242	=====================
				243
				244	struct sched_group_energy can be attached to sched_groups in the sched_domain
				245	hierarchy and has the following members:
				246
				247	cap_states:
				248	List of struct capacity_state representing the supported capacity states
				249	(P-states). struct capacity_state has two members: cap and power, which
				250	represent the compute capacity and the busy_power of the state. The
				251	list must be ordered by capacity low->high.
				252
				253	nr_cap_states:
				254	Number of capacity states in cap_states list.
				255
				256	idle_states:
				257	List of struct idle_state containing idle_state power cost for each
				258	idle-state supported by the system, ordered by shallowest state first.
				259	All states must be included at all levels in the hierarchy, i.e. a
				260	sched_group spanning just a single cpu must also include coupled
				261	idle-states (cluster states). In addition to the cpuidle idle-states,
				262	the list must also contain an entry for idling using the arch
				263	default idle (arch_idle_cpu()). Although this state may not be a true
				264	hardware idle-state, it is considered the shallowest idle-state in the
				265	energy model and must be the first entry. cpus may enter this state
				266	(possibly 'active idling') if cpuidle decides not to enter a cpuidle
				267	idle-state. Default idle may not be used when cpuidle is enabled.
				268	In this case, it should just be a copy of the first cpuidle idle-state.
				269
				270	nr_idle_states:
				271	Number of idle states in idle_states list.
				272
				273	There are no unit requirements for the energy cost data. Data can be normalized
				274	with any reference, however, the normalization must be consistent across all
				275	energy cost data. That is, one bogo-joule/watt must be the same quantity for
				276	all data, but we don't care what it is.
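
A sketch of how such data could be laid out for a per-cpu sched_group is shown
below. Only the member names come from the description above; the member
types, the example_* tables, their numbers and cap_states_ordered() are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* busy power at this capacity */
};

struct idle_state {
	unsigned long power;	/* idle power in this state */
};

struct sched_group_energy {
	unsigned int nr_idle_states;
	struct idle_state *idle_states;
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;
};

/* Made-up numbers in arbitrary but consistent units, ordered low->high. */
struct capacity_state example_cap_states[] = {
	{ .cap = 512,  .power = 200 },
	{ .cap = 768,  .power = 400 },
	{ .cap = 1024, .power = 700 },
};

/* Shallowest state first; the first entry stands for arch default idle. */
struct idle_state example_idle_states[] = {
	{ .power = 20 },	/* default idle: copy of first cpuidle state */
	{ .power = 20 },	/* first cpuidle idle-state */
	{ .power = 0 },		/* coupled (cluster) idle-state */
};

struct sched_group_energy example_cpu_energy = {
	.nr_idle_states = 3,
	.idle_states = example_idle_states,
	.nr_cap_states = 3,
	.cap_states = example_cap_states,
};

/* Returns 1 if cap_states is ordered by capacity low->high, 0 otherwise. */
int cap_states_ordered(const struct sched_group_energy *sge)
{
	assert(sge != NULL && sge->nr_cap_states > 0);

	for (unsigned int i = 1; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i - 1].cap > sge->cap_states[i].cap)
			return 0;
	return 1;
}
```

A check like cap_states_ordered(&example_cpu_energy) is a cheap way to catch
tables that violate the ordering requirement before they are used.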
				277
				278	A recipe for platform characterization
				279	=======================================
				280
				281	Obtaining the actual model data for a particular platform requires some way of
				282	measuring power/energy. There isn't a tool to help with this (yet). This
				283	section provides a recipe for use as reference. It covers the steps used to
				284	characterize the ARM TC2 development platform. This sort of measurement is
				285	expected to be done anyway when tuning cpuidle and cpufreq for a given
				286	platform.
				287
				288	The energy model needs two types of data (struct sched_group_energy holds
				289	these) for each sched_group where energy costs should be taken into account:
				290
				291	1. Capacity state information
				292
				293	A list containing the compute capacity and power consumption when fully
				294	utilized attributed to the group as a whole for each available capacity state.
				295	At the lowest level (group contains just a single cpu) this is the power of the
				296	cpu alone without including power consumed by resources shared with other cpus.
				297	It basically needs to fit the basic modelling approach described in the
				298	"Background and Terminology" section:
				299
				300	energy_system = energy_shared + n * energy_cpu
				301
				302	for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
				303	the lowest level. 'energy_shared' is included at the next level which
				304	represents the group of cpus among which the resources are shared.
				305
				306	This model is, of course, a simplification of reality. Thus, power/energy
				307	attributions might not always exactly represent how the hardware is designed.
				308	Also, busy power is likely to depend on the workload. It is therefore
				309	recommended to use a representative mix of workloads when characterizing the
				310	capacity states.
				311
				312	If the group has no capacity scaling support, the list will contain a single
				313	state where power is the busy power attributed to the group. The capacity
				314	should be set to a default value (1024).
				315
				316	When frequency domains include multiple power domains, the group representing
				317	the frequency domain and all child groups share capacity states. This must be
				318	indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
				319	all levels that share the capacity state must have the list of capacity states
				320	with the power set to the contribution of the individual group.
				321
				322	2. Idle power information
				323
				324	Stored in the idle_states list. The power number is the group idle power
				325	consumption in each idle state as well as when the group is idle but has not
				326	entered an idle-state ('active idle' as mentioned earlier). Due to the way the
				327	energy model is defined, the idle power of the deepest group idle state can
				328	alternatively be accounted for in the parent group busy power. In that case the
				329	group idle state power values are offset such that the idle power of the
				330	deepest state is zero. It is less intuitive, but it is easier to measure as
				331	idle power consumed by the group and the busy/idle power of the parent group
				332	cannot be distinguished without per group measurement points.
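
The offsetting step could be sketched as follows. offset_idle_power() and the
example measurements are hypothetical. The sketch assumes the list is ordered
shallowest state first, so the deepest state is the last entry, and that no
state consumes less than the deepest one.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Shift all idle-state powers of a group so that the deepest state costs
 * zero. Returns the removed offset, i.e. the amount that has to be accounted
 * for in the parent group instead.
 */
unsigned long offset_idle_power(unsigned long *idle_power, size_t nr_states)
{
	unsigned long deepest;

	assert(idle_power != NULL && nr_states > 0);
	deepest = idle_power[nr_states - 1];

	for (size_t i = 0; i < nr_states; i++) {
		assert(idle_power[i] >= deepest);
		idle_power[i] -= deepest;
	}
	return deepest;
}

/* Made-up measurements: active idle, first cpuidle state, deepest state. */
unsigned long example_idle_power[] = { 25, 25, 10 };
```

Calling offset_idle_power(example_idle_power, 3) returns 10 and leaves the
table as { 15, 15, 0 }.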
				333
				334	Measuring capacity states and idle power:
				335
				336	The capacity states' capacity and power can be estimated by running a benchmark
				337	workload at each available capacity state. By restricting the benchmark to run
				338	on subsets of cpus it is possible to extrapolate the power consumption of
				339	shared resources.
				340
				341	ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
				342	shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
				343	benchmark workload on just one cpu in a cluster means that power is consumed in
				344	the cluster (higher level group) and a single cpu (lowest level group). Adding
				345	another benchmark task to another cpu increases the power consumption by the
				346	amount consumed by the additional cpu. Hence, it is possible to extrapolate the
				347	cluster busy power.
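
The extrapolation amounts to solving two equations of the form
power = cluster + n * cpu. A minimal sketch, assuming both measurements are
taken at the same capacity state on identical cpus of one cluster, is given
below; split_cluster_power() and the numbers in the usage note are made up.

```c
#include <assert.h>

struct cluster_power {
	long cpu;	/* busy power of one cpu */
	long cluster;	/* busy power of the shared cluster resources */
};

/*
 * power_one_cpu:  cluster power with the benchmark on one cpu
 * power_two_cpus: cluster power with the benchmark on two cpus
 */
struct cluster_power split_cluster_power(long power_one_cpu,
					 long power_two_cpus)
{
	struct cluster_power p;

	assert(power_two_cpus >= power_one_cpu);

	/* The second cpu adds exactly one cpu's worth of power. */
	p.cpu = power_two_cpus - power_one_cpu;
	/* What remains of the one-cpu measurement belongs to the cluster. */
	p.cluster = power_one_cpu - p.cpu;
	return p;
}
```

If the energy counters report 130 with one busy cpu and 160 with two, this
yields a per-cpu busy power of 30 and a cluster busy power of 100.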
				348
				349	For platforms that don't have energy counters or equivalent instrumentation
				350	built-in, it may be possible to use an external DAQ to acquire similar data.
				351
				352	If the benchmark includes some performance score (for example sysbench cpu
				353	benchmark), this can be used to record the compute capacity.
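
One possible way to turn such scores into capacity values is to normalize
them against the best score measured on the platform. The sketch below assumes
that the highest score maps to 1024, matching the default capacity mentioned
earlier; capacity_from_score() and CAPACITY_SCALE are hypothetical names.

```c
#include <assert.h>

#define CAPACITY_SCALE 1024UL	/* assumed capacity of the fastest state */

/*
 * score:     benchmark score at the capacity state being characterized
 * max_score: best score of any cpu at any capacity state on the platform
 */
unsigned long capacity_from_score(unsigned long score, unsigned long max_score)
{
	assert(max_score > 0 && score <= max_score);
	return CAPACITY_SCALE * score / max_score;
}
```

A state scoring 500 on a platform whose best score is 1000 would get a
capacity of 512.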
				354
				355	Measuring idle power requires insight into the idle state implementation on the
				356	particular platform, specifically whether the platform has coupled idle-states
				357	(or package states). To measure non-coupled per-cpu idle-states it is necessary to
				358	keep one cpu busy to keep any shared resources alive to isolate the idle power
				359	of the cpu from idle/busy power of the shared resources. The cpu can be tricked
				360	into different per-cpu idle states by disabling the other states. Based on
				361	various combinations of measurements with specific cpus busy and disabling
				362	idle-states it is possible to extrapolate the idle-state power.