           Central, scheduler-driven, power-performance control
                              (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable that is wholly
scheduler centric and has well-defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as scheduler
driven DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
7. References


1. Motivation
=============

Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. The introduction of sched-DVFS enables running workloads at
the most energy efficient OPPs.

However, it is sometimes desirable to intentionally boost the performance of
a workload even if that implies a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the sched-DVFS component is to replace all currently available
CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
driven governors we currently have, it is already more responsive at selecting
the optimal OPP to run tasks allocated to a CPU. However, just tracking the
actual task load demand may not be enough from a performance standpoint. For
example, it is not possible to get behaviors similar to those provided by the
"performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of
sched-DVFS, which extends its functionality to support task performance
boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

A previous attempt [5] to introduce such a boosting feature was not
successful, mainly because of the complexity of the proposed solution. The
approach described in this document exposes a single simple interface to
user-space. This single tunable knob allows the tuning of system wide
scheduler behaviours ranging from energy efficiency at one end through to
incremental performance boosting at the other end. This first tunable affects
all tasks. However, a more advanced extension of the concept is also provided,
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface with a single power-performance
tunable:

  /proc/sys/kernel/sched_cfs_boost

This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be set to suit other scenarios, for example to
satisfy interactive response requirements or to react to other system events
(battery level, etc.).

A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
on the specific nature of the task, e.g. background vs interactive vs
low-priority.

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.
Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the global boost
value, or the boost value of the task's CGroup when in use.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.
The only new concept introduced is that of signal boosting.

3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking signals
which basically track the CPU bandwidth requirements of tasks and the capacity
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
some of these load tracking signals to make a task or RQ appear more demanding
than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between its current value and that maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal in
  question by a quantity which is proportional to its distance from the
  maximum value

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

- 0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting: run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

A graphical representation of an SPC boosted signal is shown in the
following figure where:
 a) "-" represents the original signal
 b) "b" represents a 50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                          boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                       original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (the "original signal") and its
boosted equivalent. For each step of the original signal, the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. its CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to cpu_util(), but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs to
decide the OPP to run a CPU at.
For example, this allows selecting the highest OPP for a CPU which has
the boost value set to 100%.


5. Per task group boosting
==========================

The availability of a single knob which is used to boost all tasks in the
system is certainly a simple solution, but it quite likely does not fit many
utilization scenarios, especially in the mobile device space.

For example, on battery powered devices there are usually many background
services which are long running and need energy efficient scheduling. On the
other hand, some applications are more performance sensitive and require an
interactive response and/or maximum performance, regardless of the energy cost.
To better serve such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled, which allows task
groups with different boost values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in such a group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for specific sets of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups
        could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CGROUP_SCHEDTUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name    hierarchy       num_cgroups     enabled
   cpuset          0               1               1
   cpu             0               1               1
   schedtune       0               1               1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Setup the system-wide boost value (Optional)

   If not configured, the root control group has a 0% boost value, which
   basically disables boosting for all tasks in the system, thus running in
   an energy-efficient mode.

   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost

5. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

6. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


6. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
is boosted with a value which is the maximum of the boost values of the
currently RUNNABLE tasks in its RQ.

This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.


7. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620