           Central, scheduler-driven, power-performance control
                              (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable that is wholly
scheduler centric and has well-defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as scheduler
driven DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
7. References


1. Motivation
=============

Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. The introduction of sched-DVFS enables running workloads at
the most energy efficient OPPs.

However, it is sometimes desirable to intentionally boost the performance of
a workload even if that implies a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the sched-DVFS component is to replace all currently available
CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
driven governors we currently have, it is already more responsive at selecting
the optimal OPP to run tasks allocated to a CPU. However, just tracking the
actual task load demand may not be enough from a performance standpoint. For
example, it is not possible to get behaviors similar to those provided by the
"performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of
sched-DVFS, which extends its functionality to support task performance
boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

A previous attempt [5] to introduce such a boosting feature was not
successful, mainly because of the complexity of the proposed solution. The
approach described in this document exposes a single simple interface to
user-space. This single tunable knob allows the tuning of system wide
scheduler behaviours ranging from energy efficiency at one end through to
incremental performance boosting at the other end. This first tunable affects
all tasks. However, a more advanced extension of the concept is also provided,
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface with a single power-performance
tunable:

  /proc/sys/kernel/sched_cfs_boost

This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be set to suit other scenarios, for example to
satisfy interactive response requirements or to react to other system events
(battery level, etc.).

A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
on the specific nature of the task, e.g. background vs interactive vs
low-priority.

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.
Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the global boost
value, or the boost value of the task's CGroup when in use.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.
The only new concept introduced is that of signal boosting.

3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking signals
which basically track the CPU bandwidth requirements of tasks and the capacity
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
some of these load tracking signals to make a task or RQ appear more demanding
than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between its current value and that maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal in
  question by a quantity which is proportional to its distance from the
  maximum value

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

- 0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting: run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

A graphical representation of an SPC boosted signal is shown in the
following figure where:
 a) "-" represents the original signal
 b) "b" represents a 50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                          boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                       original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (the "original signal") and its
boosted equivalent. For each step of the original signal, the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. its CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to cpu_util(), but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs to
decide the OPP to run a CPU at.
For example, this allows selecting the highest OPP for a CPU which has
the boost value set to 100%.


5. Per task group boosting
==========================

The availability of a single knob which is used to boost all tasks in the
system is certainly a simple solution, but it quite likely does not fit many
utilization scenarios, especially in the mobile device space.

For example, on battery powered devices there are usually many background
services which are long running and need energy efficient scheduling. On the
other hand, some applications are more performance sensitive and require an
interactive response and/or maximum performance, regardless of the energy cost.
To better serve such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled, which allows task
groups with different boost values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in such a group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for specific sets of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups
        could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CGROUP_SCHEDTUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name    hierarchy       num_cgroups     enabled
   cpuset          0               1               1
   cpu             0               1               1
   schedtune       0               1               1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Setup the system-wide boost value (Optional)

   If not configured, the root control group has a 0% boost value, which
   basically disables boosting for all tasks in the system, thus running in
   an energy-efficient mode.

   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost

5. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

6. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


6. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
is boosted with a value which is the maximum of the boost values of the
currently RUNNABLE tasks in its RQ.

This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.


7. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620