Blame - Documentation/scheduler/sched-hmp.txt - kernel/msm-4.9

blob: f485dc85c03e7b6a3a87514971fe454ebbccd3e4 [file] [log] [blame]

Syed Rameez Mustafa	dddcab7	2016-09-07 16:18:27 -0700	[diff] [blame]	1	CONTENTS
				2
				3	1. Introduction
				4	1.1 Heterogeneous Systems
				5	1.2 CPU Frequency Guidance
				6	2. Window-Based Load Tracking Scheme
				7	2.1 Synchronized Windows
				8	2.2 struct ravg
				9	2.3 Scaling Load Statistics
				10	2.4 sched_window_stats_policy
				11	2.5 Task Events
				12	2.6 update_task_ravg()
				13	2.7 update_history()
				14	2.8 Per-task 'initial task load'
				15	3. CPU Capacity
				16	3.1 Load scale factor
				17	3.2 CPU Power
				18	4. CPU Power
				19	5. HMP Scheduler
				20	5.1 Classification of Tasks and CPUs
				21	5.2 select_best_cpu()
				22	5.2.1 sched_boost
				23	5.2.2 task_will_fit()
				24	5.2.3 Tunables affecting select_best_cpu()
				25	5.2.4 Wakeup Logic
				26	5.3 Scheduler Tick
				27	5.4 Load Balancer
				28	5.5 Real Time Tasks
				29	5.6 Task packing
				30	6. Frequency Guidance
				31	6.1 Per-CPU Window-Based Stats
				32	6.2 Per-task Window-Based Stats
				33	6.3 Effect of various task events
				34	7. Tunables
				35	8. HMP Scheduler Trace Points
				36	8.1 sched_enq_deq_task
				37	8.2 sched_task_load
				38	8.3 sched_cpu_load_*
				39	8.4 sched_update_task_ravg
				40	8.5 sched_update_history
				41	8.6 sched_reset_all_windows_stats
				42	8.7 sched_migration_update_sum
				43	8.8 sched_get_busy
				44	8.9 sched_freq_alert
				45	8.10 sched_set_boost
Pavankumar Kondeti	8de9ac6	2016-10-01 11:06:54 +0530	[diff] [blame]	46	9. Device Tree bindings
Syed Rameez Mustafa	dddcab7	2016-09-07 16:18:27 -0700	[diff] [blame]	47
				48	===============
				49	1. INTRODUCTION
				50	===============
				51
				52	Scheduler extensions described in this document serves two goals:
				53
				54	1) handle heterogeneous multi-processor (HMP) systems
				55	2) guide cpufreq governor on proactive changes to cpu frequency
				56
				57	*** 1.1 Heterogeneous systems
				58
				59	Heterogeneous systems have cpus that differ with regard to their performance and
				60	power characteristics. Some cpus could offer peak performance better than
				61	others, although at cost of consuming more power. We shall refer such cpus as
				62	"high performance" or "performance efficient" cpus. Other cpus that offer lesser
				63	peak performance are referred to as "power efficient".
				64
				65	In this situation the scheduler is tasked with the responsibility of assigning
				66	tasks to run on the right cpus where their performance requirements can be met
				67	at the least expense of power.
				68
				69	Achieving that goal is made complicated by the fact that the scheduler has
				70	little clue about performance requirements of tasks and how they may change by
				71	running on power or performance efficient cpus! One simplifying assumption here
				72	could be that a task's desire for more performance is expressed by its cpu
				73	utilization. A task demanding high cpu utilization on a power-efficient cpu
				74	would likely improve in its performance by running on a performance-efficient
				75	cpu. This idea forms the basis for HMP-related scheduler extensions.
				76
				77	Key inputs required by the HMP scheduler for its task placement decisions are:
				78
				79	a) task load - this reflects cpu utilization or demand of tasks
				80	b) CPU capacity - this reflects peak performance offered by cpus
				81	c) CPU power - this reflects power or energy cost of cpus
				82
				83	Once all 3 pieces of information are available, the HMP scheduler can place
				84	tasks on the lowest power cpus where their demand can be satisfied.
				85
				86	*** 1.2 CPU Frequency guidance
				87
				88	A somewhat separate but related goal of the scheduler extensions described here
				89	is to provide guidance to the cpufreq governor on the need to change cpu
				90	frequency. Most governors that control cpu frequency work on a reactive basis.
				91	CPU utilization is sampled at regular intervals, based on which the need to
				92	change frequency is determined. Higher utilization leads to a frequency increase
				93	and vice-versa. There are several problems with this approach that scheduler
				94	can help resolve.
				95
				96	a) latency
				97
				98	Reactive nature introduces latency for cpus to ramp up to desired speed
				99	which can hurt application performance. This is inevitable as cpufreq
				100	governors can only track cpu utilization as a whole and not tasks which
				101	are driving that demand. Scheduler can however keep track of individual
				102	task demand and can alert the governor on changing task activity. For
				103	example, request raise in frequency when tasks activity is increasing on
				104	a cpu because of wakeup or migration or request frequency to be lowered
				105	when task activity is decreasing because of sleep/exit or migration.
				106
				107	b) part-picture
				108
				109	Most governors track utilization of each CPU independently. When a task
				110	migrates from one cpu to another the task's execution time is split
				111	across the two cpus. The governor can fail to see the full picture of
				112	task demand in this case and thus the need for increasing frequency,
				113	affecting the task's performance. Scheduler can keep track of task
				114	migrations, fix up busy time upon migration and report per-cpu busy time
				115	to the governor that reflects task demand accurately.
				116
				117	The rest of this document explains key enhancements made to the scheduler to
				118	accomplish both of the aforementioned goals.
				119
				120	====================================
				121	2. WINDOW-BASED LOAD TRACKING SCHEME
				122	====================================
				123
				124	As mentioned in the introduction section, knowledge of the CPU demand exerted by
				125	a task is a prerequisite to knowing where to best place the task in an HMP
				126	system. The per-entity load tracking (PELT) scheme, present in Linux kernel
				127	since v3.7, has some perceived shortcomings when used to place tasks on HMP
				128	systems or provide recommendations on CPU frequency.
				129
				130	Per-entity load tracking does not make a distinction between the ramp up
				131	vs ramp down time of task load. It also decays task load without exception when
				132	a task sleeps. As an example, a cpu bound task at its peak load (LOAD_AVG_MAX or
				133	47742) can see its load decay to 0 after a sleep of just 213ms! A cpu-bound task
				134	running on a performance-efficient cpu could thus get re-classified as not
				135	requiring such a cpu after a short sleep. In the case of mobile workloads, tasks
				136	could go to sleep due to a lack of user input. When they wakeup it is very
				137	likely their cpu utilization pattern repeats. Resetting their load across sleep
				138	and incurring latency to reclassify them as requiring a high performance cpu can
				139	hurt application performance.
				140
				141	The window-based load tracking scheme described in this document avoids these
				142	drawbacks. It keeps track of N windows of execution for every task. Windows
				143	where a task had no activity are ignored and not recorded. N can be tuned at
				144	compile time (RAVG_HIST_SIZE defined in include/linux/sched.h) or at runtime
				145	(/proc/sys/kernel/sched_ravg_hist_size). The window size, W, is common for all
				146	tasks and currently defaults to 10ms ('sched_ravg_window' defined in
				147	kernel/sched/core.c). The window size can be tuned at boot time via the
				148	sched_ravg_window=W argument to kernel. Alternately it can be tuned after boot
				149	via tunables provided by the interactive governor. More on this later.
				150
				151	Based on the N samples available per-task, a per-task "demand" attribute is
				152	calculated which represents the cpu demand of that task. The demand attribute is
				153	used to classify tasks as to whether or not they need a performance-efficient
				154	CPU and also serves to provide inputs on frequency to the cpufreq governor. More
				155	on this later. The 'sched_window_stats_policy' tunable (defined in
				156	kernel/sched/core.c) controls how the demand field for a task is derived from
				157	its N past samples.
				158
				159	*** 2.1 Synchronized windows
				160
				161	Windows of observation for task activity are synchronized across cpus. This
				162	greatly aids in the scheduler's frequency guidance feature. Scheduler currently
				163	relies on a synchronized clock (sched_clock()) for this feature to work. It may
				164	be possible to extend this feature to work on systems having an unsynchronized
				165	sched_clock().
				166
				167	struct rq {
				168
				169	..
				170
				171	u64 window_start;
				172
				173	..
				174	};
				175
				176	The 'window_start' attribute represents the time when current window began on a
				177	cpu. It is updated when key task events such as wakeup or context-switch call
				178	update_task_ravg() to record task activity. The window_start value is expected
				179	to be the same for all cpus, although it could be behind on some cpus where it
				180	has not yet been updated because update_task_ravg() has not been recently
				181	called. For example, when a cpu is idle for a long time its window_start could
				182	be stale. The window_start value for such cpus is rolled forward upon
				183	occurrence of a task event resulting in a call to update_task_ravg().
				184
				185	*** 2.2 struct ravg
				186
				187	The ravg struct contains information tracked per-task.
				188
				189	struct ravg {
				190	u64 mark_start;
				191	u32 sum, demand;
				192	u32 sum_history[RAVG_HIST_SIZE];
				193	};
				194
				195	struct task_struct {
				196
				197	..
				198
				199	struct ravg ravg;
				200
				201	..
				202	};
				203
				204	sum_history[] - stores cpu utilization samples from N previous windows
				205	where task had activity
				206
				207	sum - stores cpu utilization of the task in its most recently
				208	tracked window. Once the corresponding window terminates,
				209	'sum' will be pushed into the sum_history[] array and is then
				210	reset to 0. It is possible that the window corresponding to
				211	sum is not the current window being tracked on a cpu. For
				212	example, a task could go to sleep in window X and wakeup in
				213	window Y (Y > X). In this case, sum would correspond to the
				214	task's activity seen in window X. When update_task_ravg() is
				215	called during the task's wakeup event it will be seen that
				216	window X has elapsed. The sum value will be pushed to
				217	'sum_history[]' array before being reset to 0.
				218
				219	demand - represents task's cpu demand and is derived from the
				220	elements in sum_history[]. The section on
				221	'sched_window_stats_policy' provides more details on how
				222	'demand' is derived from elements in sum_history[] array
				223
				224	mark_start - records timestamp of the beginning of the most recent task
				225	event. See section on 'Task events' for possible events that
				226	update 'mark_start'
				227
				228	curr_window - this is described in the section on 'Frequency guidance'
				229
				230	prev_window - this is described in the section on 'Frequency guidance'
				231
				232
				233	*** 2.3 Scaling load statistics
				234
				235	Time required for a task to complete its work (and hence its load) depends on,
				236	among various other factors, cpu frequency and its efficiency. In a HMP system,
				237	some cpus are more performance efficient than others. Performance efficiency of
				238	a cpu can be described by its "instructions-per-cycle" (IPC) attribute. History
				239	of task execution could involve task having run at different frequencies and on
				240	cpus with different IPC attributes. To avoid ambiguity of how task load relates
				241	to the frequency and IPC of cpus on which a task has run, task load is captured
				242	in a scaled form, with scaling being done in reference to an "ideal" cpu that
				243	has best possible IPC and frequency. Such an "ideal" cpu, having the best
				244	possible frequency and IPC, may or may not exist in system.
				245
				246	As an example, consider a HMP system, with two types of cpus, A53 and A57. A53
				247	has IPC count of 1024 and can run at maximum frequency of 1 GHz, while A57 has
				248	IPC count of 2048 and can run at maximum frequency of 2 GHz. Ideal cpu in this
				249	case is A57 running at 2 GHz.
				250
				251	A unit of work that takes 100ms to finish on A53 running at 100MHz would get
				252	done in 10ms on A53 running at 1GHz, in 5 ms running on A57 at 1 GHz and 2.5ms
				253	on A57 running at 2 GHz. Thus a load of 100ms can be expressed as 2.5ms in
				254	reference to ideal cpu of A57 running at 2 GHz.
				255
				256	In order to understand how much load a task will consume on a given cpu, its
				257	scaled load needs to be multiplied by a factor (load scale factor). In above
				258	example, scaled load of 2.5ms needs to be multiplied by a factor of 4 in order
				259	to estimate the load of task on A53 running at 1 GHz.
				260
				261	/proc/sched_debug provides IPC attribute and load scale factor for every cpu.
				262
				263	In summary, task load information stored in a task's sum_history[] array is
				264	scaled for both frequency and efficiency. If a task runs for X ms, then the
				265	value stored in its 'sum' field is derived as:
				266
				267	X_s = X * (f_cur / max_possible_freq) *
				268	(efficiency / max_possible_efficiency)
				269
				270	where:
				271
				272	X = cpu utilization that needs to be accounted
				273	X_s = Scaled derivative of X
				274	f_cur = current frequency of the cpu where the task was
				275	running
				276	max_possible_freq = maximum possible frequency (across all cpus)
				277	efficiency = instructions per cycle (IPC) of cpu where task was
				278	running
				279	max_possible_efficiency = maximum IPC offered by any cpu in system
				280
				281
				282	*** 2.4 sched_window_stats_policy
				283
				284	sched_window_stats_policy controls how the 'demand' attribute for a task is
				285	derived from elements in its 'sum_history[]' array.
				286
				287	WINDOW_STATS_RECENT (0)
				288	demand = recent
				289
				290	WINDOW_STATS_MAX (1)
				291	demand = max
				292
				293	WINDOW_STATS_MAX_RECENT_AVG (2)
				294	demand = maximum(average, recent)
				295
				296	WINDOW_STATS_AVG (3)
				297	demand = average
				298
				299	where:
				300	M = history size specified by
				301	/proc/sys/kernel/sched_ravg_hist_size
				302	average = average of first M samples found in the sum_history[] array
				303	max = maximum value of first M samples found in the sum_history[]
				304	array
				305	recent = most recent sample (sum_history[0])
				306	demand = demand attribute found in 'struct ravg'
				307
				308	This policy can be changed at runtime via
				309	/proc/sys/kernel/sched_window_stats_policy. For example, the command
				310	below would select WINDOW_STATS_USE_MAX policy
				311
				312	echo 1 > /proc/sys/kernel/sched_window_stats_policy
				313
				314	*** 2.5 Task events
				315
				316	A number of events results in the window-based stats of a task being
				317	updated. These are:
				318
				319	PICK_NEXT_TASK - the task is about to start running on a cpu
				320	PUT_PREV_TASK - the task stopped running on a cpu
				321	TASK_WAKE - the task is waking from sleep
				322	TASK_MIGRATE - the task is migrating from one cpu to another
				323	TASK_UPDATE - this event is invoked on a currently running task to
				324	update the task's window-stats and also the cpu's
				325	window-stats such as 'window_start'
				326	IRQ_UPDATE - event to record the busy time spent by an idle cpu
				327	processing interrupts
				328
				329	*** 2.6 update_task_ravg()
				330
				331	update_task_ravg() is called to mark the beginning of an event for a task or a
				332	cpu. It serves to accomplish these functions:
				333
				334	a. Update a cpu's window_start value
				335	b. Update a task's window-stats (sum, sum_history[], demand and mark_start)
				336
				337	In addition update_task_ravg() updates the busy time information for the given
				338	cpu, which is used for frequency guidance. This is described further in section
				339	6.
				340
				341	*** 2.7 update_history()
				342
				343	update_history() is called on a task to record its activity in an elapsed
				344	window. 'sum', which represents task's cpu demand in its elapsed window is
				345	pushed onto sum_history[] array and its 'demand' attribute is updated based on
				346	the sched_window_stats_policy in effect.
				347
				348	*** 2.8 Initial task load attribute for a task (init_load_pct)
				349
				350	In some cases, it may be desirable for children of a task to be assigned a
				351	"high" load so that they can start running on best capacity cluster. By default,
				352	newly created tasks are assigned a load defined by tunable sched_init_task_load
				353	(Sec 7.8). Some specialized tasks may need a higher value than the global
				354	default for their child tasks. This will let child tasks run on cpus with best
				355	capacity. This is accomplished by setting the 'initial task load' attribute
				356	(init_load_pct) for a task. Child tasks starting load (ravg.demand and
				357	ravg.sum_history[]) is initialized from their parent's 'initial task load'
				358	attribute. Note that child task's 'initial task load' attribute itself will be 0
				359	by default (i.e it is not inherited from parent).
				360
				361	A task's 'initial task load' attribute can be set in two ways:
				362
				363	**** /proc interface
				364
				365	/proc/[pid]/sched_init_task_load can be written to for setting a task's 'initial
				366	task load' attribute. A numeric value between 0 - 100 (in percent scale) is
				367	accepted for task's 'initial task load' attribute.
				368
				369	Reading /proc/[pid]/sched_init_task_load returns the 'initial task load'
				370	attribute for the given task.
				371
				372	**** kernel API
				373
				374	Following kernel APIs are provided to set or retrieve a given task's 'initial
				375	task load' attribute:
				376
				377	int sched_set_init_task_load(struct task_struct *p, int init_load_pct);
				378	int sched_get_init_task_load(struct task_struct *p);
				379
				380
				381	===============
				382	3. CPU CAPACITY
				383	===============
				384
				385	CPU capacity reflects peak performance offered by a cpu. It is defined both by
				386	maximum frequency at which cpu can run and its efficiency attribute. Capacity of
				387	a cpu is defined in reference to "least" performing cpu such that "least"
				388	performing cpu has capacity of 1024.
				389
				390	capacity = 1024 * (fmax_cur * / min_max_freq) *
				391	(efficiency / min_possible_efficiency)
				392
				393	where:
				394
				395	fmax_cur = maximum frequency at which cpu is currently
				396	allowed to run at
				397	efficiency = IPC of cpu
				398	min_max_freq = max frequency at which "least" performing cpu
				399	can run
				400	min_possible_efficiency = IPC of "least" performing cpu
				401
				402	'fmax_cur' reflects the fact that a cpu may be constrained at runtime to run at
				403	a maximum frequency less than what is supported. This may be a constraint placed
				404	by user or drivers such as thermal that intends to reduce temperature of a cpu
				405	by restricting its maximum frequency.
				406
				407	'max_possible_capacity' reflects the maximum capacity of a cpu based on the
				408	maximum frequency it supports.
				409
				410	max_possible_capacity = 1024 * (fmax * / min_max_freq) *
				411	(efficiency / min_possible_efficiency)
				412
				413	where:
				414	fmax = maximum frequency supported by a cpu
				415
				416	/proc/sched_debug lists capacity and maximum_capacity information for a cpu.
				417
				418	In the example HMP system quoted in Sec 2.3, "least" performing CPU is A53 and
				419	thus min_max_freq = 1GHz and min_possible_efficiency = 1024.
				420
				421	Capacity of A57 = 1024 * (2GHz / 1GHz) * (2048 / 1024) = 4096
				422	Capacity of A53 = 1024 * (1GHz / 1GHz) * (1024 / 1024) = 1024
				423
				424	Capacity of A57 when constrained to run at maximum frequency of 500MHz can be
				425	calculated as:
				426
				427	Capacity of A57 = 1024 * (500MHz / 1GHz) * (2048 / 1024) = 1024
				428
				429	*** 3.1 load_scale_factor
				430
				431	'lsf' or load scale factor attribute of a cpu is used to estimate load of a task
				432	on that cpu when running at its fmax_cur frequency. 'lsf' is defined in
				433	reference to "best" performing cpu such that it's lsf is 1024. 'lsf' for a cpu
				434	is defined as:
				435
				436	lsf = 1024 * (max_possible_freq / fmax_cur) *
				437	(max_possible_efficiency / ipc)
				438
				439	where:
				440	fmax_cur = maximum frequency at which cpu is currently
				441	allowed to run at
				442	ipc = IPC of cpu
				443	max_possible_freq = max frequency at which "best" performing cpu
				444	can run
				445	max_possible_efficiency = IPC of "best" performing cpu
				446
				447	In the example HMP system quoted in Sec 2.3, "best" performing CPU is A57 and
				448	thus max_possible_freq = 2 GHz, max_possible_efficiency = 2048
				449
				450	lsf of A57 = 1024 * (2GHz / 2GHz) * (2048 / 2048) = 1024
				451	lsf of A53 = 1024 * (2GHz / 1 GHz) * (2048 / 1024) = 4096
				452
				453	lsf of A57 constrained to run at maximum frequency of 500MHz can be calculated
				454	as:
				455
				456	lsf of A57 = 1024 * (2GHz / 500Mhz) * (2048 / 2048) = 4096
				457
				458	To estimate load of a task on a given cpu running at its fmax_cur:
				459
				460	load = scaled_load * lsf / 1024
				461
				462	A task with scaled load of 20% would thus be estimated to consume 80% bandwidth
				463	of A53 running at 1GHz. The same task with scaled load of 20% would be estimated
				464	to consume 160% bandwidth on A53 constrained to run at maximum frequency of
				465	500MHz.
				466
				467	load_scale_factor, thus, is very useful to estimate load of a task on a given
				468	cpu and thus to decide whether it can fit in a cpu or not.
				469
				470	*** 3.2 cpu_power
				471
				472	A metric 'cpu_power' related to 'capacity' is also listed in /proc/sched_debug.
				473	'cpu_power' is ideally same for all cpus (1024) when they are idle and running
				474	at the same frequency. 'cpu_power' of a cpu can be scaled down from its ideal
				475	value to reflect reduced frequency it is operating at and also to reflect the
				476	amount of cpu bandwidth consumed by real-time tasks executing on it.
				477	'cpu_power' metric is used by scheduler to decide task load distribution among
				478	cpus. CPUs with low 'cpu_power' will be assigned less task load compared to cpus
				479	with higher 'cpu_power'
				480
				481	============
				482	4. CPU POWER
				483	============
				484
				485	The HMP scheduler extensions currently depend on an architecture-specific driver
				486	to provide runtime information on cpu power. In the absence of an
				487	architecture-specific driver, the scheduler will resort to using the
				488	max_possible_capacity metric of a cpu as a measure of its power.
				489
				490	================
				491	5. HMP SCHEDULER
				492	================
				493
				494	For normal (SCHED_OTHER/fair class) tasks there are three paths in the
				495	scheduler which these HMP extensions affect. The task wakeup path, the
				496	load balancer, and the scheduler tick are each modified.
				497
				498	Real-time and stop-class tasks are served by different code
				499	paths. These will be discussed separately.
				500
				501	Prior to delving further into the algorithm and implementation however
				502	some definitions are required.
				503
				504	*** 5.1 Classification of Tasks and CPUs
				505
				506	With the extensions described thus far, the following information is
				507	available to the HMP scheduler:
				508
				509	- per-task CPU demand information from either Per-Entity Load Tracking
				510	(PELT) or the window-based algorithm described above
				511
				512	- a power value for each frequency supported by each CPU via the API
				513	described in section 4
				514
				515	- current CPU frequency, maximum CPU frequency (may be throttled by at
				516	runtime due to thermal conditions), maximum possible CPU frequency supported
				517	by hardware
				518
				519	- data previously maintained within the scheduler such as the number
				520	of currently runnable tasks on each CPU
				521
				522	Combined with tunable parameters, this information can be used to classify
				523	both tasks and CPUs to aid in the placement of tasks.
				524
				525	- big task
				526
				527	A big task is one that exerts a CPU demand too high for a particular
				528	CPU to satisfy. The scheduler will attempt to find a CPU with more
				529	capacity for such a task.
				530
				531	The definition of "big" is specific to a task and a CPU. A task
				532	may be considered big on one CPU in the system and not big on
				533	another if the first CPU has less capacity than the second.
				534
				535	What task demand is "too high" for a particular CPU? One obvious
				536	answer would be a task demand which, as measured by PELT or
				537	window-based load tracking, matches or exceeds the capacity of that
				538	CPU. A task which runs on a CPU for a long time, for example, might
				539	meet this criteria as it would report 100% demand of that CPU. It
				540	may be desirable however to classify tasks which use less than 100%
				541	of a particular CPU as big so that the task has some "headroom" to grow
				542	without its CPU bandwidth getting capped and its performance requirements
				543	not being met. This task demand is therefore a tunable parameter:
				544
				545	/proc/sys/kernel/sched_upmigrate
				546
				547	This value is a percentage. If a task consumes more than this much of a
				548	particular CPU, that CPU will be considered too small for the task. The task
				549	will thus be seen as a "big" task on the cpu and will reflect in nr_big_tasks
				550	statistics maintained for that cpu. Note that certain tasks (whose nice
				551	value exceeds SCHED_UPMIGRATE_MIN_NICE value or those that belong to a cgroup
				552	whose upmigrate_discourage flag is set) will never be classified as big tasks
				553	despite their high demand.
				554
				555	As the load scale factor is calculated against current fmax, it gets boosted
				556	when a lower capacity CPU is restricted to run at lower fmax. The task
				557	demand is inflated in this scenario and the task upmigrates early to the
				558	maximum capacity CPU. Hence this threshold is auto-adjusted by a factor
				559	equal to max_possible_frequency/current_frequency of a lower capacity CPU.
				560	This adjustment happens only when the lower capacity CPU frequency is
				561	restricted. The same adjustment is applied to the downmigrate threshold
				562	as well.
				563
				564	When the frequency restriction is relaxed, the previous values are restored.
				565	sched_up_down_migrate_auto_update macro defined in kernel/sched/core.c
				566	controls this auto-adjustment behavior and it is enabled by default.
				567
				568	If the adjusted upmigrate threshold exceeds the window size, it is clipped to
				569	the window size. If the adjusted downmigrate threshold decreases the difference
				570	between the upmigrate and downmigrate, it is clipped to a value such that the
				571	difference between the modified and the original thresholds is same.
				572
				573	- spill threshold
				574
				575	Tasks will normally be placed on lowest power-cost cluster where they can fit.
				576	This could result in power-efficient cluster becoming overcrowded when there
				577	are "too" many low-demand tasks. Spill threshold provides a spill over
				578	criteria, wherein low-demand task are allowed to be placed on idle or
				579	busy cpus in high-performance cluster.
				580
				581	Scheduler will avoid placing a task on a cpu if it can result in cpu exceeding
				582	its spill threshold, which is defined by two tunables:
				583
				584	/proc/sys/kernel/sched_spill_nr_run (default: 10)
				585	/proc/sys/kernel/sched_spill_load (default : 100%)
				586
				587	A cpu is considered to be above its spill level if it already has 10 tasks or
				588	if the sum of task load (scaled in reference to given cpu) and
				589	rq->cumulative_runnable_avg exceeds 'sched_spill_load'.
				590
				591	- power band
				592
				593	The scheduler may be faced with a tradeoff between power and performance when
				594	placing a task. If the scheduler sees two CPUs which can accommodate a task:
				595
				596	CPU 1, power cost of 20, load of 10
				597	CPU 2, power cost of 10, load of 15
				598
				599	It is not clear what the right choice of CPU is. The HMP scheduler
				600	offers the sched_powerband_limit tunable to determine how this
				601	situation should be handled. When the power delta between two CPUs
				602	is less than sched_powerband_limit_pct, load will be prioritized as
				603	the deciding factor as to which CPU is selected. If the power delta
				604	between two CPUs exceeds that, the lower power CPU is considered to
				605	be in a different "band" and it is selected, despite perhaps having
				606	a higher current task load.
				607
				608	*** 5.2 select_best_cpu()
				609
				610	CPU placement decisions for a task at its wakeup or creation time are the
				611	most important decisions made by the HMP scheduler. This section will describe
				612	the call flow and algorithm used in detail.
				613
				614	The primary entry point for a task wakeup operation is try_to_wake_up(),
				615	located in kernel/sched/core.c. This function relies on select_task_rq() to
				616	determine the target CPU for the waking task. For fair-class (SCHED_OTHER)
				617	tasks, that request will be routed to select_task_rq_fair() in
				618	kernel/sched/fair.c. As part of these scheduler extensions a hook has been
				619	inserted into the top of that function. If HMP scheduling is enabled the normal
				620	scheduling behavior will be replaced by a call to select_best_cpu(). This
				621	function, select_best_cpu(), represents the heart of the HMP scheduling
				622	algorithm described in this document. Note that select_best_cpu() is also
				623	invoked for a task being created.
				624
				625	The behavior of select_best_cpu() depends on several factors such as boost
				626	setting, choice of several tunables and on task demand.
				627
				628	**** 5.2.1 Boost
				629
				630	The task placement policy changes signifincantly when scheduler boost is in
				631	effect. When boost is in effect the scheduler ignores the power cost of
				632	placing tasks on CPUs. Instead it figures out the load on each CPU and then
				633	places task on the least loaded CPU. If the load of two or more CPUs is the
				634	same (generally when CPUs are idle) the task prefers to go highest capacity
				635	CPU in the system.
				636
				637	A further enhancement during boost is the scheduler' early detection feature.
				638	While boost is in effect the scheduler checks for the precence of tasks that
				639	have been runnable for over some period of time within the tick. For such
				640	tasks the scheduler informs the governor of imminent need for high frequency.
				641	If there exists a task on the runqueue at the tick that has been runnable
				642	for greater than SCHED_EARLY_DETECTION_DURATION amount of time, it notifies
				643	the governor with a fabricated load of the full window at the highest
				644	frequency. The fabricated load is maintained until the task is no longer
				645	runnable or until the next tick.
				646
				647	Boost can be set via either /proc/sys/kernel/sched_boost or by invoking
				648	kernel API sched_set_boost().
				649
				650	int sched_set_boost(int enable);
				651
				652	Once turned on, boost will remain in effect until it is explicitly turned off.
				653	To allow for boost to be controlled by multiple external entities (application
				654	or kernel module) at same time, boost setting is reference counted. This means
				655	that two applications can turn on boost and the effect of boost is eliminated
				656	only after both applications have turned off boost. boost_refcount variable
				657	represents this reference count.
				658
				659	**** 5.2.2 task_will_fit()
				660
				661	The overall goal of select_best_cpu() is to place a task on the least power
				662	cluster where it can "fit" i.e where its cpu usage shall be below the capacity
				663	offered by cluster. Criteria for a task to be considered as fitting in a cluster
				664	is:
				665
				666	i) A low-priority task, whose nice value is greater than
				667	SCHED_UPMIGRATE_MIN_NICE or whose cgroup has its
				668	upmigrate_discourage flag set, is considered to be fitting in all clusters,
				669	irrespective of their capacity and task's cpu demand.
				670
				671	ii) All tasks are considered to fit in highest capacity cluster.
				672
				673	iii) Task demand scaled in reference to the given cluster should be less than a
				674	threshold. See section on load_scale_factor to know more about how task
				675	demand is scaled in reference to a given cpu (cluster). The threshold used
				676	is normally sched_upmigrate. Its possible for a task's demand to exceed
				677	sched_upmigrate threshold in reference to a cluster when its upmigrated to
				678	higher capacity cluster. To prevent it from coming back immediately to
				679	lower capacity cluster, the task is not considered to "fit" on its earlier
				680	cluster until its demand has dropped below sched_downmigrate in reference
				681	to that earlier cluster. sched_downmigrate thus provides for some
				682	hysteresis control.
				683
				684
				685	**** 5.2.3 Factors affecting select_best_cpu()
				686
				687	Behavior of select_best_cpu() is further controlled by several tunables and
				688	synchronous nature of wakeup.
				689
				690	a. /proc/sys/kernel/sched_cpu_high_irqload
				691	A cpu whose irq load is greater than this threshold will not be
				692	considered eligible for placement. This threshold value in expressed in
				693	nanoseconds scale, with default threshold being 10000000 (10ms). See
				694	notes on sched_cpu_high_irqload tunable to understand how irq load on a
				695	cpu is measured.
				696
				697	b. Synchronous nature of wakeup
				698	Synchronous wakeup is a hint to scheduler that the task issuing wakeup
				699	(i.e task currently running on cpu where wakeup is being processed by
				700	scheduler) will "soon" relinquish CPU. A simple example is two tasks
				701	communicating with each other using a pipe structure. When reader task
				702	blocks waiting for data, its woken by writer task after it has written
				703	data to pipe. Writer task usually blocks waiting for reader task to
				704	consume data in pipe (which may not have any more room for writes).
				705
				706	Synchronous wakeup is accounted for by adjusting load of a cpu to not
				707	include load of currently running task. As a result, a cpu that has only
				708	one runnable task and which is currently processing synchronous wakeup
				709	will be considered idle.
				710
				711	c. PF_WAKE_UP_IDLE
				712	Any task with this flag set will be woken up to an idle cpu (if one is
				713	available) independent of sched_prefer_idle flag setting, its demand and
				714	synchronous nature of wakeup. Similarly idle cpu is preferred during
				715	wakeup for any task that does not have this flag set but is being woken
				716	by a task with PF_WAKE_UP_IDLE flag set. For simplicity, we will use the
				717	term "PF_WAKE_UP_IDLE wakeup" to signify wakeups involving a task with
				718	PF_WAKE_UP_IDLE flag set.
				719
				720	d. /proc/sys/kernel/sched_select_prev_cpu_us
				721	This threshold controls whether task placement goes through fast path or
				722	not. If task's wakeup time since last sleep is short there are high
				723	chances that it's better to place the task on its previous CPU. This
				724	reduces task placement latency, cache miss and number of migrations.
				725	Default value of sched_select_prev_cpu_us is 2000 (2ms). This can be
				726	turned off by setting it to 0.
				727
Srivatsa Vaddagiri	b36e661	2016-09-09 19:38:03 +0530	[diff] [blame]	728	e. /proc/sys/kernel/sched_short_burst_ns
				729	This threshold controls whether a task is considered as "short-burst"
				730	or not. "short-burst" tasks are eligible for packing to avoid overhead
				731	associated with waking up an idle CPU. "non-idle" CPUs which are not
				732	loaded with IRQs and can accommodate the waking task without exceeding
				733	spill limits are considered. The ties are broken with load followed
				734	by previous CPU. This tunable does not affect cluster selection.
				735	It only affects CPU selection in a given cluster. This packing is
				736	skipped for tasks that are eligible for "wake-up-idle" and "boost".
				737
Syed Rameez Mustafa	dddcab7	2016-09-07 16:18:27 -0700	[diff] [blame]	738	**** 5.2.4 Wakeup Logic for Task "p"
				739
				740	Wakeup task placement logic is as follows:
				741
				742	1) Eliminate CPUs with high irq load based on sched_cpu_high_irqload tunable.
				743
				744	2) Eliminate CPUs where either the task does not fit or CPUs where placement
				745	will result in exceeding the spill threshold tunables. CPUs elimiated at this
				746	stage will be considered as backup choices incase none of the CPUs get past
				747	this stage.
				748
				749	3) Find out and return the least power CPU that satisfies all conditions above.
				750
				751	4) If two or more CPUs are projected to have the same power, break ties in the
				752	following preference order:
				753	a) The CPU is the task's previous CPU.
				754	b) The CPU is in the same cluster as the task's previous CPU.
				755	c) The CPU has the least load
				756
				757	The placement logic described above does not apply when PF_WAKE_UP_IDLE is set
				758	for either the waker task or the wakee task. Instead the scheduler chooses the
				759	most power efficient idle CPU.
				760
				761	5) If no CPU is found after step 2, resort to backup CPU selection logic
				762	whereby the CPU with highest amount of spare capacity is selected.
				763
				764	6) If none of the CPUs have any spare capacity, return the task's previous
				765	CPU.
				766
				767	*** 5.3 Scheduler Tick
				768
				769	Every CPU is interrupted periodically to let kernel update various statistics
				770	and possibly preempt the currently running task in favor of a waiting task. This
				771	periodicity, determined by CONFIG_HZ value, is set at 10ms. There are various
				772	optimizations by which a CPU however can skip taking these interrupts (ticks).
				773	A cpu going idle for considerable time in one such case.
				774
				775	HMP scheduler extensions brings in a change in processing of tick
				776	(scheduler_tick()) that can result in task migration. In case the currently
				777	running task on a cpu belongs to fair_sched class, a check is made if it needs
				778	to be migrated. Possible reasons for migrating task could be:
				779
				780	a) A big task is running on a power-efficient cpu and a high-performance cpu is
				781	available (idle) to service it
				782
				783	b) A task is starving on a CPU with high irq load.
				784
				785	c) A task with upmigration discouraged is running on a performance cluster.
				786	See notes on 'cpu.upmigrate_discourage'.
				787
				788	In case the test for migration turns out positive (which is expected to be rare
				789	event), a candidate cpu is identified for task migration. To avoid multiple task
				790	migrations to the same candidate cpu(s), identification of candidate cpu is
				791	serialized via global spinlock (migration_lock).
				792
				793	*** 5.4 Load Balancer
				794
				795	Load balance is a key functionality of scheduler that strives to distribute task
				796	across available cpus in a "fair" manner. Most of the complexity associated with
				797	this feature involves balancing fair_sched class tasks. Changes made to load
				798	balance code serve these goals:
				799
				800	1. Restrict flow of tasks from power-efficient cpus to high-performance cpu.
				801	Provide a spill-over threshold, defined in terms of number of tasks
				802	(sched_spill_nr_run) and cpu demand (sched_spill_load), beyond which tasks
				803	can spill over from power-efficient cpu to high-performance cpus.
				804
				805	2. Allow idle power-efficient cpus to pick up extra load from over-loaded
				806	performance-efficient cpu
				807
				808	3. Allow idle high-performance cpu to pick up big tasks from power-efficient cpu
				809
				810	*** 5.5 Real Time Tasks
				811
				812	Minimal changes introduced in treatment of real-time tasks by HMP scheduler
				813	aims at preferring scheduling of real-time tasks on cpus with low load on
				814	a power efficient cluster.
				815
				816	Prior to HMP scheduler, the fast-path cpu selection for placing a real-time task
				817	(at wakeup) is its previous cpu, provided the currently running task on its
				818	previous cpu is not a real-time task or a real-time task with lower priority.
				819	Failing this, cpu selection in slow-path involves building a list of candidate
				820	cpus where the waking real-time task will be of highest priority and thus can be
				821	run immediately. The first cpu from this candidate list is chosen for the waking
				822	real-time task. Much of the premise for this simple approach is the assumption
				823	that real-time tasks often execute for very short intervals and thus the focus
				824	is to place them on a cpu where they can be run immediately.
				825
				826	HMP scheduler brings in a change which avoids fast-path and always resorts to
				827	slow-path. Further cpu with lowest load in a power efficient cluster from
				828	candidate list of cpus is chosen as cpu for placing waking real-time task.
				829
				830	- PF_WAKE_UP_IDLE
				831
				832	Idle cpu is preferred for any waking task that has this flag set in its
				833	'task_struct.flags' field. Further idle cpu is preferred for any task woken by
				834	such tasks. PF_WAKE_UP_IDLE flag of a task is inherited by it's children. It can
				835	be modified for a task in two ways:
				836
				837	> kernel-space interface
				838	set_wake_up_idle() needs to be called in the context of a task
				839	to set or clear its PF_WAKE_UP_IDLE flag.
				840
				841	> user-space interface
				842	/proc/[pid]/sched_wake_up_idle file needs to be written to for
				843	setting or clearing PF_WAKE_UP_IDLE flag for a given task
				844
				845	=====================
				846	6. FREQUENCY GUIDANCE
				847	=====================
				848
				849	As mentioned in the introduction section the scheduler is in a unique
				850	position to assist with the determination of CPU frequency. Because
				851	the scheduler now maintains an estimate of per-task CPU demand, task
				852	activity can be tracked, aggregated and provided to the CPUfreq
				853	governor as a replacement for simple CPU busy time.
				854
				855	Two of the most popular CPUfreq governors, interactive and ondemand,
				856	utilize a window-based approach for measuring CPU busy time. This
				857	works well with the window-based load tracking scheme previously
				858	described. The following APIs are provided to allow the CPUfreq
				859	governor to query busy time from the scheduler instead of using the
				860	basic CPU busy time value derived via get_cpu_idle_time_us() and
				861	get_cpu_iowait_time_us() APIs.
				862
				863	int sched_set_window(u64 window_start, unsigned int window_size)
				864
				865	This API is invoked by governor at initialization time or whenever
				866	window size is changed. 'window_size' argument (in jiffy units)
				867	indicates the size of window to be used. The first window of size
				868	'window_size' is set to begin at jiffy 'window_start'
				869
				870	-EINVAL is returned if per-entity load tracking is in use rather
				871	than window-based load tracking, otherwise a success value of 0
				872	is returned.
				873
				874	int sched_get_busy(int cpu)
				875
				876	Returns the busy time for the given CPU in the most recent
				877	complete window. The value returned is microseconds of busy
				878	time at fmax of given CPU.
				879
				880	The values returned by sched_get_busy() take a bit of explanation,
				881	both in what they mean and also how they are derived.
				882
				883	*** 6.1 Per-CPU Window-Based Stats
				884
				885	In addition to the per-task window-based demand, the HMP scheduler
				886	extensions also track the aggregate demand seen on each CPU. This is
				887	done using the same windows that the task demand is tracked with
				888	(which is in turn set by the governor when frequency guidance is in
				889	use). There are four quantities maintained for each CPU by the HMP scheduler:
				890
				891	curr_runnable_sum: aggregate demand from all tasks which executed during
				892	the current (not yet completed) window
				893
				894	prev_runnable_sum: aggregate demand from all tasks which executed during
				895	the most recent completed window
				896
				897	nt_curr_runnable_sum: aggregate demand from all 'new' tasks which executed
				898	during the current (not yet completed) window
				899
				900	nt_prev_runnable_sum: aggregate demand from all 'new' tasks which executed
				901	during the most recent completed window.
				902
				903	When the scheduler is updating a task's window-based stats it also
				904	updates these values. Like per-task window-based demand these
				905	quantities are normalized against the max possible frequency and max
				906	efficiency (instructions per cycle) in the system. If an update occurs
				907	and a window rollover is observed, curr_runnable_sum is copied into
				908	prev_runnable_sum before being reset to 0. The sched_get_busy() API
				909	returns prev_runnable_sum, scaled to the efficiency and fmax of given
				910	CPU. The same applies to nt_curr_runnable_sum and nt_prev_runnable_sum.
				911
				912	A 'new' task is defined as a task whose number of active windows since fork is
Pavankumar Kondeti	41b4166	2017-02-08 09:33:22 +0530	[diff] [blame]	913	less than SCHED_NEW_TASK_WINDOWS. An active window is defined as a window
Syed Rameez Mustafa	dddcab7	2016-09-07 16:18:27 -0700	[diff] [blame]	914	where a task was observed to be runnable.
				915
				916	*** 6.2 Per-task window-based stats
				917
				918	Corresponding to curr_runnable_sum and prev_runnable_sum, two counters are
				919	maintained per-task
				920
				921	curr_window - represents cpu demand of task in its most recently tracked
				922	window
				923	prev_window - represents cpu demand of task in the window prior to the one
				924	being tracked by curr_window
				925
				926	The above counters are resued for nt_curr_runnable_sum and
				927	nt_prev_runnable_sum.
				928
				929	"cpu demand" of a task includes its execution time and can also include its
				930	wait time. 'SCHED_FREQ_ACCOUNT_WAIT_TIME' controls whether task's wait
				931	time is included in its 'curr_window' and 'prev_window' counters or not.
				932
				933	Needless to say, curr_runnable_sum counter of a cpu is derived from curr_window
				934	counter of various tasks that ran on it in its most recent window.
				935
				936	*** 6.3 Effect of various task events
				937
				938	We now consider various events and how they affect above mentioned counters.
				939
				940	PICK_NEXT_TASK
				941	This represents beginning of execution for a task. Provided the task
				942	refers to a non-idle task, a portion of task's wait time that
				943	corresponds to the current window being tracked on a cpu is added to
				944	task's curr_window counter, provided SCHED_FREQ_ACCOUNT_WAIT_TIME is
				945	set. The same quantum is also added to cpu's curr_runnable_sum counter.
				946	The remaining portion, which corresponds to task's wait time in previous
				947	window is added to task's prev_window and cpu's prev_runnable_sum
				948	counters.
				949
				950	PUT_PREV_TASK
				951	This represents end of execution of a time-slice for a task, where the
				952	task could refer to a cpu's idle task also. In case the task is non-idle
				953	or (in case of task being idle with cpu having non-zero rq->nr_iowait
				954	count and sched_io_is_busy =1), a portion of task's execution time, that
				955	corresponds to current window being tracked on a cpu is added to task's
				956	curr_window_counter and also to cpu's curr_runnable_sum counter. Portion
				957	of task's execution that corresponds to the previous window is added to
				958	task's prev_window and cpu's prev_runnable_sum counters.
				959
				960	TASK_UPDATE
				961	This event is called on a cpu's currently running task and hence
				962	behaves effectively as PUT_PREV_TASK. Task continues executing after
				963	this event, until PUT_PREV_TASK event occurs on the task (during
				964	context switch).
				965
				966	TASK_WAKE
				967	This event signifies a task waking from sleep. Since many windows
				968	could have elapsed since the task went to sleep, its curr_window
				969	and prev_window are updated to reflect task's demand in the most
				970	recent and its previous window that is being tracked on a cpu.
				971
				972	TASK_MIGRATE
				973	This event signifies task migration across cpus. It is invoked on the
				974	task prior to being moved. Thus at the time of this event, the task
				975	can be considered to be in "waiting" state on src_cpu. In that way
				976	this event reflects actions taken under PICK_NEXT_TASK (i.e its
				977	wait time is added to task's curr/prev_window counters as well
				978	as src_cpu's curr/prev_runnable_sum counters, provided
				979	SCHED_FREQ_ACCOUNT_WAIT_TIME is non-zero). After that update,
				980	src_cpu's curr_runnable_sum is reduced by task's curr_window value
				981	and dst_cpu's curr_runnable_sum is increased by task's curr_window
				982	value. Similarly, src_cpu's prev_runnable_sum is reduced by task's
				983	prev_window value and dst_cpu's prev_runnable_sum is increased by
				984	task's prev_window value.
				985
				986	IRQ_UPDATE
				987	This event signifies end of execution of an interrupt handler. This
				988	event results in update of cpu's busy time counters, curr_runnable_sum
				989	and prev_runnable_sum, provided cpu was idle.
				990	When sched_io_is_busy = 0, only the interrupt handling time is added
				991	to cpu's curr_runnable_sum and prev_runnable_sum counters. When
				992	sched_io_is_busy = 1, the event mirrors actions taken under
				993	TASK_UPDATED event i.e time since last accounting of idle task's cpu
				994	usage is added to cpu's curr_runnable_sum and prev_runnable_sum
				995	counters.
				996
				997	===========
				998	7. TUNABLES
				999	===========
				1000
				1001	*** 7.1 sched_spill_load
				1002
				1003	Appears at: /proc/sys/kernel/sched_spill_load
				1004
				1005	Default value: 100
				1006
				1007	CPU selection criteria for fair-sched class tasks is the lowest power cpu where
				1008	they can fit. When the most power-efficient cpu where a task can fit is
				1009	overloaded (aggregate demand of tasks currently queued on it exceeds
				1010	sched_spill_load), a task can be placed on a higher-performance cpu, even though
				1011	the task strictly doesn't need one.
				1012
				1013	*** 7.2 sched_spill_nr_run
				1014
				1015	Appears at: /proc/sys/kernel/sched_spill_nr_run
				1016
				1017	Default value: 10
				1018
				1019	The intent of this tunable is similar to sched_spill_load, except it applies to
				1020	nr_running count of a cpu. A task can spill over to a higher-performance cpu
				1021	when the most power-efficient cpu where it can normally fit has more tasks than
				1022	sched_spill_nr_run.
				1023
				1024	*** 7.3 sched_upmigrate
				1025
				1026	Appears at: /proc/sys/kernel/sched_upmigrate
				1027
				1028	Default value: 80
				1029
				1030	This tunable is a percentage. If a task consumes more than this much
				1031	of a CPU, the CPU is considered too small for the task and the
				1032	scheduler will try to find a bigger CPU to place the task on.
				1033
				1034	*** 7.4 sched_init_task_load
				1035
				1036	Appears at: /proc/sys/kernel/sched_init_task_load
				1037
				1038	Default value: 15
				1039
				1040	This tunable is a percentage. When a task is first created it has no
				1041	history, so the task load tracking mechanism cannot determine a
				1042	historical load value to assign to it. This tunable specifies the
				1043	initial load value for newly created tasks. Also see Sec 2.8 on per-task
				1044	'initial task load' attribute.
				1045
				1046	*** 7.5 sched_ravg_hist_size
				1047
				1048	Appears at: /proc/sys/kernel/sched_ravg_hist_size
				1049
				1050	Default value: 5
				1051
				1052	This tunable controls the number of samples used from task's sum_history[]
				1053	array for determination of its demand.
				1054
				1055	*** 7.6 sched_window_stats_policy
				1056
				1057	Appears at: /proc/sys/kernel/sched_window_stats_policy
				1058
				1059	Default value: 2
				1060
				1061	This tunable controls the policy in how window-based load tracking
				1062	calculates an overall demand value based on the windows of CPU
				1063	utilization it has collected for a task.
				1064
				1065	Possible values for this tunable are:
				1066	0: Just use the most recent window sample of task activity when calculating
				1067	task demand.
				1068	1: Use the maximum value of first M samples found in task's cpu demand
				1069	history (sum_history[] array), where M = sysctl_sched_ravg_hist_size
				1070	2: Use the maximum of (the most recent window sample, average of first M
				1071	samples), where M = sysctl_sched_ravg_hist_size
				1072	3. Use average of first M samples, where M = sysctl_sched_ravg_hist_size
				1073
				1074	*** 7.7 sched_ravg_window
				1075
				1076	Appears at: kernel command line argument
				1077
				1078	Default value: 10000000 (10ms, units of tunable are nanoseconds)
				1079
				1080	This specifies the duration of each window in window-based load
				1081	tracking. By default each window is 10ms long. This quantity must
				1082	currently be set at boot time on the kernel command line (or the
				1083	default value of 10ms can be used).
				1084
				1085	*** 7.8 RAVG_HIST_SIZE
				1086
				1087	Appears at: compile time only (see RAVG_HIST_SIZE in include/linux/sched.h)
				1088
				1089	Default value: 5
				1090
				1091	This macro specifies the number of windows the window-based load
				1092	tracking mechanism maintains per task. If default values are used for
				1093	both this and sched_ravg_window then a total of 50ms of task history
				1094	would be maintained in 5 10ms windows.
				1095
				1096	*** 7.9 sched_freq_inc_notify
				1097
				1098	Appears at: /proc/sys/kernel/sched_freq_inc_notify
				1099
				1100	Default value: 10 * 1024 * 1024 (10 Ghz)
				1101
				1102	When scheduler detects that cur_freq of a cluster is insufficient to meet
				1103	demand, it sends notification to governor, provided (freq_required - cur_freq)
				1104	exceeds sched_freq_inc_notify, where freq_required is the frequency calculated
				1105	by scheduler to meet current task demand. Note that sched_freq_inc_notify is
				1106	specified in kHz units.
				1107
				1108	*** 7.10 sched_freq_dec_notify
				1109
				1110	Appears at: /proc/sys/kernel/sched_freq_dec_notify
				1111
				1112	Default value: 10 * 1024 * 1024 (10 Ghz)
				1113
				1114	When scheduler detects that cur_freq of a cluster is far greater than what is
				1115	needed to serve current task demand, it will send notification to governor.
				1116	More specifically, notification is sent when (cur_freq - freq_required)
				1117	exceeds sched_freq_dec_notify, where freq_required is the frequency calculated
				1118	by scheduler to meet current task demand. Note that sched_freq_dec_notify is
				1119	specified in kHz units.
				1120
				1121	*** 7.11 sched_cpu_high_irqload
				1122
				1123	Appears at: /proc/sys/kernel/sched_cpu_high_irqload
				1124
				1125	Default value: 10000000 (10ms)
				1126
				1127	The scheduler keeps a decaying average of the amount of irq and softirq activity
				1128	seen on each CPU within a ten millisecond window. Note that this "irqload"
				1129	(reported in the sched_cpu_load_* tracepoint) will be higher than the typical load
				1130	in a single window since every time the window rolls over, the value is decayed
				1131	by some fraction and then added to the irq/softirq time spent in the next
				1132	window.
				1133
				1134	When the irqload on a CPU exceeds the value of this tunable, the CPU is no
				1135	longer eligible for placement. This will affect the task placement logic
				1136	described above, causing the scheduler to try and steer tasks away from
				1137	the CPU.
				1138
				1139	*** 7.12 cpu.upmigrate_discourage
				1140
				1141	Default value : 0
				1142
				1143	This is a cgroup attribute supported by the cpu resource controller. It normally
				1144	appears at [root_cpu]/[name1]/../[name2]/cpu.upmigrate_discourage. Here
				1145	"root_cpu" is the mount point for cgroup (cpu resource control) filesystem
				1146	and name1, name2 etc are names of cgroups that form a hierarchy.
				1147
				1148	Setting this flag to 1 discourages upmigration for all tasks of a cgroup. High
				1149	demand tasks of such a cgroup will never be classified as big tasks and hence
				1150	not upmigrated. Any task of the cgroup is allowed to upmigrate only under
				1151	overcommitted scenario. See notes on sched_spill_nr_run and sched_spill_load for
				1152	how overcommitment threshold is defined.
				1153
				1154	*** 7.13 sched_static_cpu_pwr_cost
				1155
				1156	Default value: 0
				1157
				1158	Appears at /sys/devices/system/cpu/cpu<x>/sched_static_cpu_pwr_cost
				1159
				1160	This is the power cost associated with bringing an idle CPU out of low power
				1161	mode. It ignores the actual C-state that a CPU may be in and assumes the
				1162	worst case power cost of the highest C-state. It is means of biasing task
				1163	placement away from idle CPUs when necessary. It can be defined per CPU,
				1164	however, a more appropriate usage to define the same value for every CPU
				1165	within a cluster and possibly have differing value between clusters as
				1166	needed.
				1167
				1168
				1169	*** 7.14 sched_static_cluster_pwr_cost
				1170
				1171	Default value: 0
				1172
				1173	Appears at /sys/devices/system/cpu/cpu<x>/sched_static_cluster_pwr_cost
				1174
				1175	This is the power cost associated with bringing an idle cluster out of low
				1176	power mode. It ignores the actual D-state that a cluster may be in and assumes
				1177	the worst case power cost of the highest D-state. It is means of biasing task
				1178	placement away from idle clusters when necessary.
				1179
				1180	*** 7.15 sched_restrict_cluster_spill
				1181
				1182	Default value: 0
				1183
				1184	Appears at /proc/sys/kernel/sched_restrict_cluster_spill
				1185
				1186	This tunable can be used to restrict tasks spilling to the higher capacity
				1187	(higher power) cluster. When this tunable is enabled,
				1188
				1189	- Restrict the higher capacity cluster pulling tasks from the lower capacity
				1190	cluster in the load balance path. The restriction is lifted if all of the CPUS
				1191	in the lower capacity cluster are above spill. The power cost is used to break
				1192	the ties if the capacity of clusters are same for applying this restriction.
				1193
				1194	- The current CPU selection algorithm for RT tasks looks for the least loaded
				1195	CPU across all clusters. When this tunable is enabled, the RT tasks are
				1196	restricted to the lowest possible power cluster.
				1197
				1198
				1199	*** 7.16 sched_downmigrate
				1200
				1201	Appears at: /proc/sys/kernel/sched_downmigrate
				1202
				1203	Default value: 60
				1204
				1205	This tunable is a percentage. It exists to control hysteresis. Lets say a task
				1206	migrated to a high-performance cpu when it crossed 80% demand on a
				1207	power-efficient cpu. We don't let it come back to a power-efficient cpu until
				1208	its demand in reference to the power-efficient cpu drops less than 60%
				1209	(sched_downmigrate).
				1210
				1211
				1212	*** 7.17 sched_small_wakee_task_load
				1213
				1214	Appears at: /proc/sys/kernel/sched_small_wakee_task_load
				1215
				1216	Default value: 10
				1217
				1218	This tunable is a percentage. Configure the maximum demand of small wakee task.
				1219	Sync wakee tasks which have demand less than sched_small_wakee_task_load are
				1220	categorized as small wakee tasks. Scheduler places small wakee tasks on the
				1221	waker's cluster.
				1222
				1223
				1224	*** 7.18 sched_big_waker_task_load
				1225
				1226	Appears at: /proc/sys/kernel/sched_big_waker_task_load
				1227
				1228	Default value: 25
				1229
				1230	This tunable is a percentage. Configure the minimum demand of big sync waker
				1231	task. Scheduler places small wakee tasks woken up by big sync waker on the
				1232	waker's cluster.
				1233
Pavankumar Kondeti	72b49a3	2016-09-06 11:59:28 +0530	[diff] [blame]	1234	*** 7.19 sched_prefer_sync_wakee_to_waker
				1235
				1236	Appears at: /proc/sys/kernel/sched_prefer_sync_wakee_to_waker
				1237
				1238	Default value: 0
				1239
				1240	The default sync wakee policy has a preference to select an idle CPU in the
				1241	waker cluster compared to the waker CPU running only 1 task. By selecting
				1242	an idle CPU, it eliminates the chance of waker migrating to a different CPU
				1243	after the wakee preempts it. This policy is also not susceptible to the
				1244	incorrect "sync" usage i.e the waker does not goto sleep after waking up
				1245	the wakee.
				1246
				1247	However LPM exit latency associated with an idle CPU outweigh the above
				1248	benefits on some targets. When this knob is turned on, the waker CPU is
				1249	selected if it has only 1 runnable task.
				1250
Syed Rameez Mustafa	dddcab7	2016-09-07 16:18:27 -0700	[diff] [blame]	1251	=========================
				1252	8. HMP SCHEDULER TRACE POINTS
				1253	=========================
				1254
				1255	*** 8.1 sched_enq_deq_task
				1256
				1257	Logged when a task is either enqueued or dequeued on a CPU's run queue.
				1258
				1259	<idle>-0 [004] d.h4 12700.711665: sched_enq_deq_task: cpu=4 enqueue comm=powertop pid=13227 prio=120 nr_running=1 cpu_load=0 rt_nr_running=0 affine=ff demand=13364423
				1260
				1261	- cpu: the CPU that the task is being enqueued on to or dequeued off of
				1262	- enqueue/dequeue: whether this was an enqueue or dequeue event
				1263	- comm: name of task
				1264	- pid: PID of task
				1265	- prio: priority of task
				1266	- nr_running: number of runnable tasks on this CPU
				1267	- cpu_load: current priority-weighted load on the CPU (note, this is not
				1268	the same as CPU utilization or a metric tracked by PELT/window-based tracking)
				1269	- rt_nr_running: number of real-time processes running on this CPU
				1270	- affine: CPU affinity mask in hex for this task (so ff is a task eligible to
				1271	run on CPUs 0-7)
				1272	- demand: window-based task demand computed based on selected policy (recent,
				1273	max, or average) (ns)
				1274
				1275	*** 8.2 sched_task_load
				1276
				1277	Logged when selecting the best CPU to run the task (select_best_cpu()).
				1278
				1279	sched_task_load: 4004 (adbd): demand=698425 boost=0 reason=0 sync=0 need_idle=0 best_cpu=0 latency=103177
				1280
				1281	- demand: window-based task demand computed based on selected policy (recent,
				1282	max, or average) (ns)
				1283	- boost: whether boost is in effect
				1284	- reason: reason we are picking a new CPU:
				1285	0: no migration - selecting a CPU for a wakeup or new task wakeup
				1286	1: move to big CPU (migration)
				1287	2: move to little CPU (migration)
				1288	3: move to low irq load CPU (migration)
				1289	- sync: is the nature synchronous in nature
				1290	- need_idle: is an idle CPU required for this task based on PF_WAKE_UP_IDLE
				1291	- best_cpu: The CPU selected by the select_best_cpu() function for placement
				1292	- latency: The execution time of the function select_best_cpu()
				1293
				1294	*** 8.3 sched_cpu_load_*
				1295
				1296	Logged when selecting the best CPU to run a task (select_best_cpu() for fair
				1297	class tasks, find_lowest_rq_hmp() for RT tasks) and load balancing
				1298	(update_sg_lb_stats()).
				1299
				1300	<idle>-0 [004] d.h3 12700.711541: sched_cpu_load_*: cpu 0 idle 1 nr_run 0 nr_big 0 lsf 1119 capacity 1024 cr_avg 0 irqload 3301121 fcur 729600 fmax 1459200 power_cost 5 cstate 2 temp 38
				1301
				1302	- cpu: the CPU being described
				1303	- idle: boolean indicating whether the CPU is idle
				1304	- nr_run: number of tasks running on CPU
				1305	- nr_big: number of BIG tasks running on CPU
				1306	- lsf: load scale factor - multiply normalized load by this factor to determine
				1307	how much load task will exert on CPU
				1308	- capacity: capacity of CPU (based on max possible frequency and efficiency)
				1309	- cr_avg: cumulative runnable average, instantaneous sum of the demand (either
				1310	PELT or window-based) of all the runnable task on a CPU (ns)
				1311	- irqload: decaying average of irq activity on CPU (ns)
				1312	- fcur: current CPU frequency (Khz)
				1313	- fmax: max CPU frequency (but not maximum _possible_ frequency) (KHz)
				1314	- power_cost: cost of running this CPU at the current frequency
				1315	- cstate: current cstate of CPU
				1316	- temp: current temperature of the CPU
				1317
				1318	The power_cost value above differs in how it is calculated depending on the
				1319	callsite of this tracepoint. The select_best_cpu() call to this tracepoint
				1320	finds the minimum frequency required to satisfy the existing load on the CPU
				1321	as well as the task being placed, and returns the power cost of that frequency.
				1322	The load balance and real time task placement paths used a fixed frequency
				1323	(highest frequency common to all CPUs for load balancing, minimum
				1324	frequency of the CPU for real time task placement).
				1325
				1326	*** 8.4 sched_update_task_ravg
				1327
				1328	Logged when window-based stats are updated for a task. The update may happen
				1329	for a variety of reasons, see section 2.5, "Task Events."
				1330
				1331	<idle>-0 [004] d.h4 12700.711513: sched_update_task_ravg: wc 12700711473496 ws 12700691772135 delta 19701361 event TASK_WAKE cpu 4 cur_freq 199200 cur_pid 0 task 13227 (powertop) ms 12640648272532 delta 60063200964 demand 13364423 sum 0 irqtime 0 cs 0 ps 495018 cur_window 0 prev_window 0
				1332
				1333	- wc: wallclock, output of sched_clock(), monotonically increasing time since
				1334	boot (will roll over in 585 years) (ns)
				1335	- ws: window start, time when the current window started (ns)
				1336	- delta: time since the window started (wc - ws) (ns)
				1337	- event: What event caused this trace event to occur (see section 2.5 for more
				1338	details)
				1339	- cpu: which CPU the task is running on
				1340	- cur_freq: CPU's current frequency in KHz
				1341	- curr_pid: PID of the current running task (current)
				1342	- task: PID and name of task being updated
				1343	- ms: mark start - timestamp of the beginning of a segment of task activity,
				1344	either sleeping or runnable/running (ns)
				1345	- delta: time since last event within the window (wc - ms) (ns)
				1346	- demand: task demand computed based on selected policy (recent, max, or
				1347	average) (ns)
				1348	- sum: the task's run time during current window scaled by frequency and
				1349	efficiency (ns)
				1350	- irqtime: length of interrupt activity (ns). A non-zero irqtime is seen
				1351	when an idle cpu handles interrupts, the time for which needs to be
				1352	accounted as cpu busy time
				1353	- cs: curr_runnable_sum of cpu (ns). See section 6.1 for more details of this
				1354	counter.
				1355	- ps: prev_runnable_sum of cpu (ns). See section 6.1 for more details of this
				1356	counter.
				1357	- cur_window: cpu demand of task in its most recently tracked window (ns)
				1358	- prev_window: cpu demand of task in the window prior to the one being tracked
				1359	by cur_window
				1360
				1361	*** 8.5 sched_update_history
				1362
				1363	Logged when update_task_ravg() is accounting task activity into one or
				1364	more windows that have completed. This may occur more than once for a
				1365	single call into update_task_ravg(). A task that ran for 24ms spanning
				1366	four 10ms windows (the last 2ms of window 1, all of windows 2 and 3,
				1367	and the first 2ms of window 4) would result in two calls into
				1368	update_history() from update_task_ravg(). The first call would record activity
				1369	in completed window 1 and second call would record activity for windows 2 and 3
				1370	together (samples will be 2 in second call).
				1371
				1372	<idle>-0 [004] d.h4 12700.711489: sched_update_history: 13227 (powertop): runtime 13364423 samples 1 event TASK_WAKE demand 13364423 (hist: 13364423 9871252 2236009 6162476 10282078) cpu 4 nr_big 0
				1373
				1374	- runtime: task cpu demand in recently completed window(s). This value is scaled
				1375	to max_possible_freq and max_possible_efficiency. This value is pushed into
				1376	task's demand history array. The number of windows to which runtime applies is
				1377	provided by samples field.
				1378	- samples: Number of samples (windows), each having value of runtime, that is
				1379	recorded in task's demand history array.
				1380	- event: What event caused this trace event to occur (see section 2.5 for more
				1381	details) - PUT_PREV_TASK, PICK_NEXT_TASK, TASK_WAKE, TASK_MIGRATE,
				1382	TASK_UPDATE
				1383	- demand: task demand computed based on selected policy (recent, max, or
				1384	average) (ns)
				1385	- hist: last 5 windows of history for the task with the most recent window
				1386	listed first
				1387	- cpu: CPU the task is associated with
				1388	- nr_big: number of big tasks on the CPU
				1389
				1390	*** 8.6 sched_reset_all_windows_stats
				1391
				1392	Logged when key parameters controlling window-based statistics collection are
				1393	changed. This event signifies that all window-based statistics for tasks and
				1394	cpus are being reset. Changes to below attributes result in such a reset:
				1395
				1396	* sched_ravg_window (See Sec 2)
				1397	* sched_window_stats_policy (See Sec 2.4)
				1398	* sched_ravg_hist_size (See Sec 7.11)
				1399
				1400	<task>-0 [004] d.h4 12700.711489: sched_reset_all_windows_stats: time_taken 1123 window_start 0 window_size 0 reason POLICY_CHANGE old_val 0 new_val 1
				1401
				1402	- time_taken: time taken for the reset function to complete (ns)
				1403	- window_start: Beginning of first window following change to window size (ns)
				1404	- window_size: Size of window. Non-zero if window-size is changing (in ticks)
				1405	- reason: Reason for reset of statistics.
				1406	- old_val: Old value of variable, change of which is triggering reset
				1407	- new_val: New value of variable, change of which is triggering reset
				1408
				1409	*** 8.7 sched_migration_update_sum
				1410
				1411	Logged when a task is migrating to another cpu.
				1412
				1413	<task>-0 [000] d..8 5020.404137: sched_migration_update_sum: cpu 0: cs 471278 ps 902463 nt_cs 0 nt_ps 0 pid 2645
				1414
				1415	- cpu: cpu, away from which or to which, task is migrating
				1416	- cs: curr_runnable_sum of cpu (ns). See Sec 6.1 for more details of this
				1417	counter.
				1418	- ps: prev_runnable_sum of cpu (ns). See Sec 6.1 for more details of this
				1419	counter.
				1420	- nt_cs: nt_curr_runnable_sum of cpu (ns). See Sec 6.1 for more details of
				1421	this counter.
				1422	- nt_ps: nt_prev_runnable_sum of cpu (ns). See Sec 6.1 for more details of
				1423	this counter
				1424	- pid: PID of migrating task
				1425
				1426	*** 8.8 sched_get_busy
				1427
				1428	Logged when scheduler is returning busy time statistics for a cpu.
				1429
				1430	<...>-4331 [003] d.s3 313.700108: sched_get_busy: cpu 3 load 19076 new_task_load 0 early 0
				1431
				1432
				1433	- cpu: cpu, for which busy time statistic (prev_runnable_sum) is being
				1434	returned (ns)
				1435	- load: corresponds to prev_runnable_sum (ns), scaled to fmax of cpu
				1436	- new_task_load: corresponds to nt_prev_runnable_sum to fmax of cpu
				1437	- early: A flag indicating whether the scheduler is passing regular load or early detection load
				1438	0 - regular load
				1439	1 - early detection load
				1440
				1441	*** 8.9 sched_freq_alert
				1442
				1443	Logged when scheduler is alerting cpufreq governor about need to change
				1444	frequency
				1445
				1446	<task>-0 [004] d.h4 12700.711489: sched_freq_alert: cpu 0 old_load=XXX new_load=YYY
				1447
				1448	- cpu: cpu in cluster that has highest load (prev_runnable_sum)
				1449	- old_load: cpu busy time last reported to governor. This is load scaled in
				1450	reference to max_possible_freq and max_possible_efficiency.
				1451	- new_load: recent cpu busy time. This is load scaled in
				1452	reference to max_possible_freq and max_possible_efficiency.
				1453
				1454	*** 8.10 sched_set_boost
				1455
				1456	Logged when boost settings are being changed
				1457
				1458	<task>-0 [004] d.h4 12700.711489: sched_set_boost: ref_count=1
				1459
				1460	- ref_count: A non-zero value indicates boost is in effect
Pavankumar Kondeti	8de9ac6	2016-10-01 11:06:54 +0530	[diff] [blame]	1461
				1462	========================
				1463	9. Device Tree bindings
				1464	========================
				1465
				1466	The device tree bindings for the HMP scheduler are defined in
				1467	Documentation/devicetree/bindings/sched/sched_hmp.txt