CONTENTS

1. Introduction
1.1 Heterogeneous Systems
1.2 CPU Frequency Guidance
2. Window-Based Load Tracking Scheme
2.1 Synchronized Windows
2.2 struct ravg
2.3 Scaling Load Statistics
2.4 sched_window_stats_policy
2.5 Task Events
2.6 update_task_ravg()
2.7 update_history()
2.8 Per-task 'initial task load'
3. CPU Capacity
3.1 Load scale factor
3.2 CPU Power
4. CPU Power
5. HMP Scheduler
5.1 Classification of Tasks and CPUs
5.2 select_best_cpu()
5.2.1 sched_boost
5.2.2 task_will_fit()
5.2.3 Tunables affecting select_best_cpu()
5.2.4 Wakeup Logic
5.3 Scheduler Tick
5.4 Load Balancer
5.5 Real Time Tasks
5.6 Task packing
6. Frequency Guidance
6.1 Per-CPU Window-Based Stats
6.2 Per-task Window-Based Stats
6.3 Effect of various task events
7. Tunables
8. HMP Scheduler Trace Points
8.1 sched_enq_deq_task
8.2 sched_task_load
8.3 sched_cpu_load_*
8.4 sched_update_task_ravg
8.5 sched_update_history
8.6 sched_reset_all_windows_stats
8.7 sched_migration_update_sum
8.8 sched_get_busy
8.9 sched_freq_alert
8.10 sched_set_boost
9. Device Tree bindings

===============
1. INTRODUCTION
===============

The scheduler extensions described in this document serve two goals:

1) handle heterogeneous multi-processor (HMP) systems
2) guide the cpufreq governor on proactive changes to cpu frequency

*** 1.1 Heterogeneous systems

Heterogeneous systems have cpus that differ in their performance and power
characteristics. Some cpus can offer higher peak performance than others,
albeit at the cost of consuming more power. We shall refer to such cpus as
"high performance" or "performance efficient" cpus. The other cpus, which
offer lower peak performance, are referred to as "power efficient".

In this situation the scheduler is tasked with the responsibility of assigning
tasks to run on the right cpus, where their performance requirements can be
met at the least expense of power.

Achieving that goal is complicated by the fact that the scheduler has little
clue about the performance requirements of tasks and how those requirements
may change when the tasks run on power-efficient versus performance-efficient
cpus. One simplifying assumption is that a task's desire for more performance
is expressed by its cpu utilization: a task demanding high cpu utilization on
a power-efficient cpu would likely improve its performance by running on a
performance-efficient cpu. This idea forms the basis for the HMP-related
scheduler extensions.

Key inputs required by the HMP scheduler for its task placement decisions are:

a) task load - this reflects the cpu utilization or demand of tasks
b) CPU capacity - this reflects the peak performance offered by cpus
c) CPU power - this reflects the power or energy cost of cpus

Once all 3 pieces of information are available, the HMP scheduler can place
tasks on the lowest-power cpus where their demand can be satisfied.

*** 1.2 CPU Frequency guidance

A somewhat separate but related goal of the scheduler extensions described
here is to provide guidance to the cpufreq governor on the need to change cpu
frequency. Most governors that control cpu frequency work on a reactive basis:
CPU utilization is sampled at regular intervals, based on which the need to
change frequency is determined. Higher utilization leads to a frequency
increase and vice-versa. There are several problems with this approach that
the scheduler can help resolve.

a) latency

	The reactive nature introduces latency for cpus to ramp up to the
	desired speed, which can hurt application performance. This is
	inevitable as cpufreq governors can only track cpu utilization as a
	whole and not the tasks which are driving that demand. The scheduler,
	however, can keep track of individual task demand and can alert the
	governor to changing task activity. For example, it can request a
	frequency increase when task activity on a cpu is increasing because
	of wakeup or migration, or request that frequency be lowered when
	task activity is decreasing because of sleep/exit or migration.

b) part-picture

	Most governors track the utilization of each CPU independently. When
	a task migrates from one cpu to another, its execution time is split
	across the two cpus. The governor can fail to see the full picture of
	task demand in this case, and thus the need for increasing frequency,
	affecting the task's performance. The scheduler can keep track of
	task migrations, fix up busy time upon migration, and report per-cpu
	busy time to the governor in a way that reflects task demand
	accurately.

The rest of this document explains the key enhancements made to the scheduler
to accomplish both of the aforementioned goals.

====================================
2. WINDOW-BASED LOAD TRACKING SCHEME
====================================

As mentioned in the introduction, knowledge of the CPU demand exerted by a
task is a prerequisite to knowing where to best place the task in an HMP
system. The per-entity load tracking (PELT) scheme, present in the Linux
kernel since v3.7, has some perceived shortcomings when used to place tasks
on HMP systems or to provide recommendations on CPU frequency.

Per-entity load tracking does not make a distinction between the ramp-up and
ramp-down time of task load. It also decays task load without exception when
a task sleeps. As an example, a cpu-bound task at its peak load (LOAD_AVG_MAX,
or 47742) can see its load decay to 0 after a sleep of just 213ms! A cpu-bound
task running on a performance-efficient cpu could thus get re-classified as
not requiring such a cpu after a short sleep. In the case of mobile workloads,
tasks could go to sleep due to a lack of user input. When they wake up, it is
very likely that their cpu utilization pattern repeats. Resetting their load
across sleep, and incurring latency to re-classify them as requiring a
high-performance cpu, can hurt application performance.

The window-based load tracking scheme described in this document avoids these
drawbacks. It keeps track of N windows of execution for every task. Windows
where a task had no activity are ignored and not recorded. N can be tuned at
compile time (RAVG_HIST_SIZE, defined in include/linux/sched.h) or at runtime
(/proc/sys/kernel/sched_ravg_hist_size). The window size, W, is common to all
tasks and currently defaults to 10ms ('sched_ravg_window', defined in
kernel/sched/core.c). The window size can be tuned at boot time via the
sched_ravg_window=W argument to the kernel. Alternatively, it can be tuned
after boot via tunables provided by the interactive governor. More on this
later.

Based on the N samples available per task, a per-task "demand" attribute is
calculated which represents the cpu demand of that task. The demand attribute
is used to classify tasks as to whether or not they need a
performance-efficient CPU, and it also serves to provide input on frequency
to the cpufreq governor. More on this later. The 'sched_window_stats_policy'
tunable (defined in kernel/sched/core.c) controls how the demand field for a
task is derived from its N past samples.

*** 2.1 Synchronized windows

Windows of observation for task activity are synchronized across cpus. This
greatly aids the scheduler's frequency guidance feature. The scheduler
currently relies on a synchronized clock (sched_clock()) for this feature to
work. It may be possible to extend this feature to work on systems with an
unsynchronized sched_clock().

struct rq {

	..

	u64 window_start;

	..
};

The 'window_start' attribute represents the time when the current window
began on a cpu. It is updated when key task events such as wakeup or
context-switch call update_task_ravg() to record task activity. The
window_start value is expected to be the same for all cpus, although it could
be behind on some cpus where it has not yet been updated because
update_task_ravg() has not been called recently. For example, when a cpu is
idle for a long time its window_start could be stale. The window_start value
for such cpus is rolled forward upon the occurrence of a task event that
results in a call to update_task_ravg().

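The roll-forward described above can be sketched as follows. This is an
illustrative user-space model, not the kernel implementation; the function
name is ours. The key property it demonstrates is that window_start only ever
advances by whole multiples of the common window size, which is what keeps
windows aligned across cpus.

```c
#include <stdint.h>

/*
 * Roll a (possibly stale) window_start forward to the most recent
 * window boundary at or before 'now'. Advancing by whole multiples
 * of window_size preserves cross-cpu window alignment.
 */
static uint64_t roll_window_start(uint64_t window_start, uint64_t now,
				  uint64_t window_size)
{
	uint64_t elapsed;

	if (now < window_start + window_size)
		return window_start;	/* still inside the current window */

	elapsed = now - window_start;
	/* advance by the number of whole windows that have elapsed */
	return window_start + (elapsed / window_size) * window_size;
}
```

For example, with a window size of 10, a cpu whose window_start is stale at 0
while the synchronized clock reads 25 would roll forward to 20, the same
boundary every other cpu is using.
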
*** 2.2 struct ravg

The ravg struct contains the information tracked per task.

struct ravg {
	u64 mark_start;
	u32 sum, demand;
	u32 sum_history[RAVG_HIST_SIZE];
	u32 curr_window, prev_window;
};

struct task_struct {

	..

	struct ravg ravg;

	..
};

sum_history[] - stores cpu utilization samples from the N previous windows
		where the task had activity

sum	      - stores the cpu utilization of the task in its most recently
		tracked window. Once the corresponding window terminates,
		'sum' will be pushed into the sum_history[] array and is then
		reset to 0. It is possible that the window corresponding to
		sum is not the current window being tracked on a cpu. For
		example, a task could go to sleep in window X and wake up in
		window Y (Y > X). In this case, sum would correspond to the
		task's activity seen in window X. When update_task_ravg() is
		called during the task's wakeup event it will be seen that
		window X has elapsed. The sum value will be pushed to the
		'sum_history[]' array before being reset to 0.

demand	      - represents the task's cpu demand and is derived from the
		elements in sum_history[]. The section on
		'sched_window_stats_policy' provides more details on how
		'demand' is derived from the elements in the sum_history[]
		array

mark_start    - records the timestamp of the beginning of the most recent
		task event. See the section on 'Task events' for the possible
		events that update 'mark_start'

curr_window   - this is described in the section on 'Frequency guidance'

prev_window   - this is described in the section on 'Frequency guidance'


*** 2.3 Scaling load statistics

The time required for a task to complete its work (and hence its load)
depends, among various other factors, on cpu frequency and cpu efficiency. In
an HMP system, some cpus are more performance efficient than others. The
performance efficiency of a cpu can be described by its
"instructions-per-cycle" (IPC) attribute. A task's execution history could
involve having run at different frequencies and on cpus with different IPC
attributes. To avoid ambiguity about how task load relates to the frequency
and IPC of the cpus on which a task has run, task load is captured in a
scaled form, with scaling done in reference to an "ideal" cpu that has the
best possible IPC and frequency. Such an "ideal" cpu, having the best
possible frequency and IPC, may or may not exist in the system.

As an example, consider an HMP system with two types of cpus, A53 and A57.
The A53 has an IPC count of 1024 and can run at a maximum frequency of 1 GHz,
while the A57 has an IPC count of 2048 and can run at a maximum frequency of
2 GHz. The ideal cpu in this case is an A57 running at 2 GHz.

A unit of work that takes 100ms to finish on an A53 running at 100 MHz would
get done in 10ms on an A53 running at 1 GHz, in 5ms on an A57 running at
1 GHz, and in 2.5ms on an A57 running at 2 GHz. Thus a load of 100ms can be
expressed as 2.5ms in reference to the ideal cpu of an A57 running at 2 GHz.

In order to understand how much load a task will present on a given cpu, its
scaled load needs to be multiplied by a factor (the load scale factor). In
the above example, the scaled load of 2.5ms needs to be multiplied by a
factor of 4 in order to estimate the load of the task on an A53 running at
1 GHz.

/proc/sched_debug provides the IPC attribute and load scale factor for every
cpu.

In summary, the task load information stored in a task's sum_history[] array
is scaled for both frequency and efficiency. If a task runs for X ms, then
the value stored in its 'sum' field is derived as:

	X_s = X * (f_cur / max_possible_freq) *
		  (efficiency / max_possible_efficiency)

where:

	X			= cpu utilization that needs to be accounted
	X_s			= scaled derivative of X
	f_cur			= current frequency of the cpu where the
				  task was running
	max_possible_freq	= maximum possible frequency (across all cpus)
	efficiency		= instructions per cycle (IPC) of the cpu
				  where the task was running
	max_possible_efficiency	= maximum IPC offered by any cpu in the
				  system


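As a concrete illustration, the two ratios of the formula above can be
applied with integer arithmetic as sketched below. This is a user-space
sketch, not kernel code; the function name is illustrative, frequencies are
assumed to be in kHz, and execution time in nanoseconds.

```c
#include <stdint.h>

/*
 * Scale raw execution time to the "ideal" cpu. Both ratios are <= 1,
 * so the scaled time never exceeds the raw execution time.
 */
static uint64_t scale_exec_time(uint64_t exec_ns, uint32_t f_cur,
				uint32_t max_possible_freq,
				uint32_t efficiency,
				uint32_t max_possible_efficiency)
{
	exec_ns = exec_ns * f_cur / max_possible_freq;		/* frequency scaling */
	exec_ns = exec_ns * efficiency / max_possible_efficiency; /* IPC scaling */
	return exec_ns;
}
```

Feeding in the Sec 2.3 example, 100ms of execution on an A53 at 100 MHz
(f_cur = 100000 kHz, max_possible_freq = 2000000 kHz, IPC 1024 versus an
ideal IPC of 2048) scales to 2.5ms, matching the worked example.
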
*** 2.4 sched_window_stats_policy

sched_window_stats_policy controls how the 'demand' attribute for a task is
derived from the elements in its 'sum_history[]' array.

WINDOW_STATS_RECENT (0)
	demand = recent

WINDOW_STATS_MAX (1)
	demand = max

WINDOW_STATS_MAX_RECENT_AVG (2)
	demand = maximum(average, recent)

WINDOW_STATS_AVG (3)
	demand = average

where:
	M	= history size specified by
		  /proc/sys/kernel/sched_ravg_hist_size
	average	= average of the first M samples found in the sum_history[]
		  array
	max	= maximum value of the first M samples found in the
		  sum_history[] array
	recent	= most recent sample (sum_history[0])
	demand	= demand attribute found in 'struct ravg'

This policy can be changed at runtime via
/proc/sys/kernel/sched_window_stats_policy. For example, the command below
would select the WINDOW_STATS_MAX policy:

echo 1 > /proc/sys/kernel/sched_window_stats_policy

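The four policies can be summarized in code. The sketch below is a
stand-alone model (the function name is ours; the kernel computes these same
quantities inside update_history()):

```c
#include <stdint.h>

#define WINDOW_STATS_RECENT		0
#define WINDOW_STATS_MAX		1
#define WINDOW_STATS_MAX_RECENT_AVG	2
#define WINDOW_STATS_AVG		3

/* Derive 'demand' from the first m samples of sum_history[] (m >= 1). */
static uint32_t derive_demand(const uint32_t *hist, int m, int policy)
{
	uint64_t sum = 0;
	uint32_t max = 0, avg, recent = hist[0];
	int i;

	for (i = 0; i < m; i++) {
		sum += hist[i];
		if (hist[i] > max)
			max = hist[i];
	}
	avg = (uint32_t)(sum / m);

	switch (policy) {
	case WINDOW_STATS_RECENT:
		return recent;
	case WINDOW_STATS_MAX:
		return max;
	case WINDOW_STATS_AVG:
		return avg;
	case WINDOW_STATS_MAX_RECENT_AVG:
	default:
		return avg > recent ? avg : recent;
	}
}
```

For a history of {10, 30, 20} (newest first) the four policies yield 10, 30,
20 and 20 respectively, showing how WINDOW_STATS_MAX_RECENT_AVG guards
against a single unusually quiet recent window.
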
*** 2.5 Task events

A number of events result in the window-based stats of a task being updated.
These are:

PICK_NEXT_TASK	- the task is about to start running on a cpu
PUT_PREV_TASK	- the task stopped running on a cpu
TASK_WAKE	- the task is waking from sleep
TASK_MIGRATE	- the task is migrating from one cpu to another
TASK_UPDATE	- this event is invoked on a currently running task to
		  update the task's window-stats and also the cpu's
		  window-stats such as 'window_start'
IRQ_UPDATE	- event to record the busy time spent by an idle cpu
		  processing interrupts

*** 2.6 update_task_ravg()

update_task_ravg() is called to mark the beginning of an event for a task or
a cpu. It serves to accomplish these functions:

a. Update a cpu's window_start value
b. Update a task's window-stats (sum, sum_history[], demand and mark_start)

In addition, update_task_ravg() updates the busy time information for the
given cpu, which is used for frequency guidance. This is described further in
Section 6.

*** 2.7 update_history()

update_history() is called on a task to record its activity in an elapsed
window. 'sum', which represents the task's cpu demand in the elapsed window,
is pushed onto the sum_history[] array, and the task's 'demand' attribute is
updated based on the sched_window_stats_policy in effect.

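The push-and-recompute step can be modelled as below. This is an illustrative
sketch only: the struct is a cut-down stand-in for 'struct ravg', the
function name is ours, and WINDOW_STATS_MAX is hard-coded for brevity where
the kernel consults sched_window_stats_policy.

```c
#include <stdint.h>
#include <string.h>

#define RAVG_HIST_SIZE 5

struct ravg_sketch {
	uint32_t sum, demand;
	uint32_t sum_history[RAVG_HIST_SIZE];	/* newest sample at index 0 */
};

static void update_history_sketch(struct ravg_sketch *r)
{
	uint32_t max = 0;
	int i;

	/* shift older samples down; the elapsed window's sum becomes newest */
	memmove(&r->sum_history[1], &r->sum_history[0],
		(RAVG_HIST_SIZE - 1) * sizeof(uint32_t));
	r->sum_history[0] = r->sum;
	r->sum = 0;

	/* recompute demand; WINDOW_STATS_MAX stand-in for the policy */
	for (i = 0; i < RAVG_HIST_SIZE; i++)
		if (r->sum_history[i] > max)
			max = r->sum_history[i];
	r->demand = max;
}
```

Note how 'demand' survives a quiet window: after samples of 40 then 10, the
max policy keeps demand at 40, which is exactly the sleep-resilience this
scheme was designed for.
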
*** 2.8 Initial task load attribute for a task (init_load_pct)

In some cases it may be desirable for the children of a task to be assigned a
"high" load so that they can start running on the best-capacity cluster. By
default, newly created tasks are assigned a load defined by the tunable
sched_init_task_load (Sec 7.8). Some specialized tasks may need a higher
value than the global default for their child tasks, so that the children run
on the cpus with the best capacity. This is accomplished by setting the
'initial task load' attribute (init_load_pct) for a task. A child task's
starting load (ravg.demand and ravg.sum_history[]) is initialized from its
parent's 'initial task load' attribute. Note that the child task's own
'initial task load' attribute will be 0 by default (i.e. it is not inherited
from the parent).

A task's 'initial task load' attribute can be set in two ways:

**** /proc interface

/proc/[pid]/sched_init_task_load can be written to in order to set a task's
'initial task load' attribute. A numeric value between 0 and 100 (percent) is
accepted.

Reading /proc/[pid]/sched_init_task_load returns the 'initial task load'
attribute for the given task.

**** kernel API

The following kernel APIs are provided to set or retrieve a given task's
'initial task load' attribute:

int sched_set_init_task_load(struct task_struct *p, int init_load_pct);
int sched_get_init_task_load(struct task_struct *p);


===============
3. CPU CAPACITY
===============

CPU capacity reflects the peak performance offered by a cpu. It is determined
both by the maximum frequency at which the cpu can run and by its efficiency
attribute. Capacity is defined in reference to the "least" performing cpu,
such that the "least" performing cpu has a capacity of 1024.

	capacity = 1024 * (fmax_cur / min_max_freq) *
			  (efficiency / min_possible_efficiency)

where:

	fmax_cur		= maximum frequency at which the cpu is
				  currently allowed to run
	efficiency		= IPC of the cpu
	min_max_freq		= maximum frequency at which the "least"
				  performing cpu can run
	min_possible_efficiency	= IPC of the "least" performing cpu

'fmax_cur' reflects the fact that a cpu may be constrained at runtime to run
at a maximum frequency less than what it supports. Such a constraint may be
placed by the user, or by drivers such as thermal that intend to reduce the
temperature of a cpu by restricting its maximum frequency.

'max_possible_capacity' reflects the maximum capacity of a cpu based on the
maximum frequency it supports.

	max_possible_capacity = 1024 * (fmax / min_max_freq) *
				       (efficiency / min_possible_efficiency)

where:
	fmax = maximum frequency supported by the cpu

/proc/sched_debug lists the capacity and maximum_capacity information for a
cpu.

In the example HMP system quoted in Sec 2.3, the "least" performing CPU is
the A53, and thus min_max_freq = 1 GHz and min_possible_efficiency = 1024.

Capacity of A57 = 1024 * (2 GHz / 1 GHz) * (2048 / 1024) = 4096
Capacity of A53 = 1024 * (1 GHz / 1 GHz) * (1024 / 1024) = 1024

The capacity of an A57 constrained to run at a maximum frequency of 500 MHz
can be calculated as:

Capacity of A57 = 1024 * (500 MHz / 1 GHz) * (2048 / 1024) = 1024

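The capacity formula can be checked numerically with a small sketch (the
function name is ours; frequencies are assumed to be in kHz):

```c
#include <stdint.h>

/* Capacity relative to the "least" performing cpu (which scores 1024). */
static uint32_t cpu_capacity(uint32_t fmax_cur, uint32_t min_max_freq,
			     uint32_t efficiency,
			     uint32_t min_possible_efficiency)
{
	/* 64-bit intermediate avoids overflow of 1024 * fmax_cur */
	uint64_t cap = 1024ULL * fmax_cur / min_max_freq;

	return (uint32_t)(cap * efficiency / min_possible_efficiency);
}
```

This reproduces the three worked examples above: the A57 scores 4096, the A53
scores 1024, and an A57 capped at 500 MHz drops to 1024, i.e. the same
capacity as an unconstrained A53.
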
*** 3.1 load_scale_factor

The 'lsf' or load scale factor attribute of a cpu is used to estimate the
load of a task on that cpu when it runs at its fmax_cur frequency. 'lsf' is
defined in reference to the "best" performing cpu, such that the "best"
performing cpu's lsf is 1024. The 'lsf' for a cpu is defined as:

	lsf = 1024 * (max_possible_freq / fmax_cur) *
		     (max_possible_efficiency / ipc)

where:
	fmax_cur		= maximum frequency at which the cpu is
				  currently allowed to run
	ipc			= IPC of the cpu
	max_possible_freq	= maximum frequency at which the "best"
				  performing cpu can run
	max_possible_efficiency	= IPC of the "best" performing cpu

In the example HMP system quoted in Sec 2.3, the "best" performing CPU is the
A57, and thus max_possible_freq = 2 GHz and max_possible_efficiency = 2048.

lsf of A57 = 1024 * (2 GHz / 2 GHz) * (2048 / 2048) = 1024
lsf of A53 = 1024 * (2 GHz / 1 GHz) * (2048 / 1024) = 4096

The lsf of an A57 constrained to run at a maximum frequency of 500 MHz can be
calculated as:

lsf of A57 = 1024 * (2 GHz / 500 MHz) * (2048 / 2048) = 4096

To estimate the load of a task on a given cpu running at its fmax_cur:

	load = scaled_load * lsf / 1024

A task with a scaled load of 20% would thus be estimated to consume 80% of
the bandwidth of an A53 running at 1 GHz. The same task would be estimated to
consume 160% of the bandwidth of an A53 constrained to run at a maximum
frequency of 500 MHz.

load_scale_factor is thus very useful for estimating the load of a task on a
given cpu and deciding whether the task can fit on that cpu.

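The lsf formula and the load estimate can likewise be sketched in code (names
are ours; frequencies in kHz, scaled load as a percentage):

```c
#include <stdint.h>

/* lsf relative to the "best" performing cpu (which scores 1024). */
static uint32_t load_scale_factor(uint32_t max_possible_freq,
				  uint32_t fmax_cur,
				  uint32_t max_possible_efficiency,
				  uint32_t ipc)
{
	uint64_t lsf = 1024ULL * max_possible_freq / fmax_cur;

	return (uint32_t)(lsf * max_possible_efficiency / ipc);
}

/* estimated load (in percent) of a task with the given scaled load */
static uint32_t estimated_load_pct(uint32_t scaled_load_pct, uint32_t lsf)
{
	return scaled_load_pct * lsf / 1024;
}
```

With the Sec 2.3 system, the A53's lsf is 4096, so a 20% scaled load inflates
to an estimated 80% of an A53 at 1 GHz; against an A53 capped at 500 MHz
(lsf 8192) the same task would need 160%, i.e. it no longer fits there.
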
*** 3.2 cpu_power

A metric, 'cpu_power', related to 'capacity' is also listed in
/proc/sched_debug. 'cpu_power' is ideally the same for all cpus (1024) when
they are idle and running at the same frequency. The 'cpu_power' of a cpu can
be scaled down from its ideal value to reflect the reduced frequency at which
it is operating, and also to reflect the amount of cpu bandwidth consumed by
real-time tasks executing on it. The 'cpu_power' metric is used by the
scheduler to decide the distribution of task load among cpus. CPUs with a low
'cpu_power' will be assigned less task load compared to cpus with a higher
'cpu_power'.

============
4. CPU POWER
============

The HMP scheduler extensions currently depend on an architecture-specific
driver to provide runtime information on cpu power. In the absence of an
architecture-specific driver, the scheduler will resort to using the
max_possible_capacity metric of a cpu as a measure of its power.

================
5. HMP SCHEDULER
================

For normal (SCHED_OTHER/fair class) tasks there are three paths in the
scheduler which these HMP extensions affect. The task wakeup path, the
load balancer, and the scheduler tick are each modified.

Real-time and stop-class tasks are served by different code paths. These will
be discussed separately.

Before delving further into the algorithm and implementation, however, some
definitions are required.

*** 5.1 Classification of Tasks and CPUs

With the extensions described thus far, the following information is
available to the HMP scheduler:

	- per-task CPU demand information from either Per-Entity Load
	  Tracking (PELT) or the window-based algorithm described above

	- a power value for each frequency supported by each CPU via the API
	  described in section 4

	- current CPU frequency, maximum CPU frequency (may be throttled at
	  runtime due to thermal conditions), and the maximum possible CPU
	  frequency supported by the hardware

	- data previously maintained within the scheduler, such as the number
	  of currently runnable tasks on each CPU

Combined with tunable parameters, this information can be used to classify
both tasks and CPUs to aid in the placement of tasks.

- big task

	A big task is one that exerts a CPU demand too high for a particular
	CPU to satisfy. The scheduler will attempt to find a CPU with more
	capacity for such a task.

	The definition of "big" is specific to a task *and* a CPU. A task may
	be considered big on one CPU in the system and not big on another if
	the first CPU has less capacity than the second.

	What task demand is "too high" for a particular CPU? One obvious
	answer would be a task demand which, as measured by PELT or
	window-based load tracking, matches or exceeds the capacity of that
	CPU. A task which runs on a CPU for a long time, for example, might
	meet this criterion as it would report 100% demand of that CPU. It
	may be desirable, however, to classify tasks which use less than 100%
	of a particular CPU as big, so that the task has some "headroom" to
	grow without its CPU bandwidth getting capped and its performance
	requirements going unmet. This task demand is therefore a tunable
	parameter:

	/proc/sys/kernel/sched_upmigrate

	This value is a percentage. If a task consumes more than this much of
	a particular CPU, that CPU will be considered too small for the task.
	The task will thus be seen as a "big" task on that cpu and will be
	reflected in the nr_big_tasks statistic maintained for that cpu. Note
	that certain tasks (those whose nice value exceeds
	SCHED_UPMIGRATE_MIN_NICE or those that belong to a cgroup whose
	upmigrate_discourage flag is set) will never be classified as big
	tasks despite their high demand.

	As the load scale factor is calculated against the current fmax, it
	gets boosted when a lower-capacity CPU is restricted to run at a
	lower fmax. Task demand is inflated in this scenario and the task
	upmigrates early to the maximum-capacity CPU. Hence this threshold is
	auto-adjusted by a factor equal to
	max_possible_frequency/current_frequency of the lower-capacity CPU.
	This adjustment happens only when the lower-capacity CPU's frequency
	is restricted. The same adjustment is applied to the downmigrate
	threshold as well.

	When the frequency restriction is relaxed, the previous values are
	restored. The sched_up_down_migrate_auto_update macro, defined in
	kernel/sched/core.c, controls this auto-adjustment behavior, and it
	is enabled by default.

	If the adjusted upmigrate threshold exceeds the window size, it is
	clipped to the window size. If the adjusted downmigrate threshold
	would decrease the difference between the upmigrate and downmigrate
	thresholds, it is clipped to a value such that the difference between
	the modified thresholds is the same as between the original
	thresholds.

- spill threshold

	Tasks will normally be placed on the lowest power-cost cluster where
	they can fit. This could result in the power-efficient cluster
	becoming overcrowded when there are "too" many low-demand tasks. The
	spill threshold provides a spill-over criterion, wherein low-demand
	tasks are allowed to be placed on idle or busy cpus in the
	high-performance cluster.

	The scheduler will avoid placing a task on a cpu if doing so would
	cause the cpu to exceed its spill threshold, which is defined by two
	tunables:

	/proc/sys/kernel/sched_spill_nr_run (default: 10)
	/proc/sys/kernel/sched_spill_load (default: 100%)

	A cpu is considered to be above its spill level if it already has 10
	tasks or if the sum of the task's load (scaled in reference to the
	given cpu) and rq->cumulative_runnable_avg exceeds
	'sched_spill_load'.

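The spill test above can be sketched as a predicate. The struct and function
names here are illustrative stand-ins (the kernel keeps these counters on the
runqueue); the two tunable parameters mirror the /proc entries, and task_load
is assumed to already be scaled to the candidate cpu:

```c
#include <stdbool.h>
#include <stdint.h>

struct cpu_stats {
	unsigned int nr_running;		/* runnable tasks on the cpu */
	uint64_t cumulative_runnable_avg;	/* sum of runnable task load */
};

/* true if placing a task with 'task_load' would exceed the spill level */
static bool spill_threshold_crossed(uint64_t task_load,
				    const struct cpu_stats *stats,
				    unsigned int spill_nr_run,
				    uint64_t spill_load)
{
	return stats->nr_running >= spill_nr_run ||
	       stats->cumulative_runnable_avg + task_load > spill_load;
}
```

Either condition alone is enough to spill: a cpu with ten runnable tasks is
over its level regardless of load, and a lightly populated cpu still spills
once the combined runnable load would exceed sched_spill_load.
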
- power band

	The scheduler may be faced with a tradeoff between power and
	performance when placing a task. If the scheduler sees two CPUs which
	can accommodate a task:

	CPU 1, power cost of 20, load of 10
	CPU 2, power cost of 10, load of 15

	it is not clear what the right choice of CPU is. The HMP scheduler
	offers the sched_powerband_limit tunable to determine how this
	situation should be handled. When the power delta between two CPUs is
	less than sched_powerband_limit_pct, load is used as the deciding
	factor for which CPU is selected. If the power delta between the two
	CPUs exceeds that, the lower-power CPU is considered to be in a
	different "band" and is selected, despite perhaps having a higher
	current task load.

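The band tie-break can be sketched as below. This is a simplified model with
names of our own choosing, and it compares an absolute power delta against
the limit for clarity (the tunable itself is expressed as a percentage):

```c
/*
 * Pick between two candidate cpus: inside the same power band the
 * less-loaded cpu wins; across bands the lower-power cpu wins
 * regardless of load.
 */
static int pick_by_powerband(int cpu1, unsigned int power1, unsigned int load1,
			     int cpu2, unsigned int power2, unsigned int load2,
			     unsigned int band_limit)
{
	unsigned int delta = power1 > power2 ? power1 - power2
					     : power2 - power1;

	if (delta < band_limit)			/* same band: load decides */
		return load1 <= load2 ? cpu1 : cpu2;
	return power1 < power2 ? cpu1 : cpu2;	/* different bands: power wins */
}
```

With the example above (CPU 1: power 20, load 10; CPU 2: power 10, load 15),
a generous band limit groups the cpus together and the less-loaded CPU 1
wins; a tight limit splits them into different bands and the lower-power
CPU 2 wins.
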
*** 5.2 select_best_cpu()

The CPU placement decisions made for a task at wakeup or creation time are
the most important decisions made by the HMP scheduler. This section
describes the call flow and algorithm used in detail.

The primary entry point for a task wakeup operation is try_to_wake_up(),
located in kernel/sched/core.c. This function relies on select_task_rq() to
determine the target CPU for the waking task. For fair-class (SCHED_OTHER)
tasks, that request will be routed to select_task_rq_fair() in
kernel/sched/fair.c. As part of these scheduler extensions a hook has been
inserted at the top of that function. If HMP scheduling is enabled, the
normal scheduling behavior is replaced by a call to select_best_cpu(). This
function, select_best_cpu(), represents the heart of the HMP scheduling
algorithm described in this document. Note that select_best_cpu() is also
invoked when a task is being created.

The behavior of select_best_cpu() depends on several factors, such as the
boost setting, the values of several tunables, and task demand.

| 628 | **** 5.2.1 Boost |
| 629 | |
The task placement policy changes significantly when scheduler boost is in
effect. When boost is in effect the scheduler ignores the power cost of
placing tasks on CPUs. Instead it determines the load on each CPU and then
places the task on the least loaded CPU. If the load of two or more CPUs is
the same (generally when CPUs are idle) the task prefers to go to the
highest-capacity CPU in the system.

A further enhancement during boost is the scheduler's early detection feature.
While boost is in effect the scheduler checks for the presence of tasks that
have been runnable for over some period of time within the tick. For such
tasks the scheduler informs the governor of an imminent need for high
frequency. If there exists a task on the runqueue at the tick that has been
runnable for greater than SCHED_EARLY_DETECTION_DURATION amount of time, it
notifies the governor with a fabricated load of the full window at the highest
frequency. The fabricated load is maintained until the task is no longer
runnable or until the next tick.
| 646 | |
| 647 | Boost can be set via either /proc/sys/kernel/sched_boost or by invoking |
| 648 | kernel API sched_set_boost(). |
| 649 | |
| 650 | int sched_set_boost(int enable); |
| 651 | |
Once turned on, boost will remain in effect until it is explicitly turned off.
To allow boost to be controlled by multiple external entities (applications
or kernel modules) at the same time, the boost setting is reference counted.
This means that if two applications turn on boost, the effect of boost is
eliminated only after both applications have turned it off. The
boost_refcount variable holds this reference count.
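
The reference counting described above can be sketched as a simplified,
single-threaded model. The real sched_set_boost() additionally takes locks
and notifies interested subsystems; the error value here is an assumption:

```c
#include <assert.h>

static int boost_refcount;	/* number of entities holding boost */

int sched_set_boost(int enable)
{
	if (enable)
		boost_refcount++;
	else if (boost_refcount > 0)
		boost_refcount--;
	else
		return -1;	/* unbalanced disable (illustrative) */
	return 0;
}

int sched_boost(void)		/* is boost currently in effect? */
{
	return boost_refcount > 0;
}
```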
| 658 | |
| 659 | **** 5.2.2 task_will_fit() |
| 660 | |
The overall goal of select_best_cpu() is to place a task on the least-power
cluster where it can "fit", i.e. where its cpu usage will be below the
capacity offered by that cluster. The criteria for a task to be considered
as fitting in a cluster are:
| 665 | |
| 666 | i) A low-priority task, whose nice value is greater than |
| 667 | SCHED_UPMIGRATE_MIN_NICE or whose cgroup has its |
| 668 | upmigrate_discourage flag set, is considered to be fitting in all clusters, |
| 669 | irrespective of their capacity and task's cpu demand. |
| 670 | |
| 671 | ii) All tasks are considered to fit in highest capacity cluster. |
| 672 | |
iii) Task demand scaled in reference to the given cluster should be less than a
     threshold. See the section on load_scale_factor to learn more about how
     task demand is scaled in reference to a given cpu (cluster). The
     threshold used is normally sched_upmigrate. It's possible for a task's
     demand to exceed the sched_upmigrate threshold in reference to a cluster
     after it is upmigrated to a higher capacity cluster. To prevent it from
     coming back immediately to the lower capacity cluster, the task is not
     considered to "fit" on its earlier cluster until its demand has dropped
     below sched_downmigrate in reference to that earlier cluster.
     sched_downmigrate thus provides some hysteresis control.
| 683 | |
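The fit criteria, including the upmigrate/downmigrate hysteresis, can be
sketched as below. The function signature, the demand-as-percentage model and
the SCHED_UPMIGRATE_MIN_NICE value are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_UPMIGRATE_MIN_NICE 15	/* assumed value for illustration */

/* 'scaled_demand_pct' is the task's demand scaled in reference to the
 * candidate cluster (see the load_scale_factor section). */
bool task_will_fit(int nice, bool upmigrate_discourage,
		   bool highest_capacity_cluster,
		   bool task_on_bigger_cluster,	/* already upmigrated? */
		   int scaled_demand_pct,
		   int sched_upmigrate, int sched_downmigrate)
{
	/* i) low-priority or discouraged tasks fit anywhere */
	if (nice > SCHED_UPMIGRATE_MIN_NICE || upmigrate_discourage)
		return true;
	/* ii) everything fits on the highest capacity cluster */
	if (highest_capacity_cluster)
		return true;
	/* iii) demand below threshold; use the lower sched_downmigrate
	 * threshold once the task has been upmigrated (hysteresis) */
	int threshold = task_on_bigger_cluster ? sched_downmigrate
					       : sched_upmigrate;
	return scaled_demand_pct < threshold;
}
```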
| 684 | |
| 685 | **** 5.2.3 Factors affecting select_best_cpu() |
| 686 | |
The behavior of select_best_cpu() is further controlled by several tunables
and by the synchronous nature of the wakeup.
| 689 | |
a. /proc/sys/kernel/sched_cpu_high_irqload
	A cpu whose irq load is greater than this threshold will not be
	considered eligible for placement. This threshold value is expressed
	in nanoseconds, with the default threshold being 10000000 (10ms). See
	the notes on the sched_cpu_high_irqload tunable to understand how irq
	load on a cpu is measured.
| 696 | |
b. Synchronous nature of wakeup
	A synchronous wakeup is a hint to the scheduler that the task issuing
	the wakeup (i.e. the task currently running on the cpu where the
	wakeup is being processed by the scheduler) will "soon" relinquish the
	CPU. A simple example is two tasks communicating with each other using
	a pipe. When the reader task blocks waiting for data, it is woken by
	the writer task after the latter has written data to the pipe. The
	writer task usually blocks waiting for the reader task to consume the
	data in the pipe (which may not have any more room for writes).
| 705 | |
| 706 | Synchronous wakeup is accounted for by adjusting load of a cpu to not |
| 707 | include load of currently running task. As a result, a cpu that has only |
| 708 | one runnable task and which is currently processing synchronous wakeup |
| 709 | will be considered idle. |
| 710 | |
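The sync-wakeup accounting above can be sketched as follows; the struct and
field names are stand-ins for the kernel's internal per-CPU state:

```c
#include <assert.h>

/* For a synchronous wakeup, the load of the currently running task (the
 * waker) is discounted, since it is expected to relinquish the CPU "soon". */
struct cpu_stats {
	unsigned long long cumulative_runnable_avg;	/* aggregate task load */
	unsigned long long curr_task_load;		/* load of running task */
	int nr_running;
};

unsigned long long effective_load(struct cpu_stats *c, int sync)
{
	unsigned long long load = c->cumulative_runnable_avg;

	if (sync && load >= c->curr_task_load)
		load -= c->curr_task_load;
	return load;
}
```
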
c. PF_WAKE_UP_IDLE
	Any task with this flag set will be woken up to an idle cpu (if one is
	available) independent of the sched_prefer_idle flag setting, its
	demand and the synchronous nature of the wakeup. Similarly, an idle
	cpu is preferred during wakeup for any task that does not have this
	flag set but is being woken by a task with the PF_WAKE_UP_IDLE flag
	set. For simplicity, we will use the term "PF_WAKE_UP_IDLE wakeup" to
	signify wakeups involving a task with the PF_WAKE_UP_IDLE flag set.
| 719 | |
d. /proc/sys/kernel/sched_select_prev_cpu_us
	This threshold controls whether task placement goes through the fast
	path or not. If the task's wakeup comes shortly after its last sleep,
	there is a high chance that it is better to place the task on its
	previous CPU. This reduces task placement latency, cache misses and
	the number of migrations. The default value of
	sched_select_prev_cpu_us is 2000 (2ms). This can be turned off by
	setting it to 0.
| 727 | |
e. /proc/sys/kernel/sched_short_burst_ns
	This threshold controls whether a task is considered "short-burst"
	or not. "short-burst" tasks are eligible for packing to avoid the
	overhead associated with waking up an idle CPU. "non-idle" CPUs which
	are not loaded with IRQs and can accommodate the waking task without
	exceeding spill limits are considered. Ties are broken by load,
	followed by the previous CPU. This tunable does not affect cluster
	selection; it only affects CPU selection within a given cluster. This
	packing is skipped for tasks that are eligible for "wake-up-idle" and
	"boost".
| 737 | |
**** 5.2.4 Wakeup Logic for Task "p"

Wakeup task placement logic is as follows:

1) Eliminate CPUs with high irq load based on the sched_cpu_high_irqload
tunable.

2) Eliminate CPUs where either the task does not fit or where placement
will result in exceeding the spill threshold tunables. CPUs eliminated at this
stage will be considered as backup choices in case none of the CPUs get past
this stage.
| 748 | |
| 749 | 3) Find out and return the least power CPU that satisfies all conditions above. |
| 750 | |
4) If two or more CPUs are projected to have the same power, break ties in the
following preference order:
   a) The CPU is the task's previous CPU.
   b) The CPU is in the same cluster as the task's previous CPU.
   c) The CPU has the least load.

5) If no CPU is found after step 2, resort to backup CPU selection logic
whereby the CPU with the highest amount of spare capacity is selected.

6) If none of the CPUs have any spare capacity, return the task's previous
CPU.

The placement logic described above does not apply when PF_WAKE_UP_IDLE is set
for either the waker task or the wakee task. Instead the scheduler chooses the
most power-efficient idle CPU.
| 766 | |
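The selection steps above can be sketched as a skeleton. Every type and
helper here is an illustrative stand-in, not the kernel's actual interface,
and the tie-break is reduced to the preference order listed:

```c
#include <assert.h>

struct cand {
	int cpu;
	long long power, load, spare;
	int prev_cpu, same_cluster;	/* tie-break preferences (0/1) */
	int high_irqload, fits;
};

int select_best_cpu(struct cand *c, int n)
{
	int best = -1, backup = -1;

	for (int i = 0; i < n; i++) {
		if (c[i].high_irqload)			/* step 1 */
			continue;
		if (!c[i].fits) {			/* step 2: backup pool */
			if (backup < 0 || c[i].spare > c[backup].spare)
				backup = i;
			continue;
		}
		if (best < 0 || c[i].power < c[best].power) {	/* step 3 */
			best = i;
		} else if (c[i].power == c[best].power) {	/* step 4 */
			if (c[i].prev_cpu > c[best].prev_cpu ||
			    (c[i].prev_cpu == c[best].prev_cpu &&
			     (c[i].same_cluster > c[best].same_cluster ||
			      (c[i].same_cluster == c[best].same_cluster &&
			       c[i].load < c[best].load))))
				best = i;
		}
	}
	return best >= 0 ? best : backup;	/* steps 5-6, simplified */
}
```
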
| 767 | *** 5.3 Scheduler Tick |
| 768 | |
Every CPU is interrupted periodically to let the kernel update various
statistics and possibly preempt the currently running task in favor of a
waiting task. This periodicity, determined by the CONFIG_HZ value, is set at
10ms. There are various optimizations by which a CPU, however, can skip taking
these interrupts (ticks). A cpu going idle for a considerable time is one such
case.
| 774 | |
The HMP scheduler extensions bring in a change in the processing of the tick
(scheduler_tick()) that can result in task migration. In case the currently
running task on a cpu belongs to the fair_sched class, a check is made whether
it needs to be migrated. Possible reasons for migrating the task could be:
| 779 | |
| 780 | a) A big task is running on a power-efficient cpu and a high-performance cpu is |
| 781 | available (idle) to service it |
| 782 | |
| 783 | b) A task is starving on a CPU with high irq load. |
| 784 | |
| 785 | c) A task with upmigration discouraged is running on a performance cluster. |
| 786 | See notes on 'cpu.upmigrate_discourage'. |
| 787 | |
In case the test for migration turns out positive (which is expected to be a
rare event), a candidate cpu is identified for task migration. To avoid
multiple task migrations to the same candidate cpu(s), identification of the
candidate cpu is serialized via a global spinlock (migration_lock).
| 792 | |
| 793 | *** 5.4 Load Balancer |
| 794 | |
Load balancing is a key scheduler functionality that strives to distribute
tasks across available cpus in a "fair" manner. Most of the complexity
associated with this feature involves balancing fair_sched class tasks. The
changes made to the load balance code serve these goals:
| 799 | |
1. Restrict the flow of tasks from power-efficient cpus to high-performance
   cpus. Provide a spill-over threshold, defined in terms of number of tasks
   (sched_spill_nr_run) and cpu demand (sched_spill_load), beyond which tasks
   can spill over from power-efficient cpus to high-performance cpus.

2. Allow idle power-efficient cpus to pick up extra load from an over-loaded
   high-performance cpu.

3. Allow an idle high-performance cpu to pick up big tasks from a
   power-efficient cpu.
| 809 | |
| 810 | *** 5.5 Real Time Tasks |
| 811 | |
The minimal changes introduced in the treatment of real-time tasks by the HMP
scheduler aim at preferring to schedule real-time tasks on cpus with low load
on a power-efficient cluster.
| 815 | |
Prior to the HMP scheduler, the fast-path cpu selection for placing a
real-time task (at wakeup) was its previous cpu, provided the currently
running task on its previous cpu was not a real-time task, or was a real-time
task with lower priority. Failing this, cpu selection in the slow-path
involves building a list of candidate cpus where the waking real-time task
will be of the highest priority and thus can run immediately. The first cpu
from this candidate list is chosen for the waking real-time task. Much of the
premise for this simple approach is the assumption that real-time tasks often
execute for very short intervals and thus the focus is to place them on a cpu
where they can run immediately.
| 825 | |
The HMP scheduler brings in a change which avoids the fast-path and always
resorts to the slow-path. Further, the cpu with the lowest load in a
power-efficient cluster from the candidate list of cpus is chosen for placing
the waking real-time task.
| 829 | |
| 830 | - PF_WAKE_UP_IDLE |
| 831 | |
An idle cpu is preferred for any waking task that has this flag set in its
'task_struct.flags' field. Further, an idle cpu is preferred for any task woken
by such tasks. The PF_WAKE_UP_IDLE flag of a task is inherited by its children.
It can be modified for a task in two ways:
| 836 | |
| 837 | > kernel-space interface |
| 838 | set_wake_up_idle() needs to be called in the context of a task |
| 839 | to set or clear its PF_WAKE_UP_IDLE flag. |
| 840 | |
| 841 | > user-space interface |
	/proc/[pid]/sched_wake_up_idle file needs to be written to for
	setting or clearing the PF_WAKE_UP_IDLE flag for a given task.
| 844 | |
| 845 | ===================== |
| 846 | 6. FREQUENCY GUIDANCE |
| 847 | ===================== |
| 848 | |
| 849 | As mentioned in the introduction section the scheduler is in a unique |
| 850 | position to assist with the determination of CPU frequency. Because |
| 851 | the scheduler now maintains an estimate of per-task CPU demand, task |
| 852 | activity can be tracked, aggregated and provided to the CPUfreq |
| 853 | governor as a replacement for simple CPU busy time. |
| 854 | |
| 855 | Two of the most popular CPUfreq governors, interactive and ondemand, |
| 856 | utilize a window-based approach for measuring CPU busy time. This |
| 857 | works well with the window-based load tracking scheme previously |
| 858 | described. The following APIs are provided to allow the CPUfreq |
| 859 | governor to query busy time from the scheduler instead of using the |
| 860 | basic CPU busy time value derived via get_cpu_idle_time_us() and |
| 861 | get_cpu_iowait_time_us() APIs. |
| 862 | |
| 863 | int sched_set_window(u64 window_start, unsigned int window_size) |
| 864 | |
This API is invoked by the governor at initialization time or whenever the
window size is changed. The 'window_size' argument (in jiffy units)
indicates the size of the window to be used. The first window of size
'window_size' is set to begin at jiffy 'window_start'.

-EINVAL is returned if per-entity load tracking is in use rather
than window-based load tracking, otherwise a success value of 0
is returned.
| 873 | |
| 874 | int sched_get_busy(int cpu) |
| 875 | |
| 876 | Returns the busy time for the given CPU in the most recent |
| 877 | complete window. The value returned is microseconds of busy |
| 878 | time at fmax of given CPU. |
| 879 | |
| 880 | The values returned by sched_get_busy() take a bit of explanation, |
| 881 | both in what they mean and also how they are derived. |
| 882 | |
| 883 | *** 6.1 Per-CPU Window-Based Stats |
| 884 | |
| 885 | In addition to the per-task window-based demand, the HMP scheduler |
| 886 | extensions also track the aggregate demand seen on each CPU. This is |
| 887 | done using the same windows that the task demand is tracked with |
| 888 | (which is in turn set by the governor when frequency guidance is in |
| 889 | use). There are four quantities maintained for each CPU by the HMP scheduler: |
| 890 | |
| 891 | curr_runnable_sum: aggregate demand from all tasks which executed during |
| 892 | the current (not yet completed) window |
| 893 | |
| 894 | prev_runnable_sum: aggregate demand from all tasks which executed during |
| 895 | the most recent completed window |
| 896 | |
| 897 | nt_curr_runnable_sum: aggregate demand from all 'new' tasks which executed |
| 898 | during the current (not yet completed) window |
| 899 | |
| 900 | nt_prev_runnable_sum: aggregate demand from all 'new' tasks which executed |
| 901 | during the most recent completed window. |
| 902 | |
| 903 | When the scheduler is updating a task's window-based stats it also |
| 904 | updates these values. Like per-task window-based demand these |
| 905 | quantities are normalized against the max possible frequency and max |
| 906 | efficiency (instructions per cycle) in the system. If an update occurs |
| 907 | and a window rollover is observed, curr_runnable_sum is copied into |
| 908 | prev_runnable_sum before being reset to 0. The sched_get_busy() API |
| 909 | returns prev_runnable_sum, scaled to the efficiency and fmax of given |
| 910 | CPU. The same applies to nt_curr_runnable_sum and nt_prev_runnable_sum. |
| 911 | |
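The rollover described above can be sketched as follows; the struct is an
illustrative stand-in for the kernel's per-CPU accounting state:

```c
#include <assert.h>

/* On a window rollover, the current-window sums become the
 * previous-window sums and the current sums restart from zero. */
struct cpu_window_stats {
	unsigned long long curr_runnable_sum, prev_runnable_sum;
	unsigned long long nt_curr_runnable_sum, nt_prev_runnable_sum;
};

void rollover_window(struct cpu_window_stats *s)
{
	s->prev_runnable_sum = s->curr_runnable_sum;
	s->nt_prev_runnable_sum = s->nt_curr_runnable_sum;
	s->curr_runnable_sum = 0;
	s->nt_curr_runnable_sum = 0;
}
```
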
A 'new' task is defined as a task whose number of active windows since fork is
less than SCHED_NEW_TASK_WINDOWS. An active window is defined as a window
where a task was observed to be runnable.
| 915 | |
| 916 | *** 6.2 Per-task window-based stats |
| 917 | |
Corresponding to curr_runnable_sum and prev_runnable_sum, two counters are
maintained per task:

curr_window - represents the cpu demand of the task in its most recently
              tracked window
prev_window - represents the cpu demand of the task in the window prior to
              the one being tracked by curr_window

The above counters are reused for nt_curr_runnable_sum and
nt_prev_runnable_sum.
| 928 | |
| 929 | "cpu demand" of a task includes its execution time and can also include its |
| 930 | wait time. 'SCHED_FREQ_ACCOUNT_WAIT_TIME' controls whether task's wait |
| 931 | time is included in its 'curr_window' and 'prev_window' counters or not. |
| 932 | |
Needless to say, the curr_runnable_sum counter of a cpu is derived from the
curr_window counters of the various tasks that ran on it in its most recent
window.
| 935 | |
| 936 | *** 6.3 Effect of various task events |
| 937 | |
| 938 | We now consider various events and how they affect above mentioned counters. |
| 939 | |
PICK_NEXT_TASK
	This represents the beginning of execution for a task. Provided the
	task is a non-idle task and SCHED_FREQ_ACCOUNT_WAIT_TIME is set, the
	portion of the task's wait time that corresponds to the current
	window being tracked on a cpu is added to the task's curr_window
	counter. The same quantum is also added to the cpu's
	curr_runnable_sum counter. The remaining portion, which corresponds
	to the task's wait time in the previous window, is added to the
	task's prev_window and the cpu's prev_runnable_sum counters.
| 949 | |
PUT_PREV_TASK
	This represents the end of execution of a time-slice for a task,
	where the task could also be a cpu's idle task. If the task is
	non-idle, or if the task is idle while the cpu has a non-zero
	rq->nr_iowait count and sched_io_is_busy = 1, the portion of the
	task's execution time that corresponds to the current window being
	tracked on the cpu is added to the task's curr_window counter and
	also to the cpu's curr_runnable_sum counter. The portion of the
	task's execution that corresponds to the previous window is added to
	the task's prev_window and the cpu's prev_runnable_sum counters.
| 959 | |
| 960 | TASK_UPDATE |
| 961 | This event is called on a cpu's currently running task and hence |
| 962 | behaves effectively as PUT_PREV_TASK. Task continues executing after |
| 963 | this event, until PUT_PREV_TASK event occurs on the task (during |
| 964 | context switch). |
| 965 | |
| 966 | TASK_WAKE |
| 967 | This event signifies a task waking from sleep. Since many windows |
| 968 | could have elapsed since the task went to sleep, its curr_window |
| 969 | and prev_window are updated to reflect task's demand in the most |
| 970 | recent and its previous window that is being tracked on a cpu. |
| 971 | |
| 972 | TASK_MIGRATE |
| 973 | This event signifies task migration across cpus. It is invoked on the |
| 974 | task prior to being moved. Thus at the time of this event, the task |
| 975 | can be considered to be in "waiting" state on src_cpu. In that way |
| 976 | this event reflects actions taken under PICK_NEXT_TASK (i.e its |
| 977 | wait time is added to task's curr/prev_window counters as well |
| 978 | as src_cpu's curr/prev_runnable_sum counters, provided |
| 979 | SCHED_FREQ_ACCOUNT_WAIT_TIME is non-zero). After that update, |
| 980 | src_cpu's curr_runnable_sum is reduced by task's curr_window value |
| 981 | and dst_cpu's curr_runnable_sum is increased by task's curr_window |
| 982 | value. Similarly, src_cpu's prev_runnable_sum is reduced by task's |
| 983 | prev_window value and dst_cpu's prev_runnable_sum is increased by |
| 984 | task's prev_window value. |
| 985 | |
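The migration fixup described under TASK_MIGRATE can be sketched as follows;
the structs are illustrative stand-ins for the task and per-CPU counters:

```c
#include <assert.h>

struct cpu_sums { unsigned long long curr_runnable_sum, prev_runnable_sum; };
struct task_windows { unsigned long long curr_window, prev_window; };

/* The migrating task's window demand is moved from the source CPU's
 * aggregates to the destination CPU's aggregates. */
void migrate_fixup(struct task_windows *t,
		   struct cpu_sums *src, struct cpu_sums *dst)
{
	src->curr_runnable_sum -= t->curr_window;
	dst->curr_runnable_sum += t->curr_window;
	src->prev_runnable_sum -= t->prev_window;
	dst->prev_runnable_sum += t->prev_window;
}
```
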
IRQ_UPDATE
	This event signifies the end of execution of an interrupt handler.
	This event results in an update of the cpu's busy time counters,
	curr_runnable_sum and prev_runnable_sum, provided the cpu was idle.
	When sched_io_is_busy = 0, only the interrupt handling time is added
	to the cpu's curr_runnable_sum and prev_runnable_sum counters. When
	sched_io_is_busy = 1, the event mirrors the actions taken under the
	TASK_UPDATE event, i.e. the time since the last accounting of the
	idle task's cpu usage is added to the cpu's curr_runnable_sum and
	prev_runnable_sum counters.
| 996 | |
| 997 | =========== |
| 998 | 7. TUNABLES |
| 999 | =========== |
| 1000 | |
| 1001 | *** 7.1 sched_spill_load |
| 1002 | |
| 1003 | Appears at: /proc/sys/kernel/sched_spill_load |
| 1004 | |
| 1005 | Default value: 100 |
| 1006 | |
The CPU selection criterion for fair-sched class tasks is the lowest-power
cpu where they can fit. When the most power-efficient cpu where a task can
fit is overloaded (the aggregate demand of tasks currently queued on it
exceeds sched_spill_load), the task can be placed on a higher-performance
cpu, even though the task strictly doesn't need one.
| 1012 | |
| 1013 | *** 7.2 sched_spill_nr_run |
| 1014 | |
| 1015 | Appears at: /proc/sys/kernel/sched_spill_nr_run |
| 1016 | |
| 1017 | Default value: 10 |
| 1018 | |
| 1019 | The intent of this tunable is similar to sched_spill_load, except it applies to |
| 1020 | nr_running count of a cpu. A task can spill over to a higher-performance cpu |
| 1021 | when the most power-efficient cpu where it can normally fit has more tasks than |
| 1022 | sched_spill_nr_run. |
| 1023 | |
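The two spill thresholds can be combined into a single overload test, as
sketched below. The percentage-based load model and the function name are
illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* A task may spill over to a higher-performance cpu when the preferred
 * cpu breaches either spill threshold. */
bool cpu_overloaded(int nr_running, int aggregate_load_pct,
		    int sched_spill_nr_run, int sched_spill_load)
{
	return nr_running > sched_spill_nr_run ||
	       aggregate_load_pct > sched_spill_load;
}
```
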
| 1024 | *** 7.3 sched_upmigrate |
| 1025 | |
| 1026 | Appears at: /proc/sys/kernel/sched_upmigrate |
| 1027 | |
| 1028 | Default value: 80 |
| 1029 | |
| 1030 | This tunable is a percentage. If a task consumes more than this much |
| 1031 | of a CPU, the CPU is considered too small for the task and the |
| 1032 | scheduler will try to find a bigger CPU to place the task on. |
| 1033 | |
| 1034 | *** 7.4 sched_init_task_load |
| 1035 | |
| 1036 | Appears at: /proc/sys/kernel/sched_init_task_load |
| 1037 | |
| 1038 | Default value: 15 |
| 1039 | |
| 1040 | This tunable is a percentage. When a task is first created it has no |
| 1041 | history, so the task load tracking mechanism cannot determine a |
| 1042 | historical load value to assign to it. This tunable specifies the |
| 1043 | initial load value for newly created tasks. Also see Sec 2.8 on per-task |
| 1044 | 'initial task load' attribute. |
| 1045 | |
| 1046 | *** 7.5 sched_ravg_hist_size |
| 1047 | |
| 1048 | Appears at: /proc/sys/kernel/sched_ravg_hist_size |
| 1049 | |
| 1050 | Default value: 5 |
| 1051 | |
| 1052 | This tunable controls the number of samples used from task's sum_history[] |
| 1053 | array for determination of its demand. |
| 1054 | |
| 1055 | *** 7.6 sched_window_stats_policy |
| 1056 | |
| 1057 | Appears at: /proc/sys/kernel/sched_window_stats_policy |
| 1058 | |
| 1059 | Default value: 2 |
| 1060 | |
| 1061 | This tunable controls the policy in how window-based load tracking |
| 1062 | calculates an overall demand value based on the windows of CPU |
| 1063 | utilization it has collected for a task. |
| 1064 | |
| 1065 | Possible values for this tunable are: |
| 1066 | 0: Just use the most recent window sample of task activity when calculating |
| 1067 | task demand. |
| 1068 | 1: Use the maximum value of first M samples found in task's cpu demand |
| 1069 | history (sum_history[] array), where M = sysctl_sched_ravg_hist_size |
| 1070 | 2: Use the maximum of (the most recent window sample, average of first M |
| 1071 | samples), where M = sysctl_sched_ravg_hist_size |
3: Use the average of first M samples, where M = sysctl_sched_ravg_hist_size
| 1073 | |
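The four policies can be sketched as below, where hist[0] is the most recent
sample and m plays the role of sysctl_sched_ravg_hist_size; the function name
is an illustrative assumption:

```c
#include <assert.h>

unsigned int task_demand(const unsigned int *hist, int m, int policy)
{
	unsigned int max = 0, sum = 0;

	for (int i = 0; i < m; i++) {
		if (hist[i] > max)
			max = hist[i];
		sum += hist[i];
	}
	unsigned int avg = sum / m;

	switch (policy) {
	case 0: return hist[0];				/* most recent */
	case 1: return max;				/* max of M samples */
	case 2: return hist[0] > avg ? hist[0] : avg;	/* max(recent, avg) */
	default: return avg;				/* avg of M samples */
	}
}
```
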
| 1074 | *** 7.7 sched_ravg_window |
| 1075 | |
| 1076 | Appears at: kernel command line argument |
| 1077 | |
| 1078 | Default value: 10000000 (10ms, units of tunable are nanoseconds) |
| 1079 | |
| 1080 | This specifies the duration of each window in window-based load |
| 1081 | tracking. By default each window is 10ms long. This quantity must |
| 1082 | currently be set at boot time on the kernel command line (or the |
| 1083 | default value of 10ms can be used). |
| 1084 | |
| 1085 | *** 7.8 RAVG_HIST_SIZE |
| 1086 | |
| 1087 | Appears at: compile time only (see RAVG_HIST_SIZE in include/linux/sched.h) |
| 1088 | |
| 1089 | Default value: 5 |
| 1090 | |
| 1091 | This macro specifies the number of windows the window-based load |
| 1092 | tracking mechanism maintains per task. If default values are used for |
| 1093 | both this and sched_ravg_window then a total of 50ms of task history |
| 1094 | would be maintained in 5 10ms windows. |
| 1095 | |
| 1096 | *** 7.9 sched_freq_inc_notify |
| 1097 | |
| 1098 | Appears at: /proc/sys/kernel/sched_freq_inc_notify |
| 1099 | |
Default value: 10 * 1024 * 1024 (~10 GHz)
| 1101 | |
When the scheduler detects that the cur_freq of a cluster is insufficient to
meet demand, it sends a notification to the governor, provided (freq_required
- cur_freq) exceeds sched_freq_inc_notify, where freq_required is the
frequency calculated by the scheduler to meet the current task demand. Note
that sched_freq_inc_notify is specified in kHz units.
| 1107 | |
| 1108 | *** 7.10 sched_freq_dec_notify |
| 1109 | |
| 1110 | Appears at: /proc/sys/kernel/sched_freq_dec_notify |
| 1111 | |
Default value: 10 * 1024 * 1024 (~10 GHz)
| 1113 | |
When the scheduler detects that the cur_freq of a cluster is far greater than
what is needed to serve the current task demand, it will send a notification
to the governor. More specifically, a notification is sent when (cur_freq -
freq_required) exceeds sched_freq_dec_notify, where freq_required is the
frequency calculated by the scheduler to meet the current task demand. Note
that sched_freq_dec_notify is specified in kHz units.
| 1120 | |
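Both notification filters follow the same pattern, sketched below; the
function name is an illustrative assumption and all values are in kHz:

```c
#include <assert.h>
#include <stdbool.h>

/* Notify the governor only when the gap between the current frequency
 * and the scheduler-computed required frequency is large enough. */
bool should_notify_governor(unsigned int cur_freq, unsigned int freq_required,
			    unsigned int inc_notify, unsigned int dec_notify)
{
	if (freq_required > cur_freq)
		return freq_required - cur_freq > inc_notify;
	return cur_freq - freq_required > dec_notify;
}
```
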
| 1121 | *** 7.11 sched_cpu_high_irqload |
| 1122 | |
| 1123 | Appears at: /proc/sys/kernel/sched_cpu_high_irqload |
| 1124 | |
| 1125 | Default value: 10000000 (10ms) |
| 1126 | |
| 1127 | The scheduler keeps a decaying average of the amount of irq and softirq activity |
| 1128 | seen on each CPU within a ten millisecond window. Note that this "irqload" |
| 1129 | (reported in the sched_cpu_load_* tracepoint) will be higher than the typical load |
| 1130 | in a single window since every time the window rolls over, the value is decayed |
| 1131 | by some fraction and then added to the irq/softirq time spent in the next |
| 1132 | window. |
| 1133 | |
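The decaying average can be sketched as below; the decay fraction of 1/2 per
window rollover is an assumption for illustration, not the kernel's exact
factor:

```c
#include <assert.h>

/* On each window rollover the accumulated irqload is scaled down before
 * the next window's irq/softirq time accrues; times in nanoseconds. */
unsigned long long decay_irqload(unsigned long long irqload,
				 unsigned long long irq_time_this_window)
{
	return irqload / 2 + irq_time_this_window;
}
```

Because of the carry-over, a steady 6ms of irq time per 10ms window settles
near 12ms of reported irqload, above what any single window contains.
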
| 1134 | When the irqload on a CPU exceeds the value of this tunable, the CPU is no |
| 1135 | longer eligible for placement. This will affect the task placement logic |
| 1136 | described above, causing the scheduler to try and steer tasks away from |
| 1137 | the CPU. |
| 1138 | |
| 1139 | *** 7.12 cpu.upmigrate_discourage |
| 1140 | |
| 1141 | Default value : 0 |
| 1142 | |
| 1143 | This is a cgroup attribute supported by the cpu resource controller. It normally |
| 1144 | appears at [root_cpu]/[name1]/../[name2]/cpu.upmigrate_discourage. Here |
| 1145 | "root_cpu" is the mount point for cgroup (cpu resource control) filesystem |
| 1146 | and name1, name2 etc are names of cgroups that form a hierarchy. |
| 1147 | |
Setting this flag to 1 discourages upmigration for all tasks of a cgroup.
High demand tasks of such a cgroup will never be classified as big tasks and
hence not upmigrated. Any task of the cgroup is allowed to upmigrate only
under an overcommitted scenario. See the notes on sched_spill_nr_run and
sched_spill_load for how the overcommitment threshold is defined.
| 1153 | |
| 1154 | *** 7.13 sched_static_cpu_pwr_cost |
| 1155 | |
| 1156 | Default value: 0 |
| 1157 | |
| 1158 | Appears at /sys/devices/system/cpu/cpu<x>/sched_static_cpu_pwr_cost |
| 1159 | |
This is the power cost associated with bringing an idle CPU out of low power
mode. It ignores the actual C-state that a CPU may be in and assumes the
worst-case power cost of the highest C-state. It is a means of biasing task
placement away from idle CPUs when necessary. It can be defined per CPU;
however, a more appropriate usage is to define the same value for every CPU
within a cluster and possibly have differing values between clusters as
needed.
| 1167 | |
| 1168 | |
| 1169 | *** 7.14 sched_static_cluster_pwr_cost |
| 1170 | |
| 1171 | Default value: 0 |
| 1172 | |
| 1173 | Appears at /sys/devices/system/cpu/cpu<x>/sched_static_cluster_pwr_cost |
| 1174 | |
This is the power cost associated with bringing an idle cluster out of low
power mode. It ignores the actual D-state that a cluster may be in and
assumes the worst-case power cost of the highest D-state. It is a means of
biasing task placement away from idle clusters when necessary.
| 1179 | |
| 1180 | *** 7.15 sched_restrict_cluster_spill |
| 1181 | |
| 1182 | Default value: 0 |
| 1183 | |
| 1184 | Appears at /proc/sys/kernel/sched_restrict_cluster_spill |
| 1185 | |
This tunable can be used to restrict tasks from spilling to the higher
capacity (higher power) cluster. When this tunable is enabled,

- The higher capacity cluster is restricted from pulling tasks from the lower
capacity cluster in the load balance path. The restriction is lifted if all
of the CPUs in the lower capacity cluster are above spill. Power cost is used
to break ties if the capacities of the clusters are the same when applying
this restriction.

- The current CPU selection algorithm for RT tasks looks for the least loaded
CPU across all clusters. When this tunable is enabled, RT tasks are
restricted to the lowest possible power cluster.
| 1197 | |
| 1198 | |
| 1199 | *** 7.16 sched_downmigrate |
| 1200 | |
| 1201 | Appears at: /proc/sys/kernel/sched_downmigrate |
| 1202 | |
| 1203 | Default value: 60 |
| 1204 | |
This tunable is a percentage. It exists to provide hysteresis. Let's say a
task migrated to a high-performance cpu when it crossed 80% demand on a
power-efficient cpu. We don't let it come back to a power-efficient cpu until
its demand *in reference to the power-efficient cpu* drops below 60%
(sched_downmigrate).
| 1210 | |
| 1211 | |
| 1212 | *** 7.17 sched_small_wakee_task_load |
| 1213 | |
| 1214 | Appears at: /proc/sys/kernel/sched_small_wakee_task_load |
| 1215 | |
| 1216 | Default value: 10 |
| 1217 | |
This tunable is a percentage. It configures the maximum demand of a small
wakee task. Sync wakee tasks which have demand less than
sched_small_wakee_task_load are categorized as small wakee tasks. The
scheduler places small wakee tasks on the waker's cluster.
| 1222 | |
| 1223 | |
| 1224 | *** 7.18 sched_big_waker_task_load |
| 1225 | |
| 1226 | Appears at: /proc/sys/kernel/sched_big_waker_task_load |
| 1227 | |
| 1228 | Default value: 25 |
| 1229 | |
| 1230 | This tunable is a percentage. Configure the minimum demand of big sync waker |
| 1231 | task. Scheduler places small wakee tasks woken up by big sync waker on the |
| 1232 | waker's cluster. |

*** 7.19 sched_prefer_sync_wakee_to_waker

Appears at: /proc/sys/kernel/sched_prefer_sync_wakee_to_waker

Default value: 0

The default sync wakee policy prefers selecting an idle CPU in the waker
cluster over the waker CPU running only one task. By selecting an idle CPU,
it eliminates the chance of the waker migrating to a different CPU after the
wakee preempts it. This policy is also not susceptible to incorrect "sync"
usage, i.e. the waker does not go to sleep after waking up the wakee.

However, the LPM exit latency associated with an idle CPU outweighs the above
benefits on some targets. When this knob is turned on, the waker CPU is
selected if it has only one runnable task.
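
The sync wakee placement described in sections 7.17-7.19 can be combined in
one sketch. This is an illustrative model under the documented defaults; the
function and its parameters are assumptions, not kernel APIs.

```python
SMALL_WAKEE_TASK_LOAD = 10   # sched_small_wakee_task_load (%)
BIG_WAKER_TASK_LOAD = 25     # sched_big_waker_task_load (%)

def sync_wakeup_target(wakee_demand, waker_demand, waker_cpu,
                       waker_nr_running, idle_cpu_in_waker_cluster,
                       prefer_sync_wakee_to_waker):
    """Pick a placement for a sync wakee. Returns the waker CPU, an idle
    CPU in the waker's cluster, or None to fall back to the normal
    select_best_cpu() path."""
    small_wakee = wakee_demand < SMALL_WAKEE_TASK_LOAD
    big_waker = waker_demand >= BIG_WAKER_TASK_LOAD
    if not (small_wakee and big_waker):
        return None                      # normal placement path
    if prefer_sync_wakee_to_waker and waker_nr_running == 1:
        return waker_cpu                 # avoid LPM exit latency
    if idle_cpu_in_waker_cluster is not None:
        return idle_cpu_in_waker_cluster
    return waker_cpu
```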

=============================
8. HMP SCHEDULER TRACE POINTS
=============================

*** 8.1 sched_enq_deq_task

Logged when a task is either enqueued or dequeued on a CPU's run queue.

<idle>-0 [004] d.h4 12700.711665: sched_enq_deq_task: cpu=4 enqueue comm=powertop pid=13227 prio=120 nr_running=1 cpu_load=0 rt_nr_running=0 affine=ff demand=13364423

- cpu: the CPU that the task is being enqueued on to or dequeued off of
- enqueue/dequeue: whether this was an enqueue or dequeue event
- comm: name of task
- pid: PID of task
- prio: priority of task
- nr_running: number of runnable tasks on this CPU
- cpu_load: current priority-weighted load on the CPU (note, this is *not*
  the same as CPU utilization or a metric tracked by PELT/window-based
  tracking)
- rt_nr_running: number of real-time processes running on this CPU
- affine: CPU affinity mask in hex for this task (so ff is a task eligible to
  run on CPUs 0-7)
- demand: window-based task demand computed based on selected policy (recent,
  max, or average) (ns)

*** 8.2 sched_task_load

Logged when selecting the best CPU to run the task (select_best_cpu()).

sched_task_load: 4004 (adbd): demand=698425 boost=0 reason=0 sync=0 need_idle=0 best_cpu=0 latency=103177

- demand: window-based task demand computed based on selected policy (recent,
  max, or average) (ns)
- boost: whether boost is in effect
- reason: reason we are picking a new CPU:
  0: no migration - selecting a CPU for a wakeup or new task wakeup
  1: move to big CPU (migration)
  2: move to little CPU (migration)
  3: move to low irq load CPU (migration)
- sync: whether this is a synchronous wakeup
- need_idle: is an idle CPU required for this task based on PF_WAKE_UP_IDLE
- best_cpu: the CPU selected by select_best_cpu() for placement
- latency: the execution time of select_best_cpu()

*** 8.3 sched_cpu_load_*

Logged when selecting the best CPU to run a task (select_best_cpu() for fair
class tasks, find_lowest_rq_hmp() for RT tasks) and during load balancing
(update_sg_lb_stats()).

<idle>-0 [004] d.h3 12700.711541: sched_cpu_load_*: cpu 0 idle 1 nr_run 0 nr_big 0 lsf 1119 capacity 1024 cr_avg 0 irqload 3301121 fcur 729600 fmax 1459200 power_cost 5 cstate 2 temp 38

- cpu: the CPU being described
- idle: boolean indicating whether the CPU is idle
- nr_run: number of tasks running on CPU
- nr_big: number of BIG tasks running on CPU
- lsf: load scale factor - multiply normalized load by this factor to
  determine how much load a task will exert on the CPU
- capacity: capacity of CPU (based on max possible frequency and efficiency)
- cr_avg: cumulative runnable average, instantaneous sum of the demand (either
  PELT or window-based) of all the runnable tasks on a CPU (ns)
- irqload: decaying average of irq activity on CPU (ns)
- fcur: current CPU frequency (KHz)
- fmax: max CPU frequency (but not maximum _possible_ frequency) (KHz)
- power_cost: cost of running this CPU at the current frequency
- cstate: current cstate of CPU
- temp: current temperature of the CPU

The power_cost value above differs in how it is calculated depending on the
callsite of this tracepoint. The select_best_cpu() call to this tracepoint
finds the minimum frequency required to satisfy the existing load on the CPU
as well as the task being placed, and returns the power cost of that
frequency. The load balance and real time task placement paths use a fixed
frequency (the highest frequency common to all CPUs for load balancing, the
minimum frequency of the CPU for real time task placement).

*** 8.4 sched_update_task_ravg

Logged when window-based stats are updated for a task. The update may happen
for a variety of reasons; see section 2.5, "Task Events."

<idle>-0 [004] d.h4 12700.711513: sched_update_task_ravg: wc 12700711473496 ws 12700691772135 delta 19701361 event TASK_WAKE cpu 4 cur_freq 199200 cur_pid 0 task 13227 (powertop) ms 12640648272532 delta 60063200964 demand 13364423 sum 0 irqtime 0 cs 0 ps 495018 cur_window 0 prev_window 0

- wc: wallclock, output of sched_clock(), monotonically increasing time since
  boot (will roll over in 585 years) (ns)
- ws: window start, time when the current window started (ns)
- delta: time since the window started (wc - ws) (ns)
- event: what event caused this trace event to occur (see section 2.5 for more
  details)
- cpu: which CPU the task is running on
- cur_freq: the CPU's current frequency in KHz
- cur_pid: PID of the currently running task (current)
- task: PID and name of the task being updated
- ms: mark start - timestamp of the beginning of a segment of task activity,
  either sleeping or runnable/running (ns)
- delta: time since the last event within the window (wc - ms) (ns)
- demand: task demand computed based on selected policy (recent, max, or
  average) (ns)
- sum: the task's run time during the current window scaled by frequency and
  efficiency (ns)
- irqtime: length of interrupt activity (ns). A non-zero irqtime is seen
  when an idle cpu handles interrupts; that time needs to be accounted as
  cpu busy time
- cs: curr_runnable_sum of cpu (ns). See section 6.1 for more details of this
  counter.
- ps: prev_runnable_sum of cpu (ns). See section 6.1 for more details of this
  counter.
- cur_window: cpu demand of the task in its most recently tracked window (ns)
- prev_window: cpu demand of the task in the window prior to the one tracked
  by cur_window (ns)

*** 8.5 sched_update_history

Logged when update_task_ravg() is accounting task activity into one or
more windows that have completed. This may occur more than once for a
single call into update_task_ravg(). A task that ran for 24ms spanning
four 10ms windows (the last 2ms of window 1, all of windows 2 and 3,
and the first 2ms of window 4) would result in two calls into
update_history() from update_task_ravg(). The first call would record
activity in completed window 1, and the second call would record activity
for windows 2 and 3 together (samples will be 2 in the second call).
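
The splitting above can be sketched as follows. This is an illustrative
model, assuming a 10ms window; activity in the final, incomplete window
stays in cur_window/sum and is not recorded into history yet.

```python
def history_updates(start_ns, end_ns, window_ns=10_000_000):
    """Return the (runtime, samples) pairs update_history() would see for
    a task that ran from start_ns to end_ns."""
    updates = []
    win_start = (start_ns // window_ns) * window_ns
    first_win_end = win_start + window_ns
    if end_ns <= first_win_end:
        return updates            # the run never completed a window
    if start_ns > win_start:
        # partial first window is recorded on its own (samples = 1)
        updates.append((first_win_end - start_ns, 1))
        win_start = first_win_end
    # fully spanned windows are recorded together, samples = their count
    full = (end_ns - win_start) // window_ns
    if full:
        updates.append((window_ns, full))
    return updates
```

For the 24ms example above (a run from 8ms to 32ms), this yields the 2ms
partial window with samples=1 followed by the two full windows with
samples=2, matching the two update_history() calls described.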

<idle>-0 [004] d.h4 12700.711489: sched_update_history: 13227 (powertop): runtime 13364423 samples 1 event TASK_WAKE demand 13364423 (hist: 13364423 9871252 2236009 6162476 10282078) cpu 4 nr_big 0

- runtime: task cpu demand in recently completed window(s). This value is
  scaled to max_possible_freq and max_possible_efficiency. This value is
  pushed into the task's demand history array. The number of windows to
  which runtime applies is given by the samples field.
- samples: number of windows, each having the value of runtime, that are
  recorded in the task's demand history array
- event: what event caused this trace event to occur (see section 2.5 for more
  details) - PUT_PREV_TASK, PICK_NEXT_TASK, TASK_WAKE, TASK_MIGRATE,
  TASK_UPDATE
- demand: task demand computed based on selected policy (recent, max, or
  average) (ns)
- hist: last 5 windows of history for the task with the most recent window
  listed first
- cpu: CPU the task is associated with
- nr_big: number of big tasks on the CPU
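
The "recent, max, or average" demand policies referenced throughout can be
sketched over the hist array. This is a simplified model of section 2.4's
sched_window_stats_policy; the kernel's actual policy set may differ.

```python
def compute_demand(hist, policy):
    """hist[0] is the most recent completed window, as in the 'hist:'
    field of sched_update_history above."""
    if policy == 'recent':
        return hist[0]
    if policy == 'max':
        return max(hist)
    if policy == 'average':
        return sum(hist) // len(hist)
    raise ValueError('unknown policy: %s' % policy)
```

For the trace sample above, the most recent window is also the largest, so
the 'recent' and 'max' policies both report demand=13364423.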

*** 8.6 sched_reset_all_windows_stats

Logged when key parameters controlling window-based statistics collection are
changed. This event signifies that all window-based statistics for tasks and
cpus are being reset. Changes to the attributes below result in such a reset:

* sched_ravg_window (See Sec 2)
* sched_window_stats_policy (See Sec 2.4)
* sched_ravg_hist_size (See Sec 7.11)

<task>-0 [004] d.h4 12700.711489: sched_reset_all_windows_stats: time_taken 1123 window_start 0 window_size 0 reason POLICY_CHANGE old_val 0 new_val 1

- time_taken: time taken for the reset function to complete (ns)
- window_start: beginning of the first window following the change to window
  size (ns)
- window_size: size of window. Non-zero if the window size is changing (in
  ticks)
- reason: reason for the reset of statistics
- old_val: old value of the variable whose change is triggering the reset
- new_val: new value of the variable whose change is triggering the reset

*** 8.7 sched_migration_update_sum

Logged when a task is migrating to another cpu.

<task>-0 [000] d..8 5020.404137: sched_migration_update_sum: cpu 0: cs 471278 ps 902463 nt_cs 0 nt_ps 0 pid 2645

- cpu: the cpu the task is migrating away from or to
- cs: curr_runnable_sum of cpu (ns). See Sec 6.1 for more details of this
  counter.
- ps: prev_runnable_sum of cpu (ns). See Sec 6.1 for more details of this
  counter.
- nt_cs: nt_curr_runnable_sum of cpu (ns). See Sec 6.1 for more details of
  this counter.
- nt_ps: nt_prev_runnable_sum of cpu (ns). See Sec 6.1 for more details of
  this counter.
- pid: PID of the migrating task

*** 8.8 sched_get_busy

Logged when the scheduler is returning busy time statistics for a cpu.

<...>-4331 [003] d.s3 313.700108: sched_get_busy: cpu 3 load 19076 new_task_load 0 early 0

- cpu: cpu for which the busy time statistic (prev_runnable_sum) is being
  returned (ns)
- load: corresponds to prev_runnable_sum (ns), scaled to fmax of cpu
- new_task_load: corresponds to nt_prev_runnable_sum (ns), scaled to fmax of
  cpu
- early: a flag indicating whether the scheduler is passing regular load or
  early detection load
  0 - regular load
  1 - early detection load

*** 8.9 sched_freq_alert

Logged when the scheduler is alerting the cpufreq governor about the need to
change frequency.

<task>-0 [004] d.h4 12700.711489: sched_freq_alert: cpu 0 old_load=XXX new_load=YYY

- cpu: cpu in the cluster that has the highest load (prev_runnable_sum)
- old_load: cpu busy time last reported to the governor. This is load scaled
  in reference to max_possible_freq and max_possible_efficiency.
- new_load: recent cpu busy time. This is load scaled in reference to
  max_possible_freq and max_possible_efficiency.

*** 8.10 sched_set_boost

Logged when boost settings are being changed.

<task>-0 [004] d.h4 12700.711489: sched_set_boost: ref_count=1

- ref_count: a non-zero value indicates boost is in effect

========================
9. Device Tree bindings
========================

The device tree bindings for the HMP scheduler are defined in
Documentation/devicetree/bindings/sched/sched_hmp.txt