.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

::

 Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).

CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
|struct cpufreq_policy| is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same |struct cpufreq_policy| object.

``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.
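
The policy creation steps described above can be illustrated with a small
user space model.  This is only a sketch and not kernel code: the ``Policy``
and ``Core`` classes, the ``example_init()`` function and the pairwise CPU
grouping are all hypothetical and exist purely for illustration.

```python
class Policy:
    """Toy stand-in for a policy object (not the kernel structure)."""

    def __init__(self):
        self.related_cpus = set()  # online and offline CPUs sharing the interface
        self.min_khz = None
        self.max_khz = None


class Core:
    """Toy model of the per-CPU policy pointers kept by the core."""

    def __init__(self, driver_init):
        self.driver_init = driver_init  # plays the role of the driver's ->init()
        self.policy_of = {}             # CPU number -> Policy object

    def add_cpu(self, cpu):
        if cpu in self.policy_of:       # pointer already set: skip creation
            return self.policy_of[cpu]
        policy = Policy()
        self.driver_init(policy, cpu)   # driver fills in limits and the CPU mask
        for c in policy.related_cpus:   # core populates pointers from the mask
            self.policy_of[c] = policy
        return policy


def example_init(policy, cpu):
    # Hypothetical topology: CPUs share the interface in pairs (0-1, 2-3, ...).
    base = cpu - cpu % 2
    policy.related_cpus = {base, base + 1}
    policy.min_khz, policy.max_khz = 800_000, 3_000_000
```

In this model, calling ``add_cpu(0)`` and then ``add_cpu(1)`` returns the same
object, because the second call finds the pointer already populated from the
mask set during the first one.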

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``).  First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it is only
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
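
As an illustration of the layout described above, the following sketch walks
the ``cpuY`` directories and groups CPU numbers by the ``policyX`` directory
their ``cpufreq`` link points to.  The function name ``cpus_by_policy()`` and
its ``cpu_root`` parameter are made up for this example; the parameter only
exists so that the code can be pointed at a copy of the directory tree.

```python
import os


def cpus_by_policy(cpu_root="/sys/devices/system/cpu"):
    """Map each policyX directory name to the sorted CPU numbers linking to it."""
    result = {}
    for entry in os.listdir(cpu_root):
        # Only the cpuY directories are of interest here (cpu0, cpu1, ...).
        if not (entry.startswith("cpu") and entry[3:].isdigit()):
            continue
        link = os.path.join(cpu_root, entry, "cpufreq")
        if not os.path.islink(link):
            continue  # no policy is associated with this CPU at the moment
        policy = os.path.basename(os.path.realpath(link))
        result.setdefault(policy, []).append(int(entry[3:]))
    return {name: sorted(cpus) for name, cpus in result.items()}
```

On a system where each CPU has its own policy this would be expected to
return one single-element list per policy, whereas CPUs sharing a hardware
interface would show up together under one ``policyX`` key.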

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
        List of online CPUs belonging to this policy (i.e. sharing the hardware
        performance scaling interface represented by the ``policyX`` policy
        object).

``bios_limit``
        If the platform firmware (BIOS) tells the OS to apply an upper limit to
        CPU frequencies, that limit will be reported through this attribute (if
        present).

        The existence of the limit may be a result of some (often unintentional)
        BIOS settings, restrictions coming from a service processor or other
        BIOS/HW-based mechanisms.

        This does not cover ACPI thermal limitations which can be discovered
        through a generic thermal driver.

        This attribute is not present if the scaling driver in use does not
        support it.

``cpuinfo_max_freq``
        Maximum possible operating frequency the CPUs belonging to this policy
        can run at (in kHz).

``cpuinfo_min_freq``
        Minimum possible operating frequency the CPUs belonging to this policy
        can run at (in kHz).

``cpuinfo_transition_latency``
        The time it takes to switch the CPUs belonging to this policy from one
        P-state to another, in nanoseconds.

        If unknown or if known to be so high that the scaling driver does not
        work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
        will be returned by reads from this attribute.

``related_cpus``
        List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
        List of ``CPUFreq`` scaling governors present in the kernel that can
        be attached to this policy or (if the |intel_pstate| scaling driver is
        in use) list of scaling algorithms provided by the driver that can be
        applied to this policy.

        [Note that some governors are modular and it may be necessary to load a
        kernel module for the governor held by it to become available and be
        listed by this attribute.]

``scaling_cur_freq``
        Current frequency of all of the CPUs belonging to this policy (in kHz).

        For the majority of scaling drivers, this is the frequency of the last
        P-state requested by the driver from the hardware using the scaling
        interface provided by it, which may or may not reflect the frequency
        the CPU is actually running at (due to hardware design and other
        limitations).

        Some scaling drivers (e.g. |intel_pstate|) attempt to provide
        information more precisely reflecting the current CPU frequency through
        this attribute, but that still may not be the exact current CPU
        frequency as seen by the hardware at the moment.

``scaling_driver``
        The scaling driver currently in use.

``scaling_governor``
        The scaling governor currently attached to this policy or (if the
        |intel_pstate| scaling driver is in use) the scaling algorithm
        provided by the driver that is currently applied to this policy.

        This attribute is read-write and writing to it will cause a new scaling
        governor to be attached to this policy or a new scaling algorithm
        provided by the scaling driver to be applied to it (in the
        |intel_pstate| case), as indicated by the string written to this
        attribute (which must be one of the names listed by the
        ``scaling_available_governors`` attribute described above).

``scaling_max_freq``
        Maximum frequency the CPUs belonging to this policy are allowed to be
        running at (in kHz).

        This attribute is read-write and writing a string representing an
        integer to it will cause a new limit to be set (it must not be lower
        than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
        Minimum frequency the CPUs belonging to this policy are allowed to be
        running at (in kHz).

        This attribute is read-write and writing a string representing a
        non-negative integer to it will cause a new limit to be set (it must not
        be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
        This attribute is functional only if the `userspace`_ scaling governor
        is attached to the given policy.

        It returns the last frequency requested by the governor (in kHz) or can
        be written to in order to set a new frequency for the policy.
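
A user space tool reading the frequency attributes listed above might look
roughly like the sketch below.  The helper names ``read_khz()`` and
``valid_limits()`` are hypothetical; the first one returns ``None`` for an
attribute that is not present (as may be the case for ``bios_limit``) and the
second one only checks the ordering rule stated for ``scaling_min_freq`` and
``scaling_max_freq``.

```python
import os


def read_khz(policy_dir, name):
    """Return an integer attribute value (in kHz), or None if it is absent."""
    try:
        with open(os.path.join(policy_dir, name)) as attr:
            return int(attr.read().strip())
    except FileNotFoundError:
        return None  # e.g. bios_limit when the scaling driver lacks support


def valid_limits(new_min_khz, new_max_khz):
    """Check that the minimum is non-negative and not above the maximum."""
    return 0 <= new_min_khz <= new_max_khz
```

For example, ``read_khz("/sys/devices/system/cpu/cpufreq/policy0",
"cpuinfo_max_freq")`` would return the hardware maximum for ``policy0`` on a
system where that policy exists.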


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
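
The two possible locations of governor tunables can be expressed as a small
path helper.  This is a sketch only; ``tunables_dir()`` is a made-up name and
whether the tunables are per-policy has to be known by the caller, since it
depends on the scaling driver in use.

```python
import posixpath

CPUFREQ_DIR = "/sys/devices/system/cpu/cpufreq"


def tunables_dir(governor, policy=None):
    """Return the directory expected to hold the given governor's tunables.

    Pass a policy directory name (e.g. "policy0") for per-policy tunables,
    or leave it as None for global (system-wide) ones.
    """
    if policy is not None:
        return posixpath.join(CPUFREQ_DIR, policy, governor)
    return posixpath.join(CPUFREQ_DIR, governor)
```

With this helper, ``tunables_dir("ondemand", "policy0")`` yields
``/sys/devices/system/cpu/cpufreq/policy0/ondemand`` and
``tunables_dir("ondemand")`` yields
``/sys/devices/system/cpu/cpufreq/ondemand``.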

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.
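
Both of the governors above select a fixed target that only depends on the
current policy limits, which the following sketch makes explicit.  The
function ``static_governor_target()`` is hypothetical and simply takes the
two limits as arguments (in kHz).

```python
def static_governor_target(governor, scaling_min_khz, scaling_max_khz):
    """Return the frequency requested by a limit-driven governor."""
    if governor == "performance":
        return scaling_max_khz  # highest frequency within the policy limit
    if governor == "powersave":
        return scaling_min_khz  # lowest frequency within the policy limit
    raise ValueError("not a static governor: %s" % governor)
```

Since the result is a function of the limits alone, it only needs to be
recomputed when one of the limits changes, which matches the description of
when the request is made.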

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
LWN.net article for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

        f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
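
The formula above can be evaluated with integer arithmetic as in the sketch
below.  The function name is made up, the value of 1024 used for ``max`` in
the example is an assumption, and no clamping to the policy limits is
modeled here.

```python
def schedutil_next_freq(util, max_util, f0_khz):
    """Evaluate f = 1.25 * f_0 * util / max without floating point."""
    return 5 * f0_khz * util // (4 * max_util)
```

For instance, with ``f_0`` equal to 2000000 kHz and ``util`` equal to half of
an assumed ``max`` of 1024, ``schedutil_next_freq(512, 1024, 2_000_000)``
gives 1250000 kHz.  Because of the 1.25 factor, the computed value reaches
``f_0`` as soon as ``util`` is 80% of ``max``.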

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
        Minimum time (in microseconds) that has to pass between two consecutive
        runs of governor computations (default: 1000 times the scaling driver's
        transition latency).

        The purpose of this tunable is to reduce the scheduler context overhead
        of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
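
The load estimate described above amounts to the following computation.  The
sketch assumes that the idle time and the elapsed time of each CPU over the
last sampling interval are already known (in microseconds); the function
names are hypothetical.

```python
def cpu_load(idle_us, elapsed_us):
    """Fraction of the interval in which the CPU was not idle (0.0 to 1.0)."""
    if elapsed_us <= 0:
        return 0.0
    return (elapsed_us - idle_us) / elapsed_us


def policy_load(samples):
    """Greatest per-CPU load among (idle_us, elapsed_us) pairs of a policy."""
    return max(cpu_load(idle, elapsed) for idle, elapsed in samples)
```

For a policy with two CPUs that were idle for 8000 and 2500 microseconds out
of a 10000 microsecond interval, the per-CPU estimates are 0.2 and 0.75, so
0.75 is used for the entire policy.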

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
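
The frequency selection rule above corresponds to a linear interpolation
between ``cpuinfo_min_freq`` and ``cpuinfo_max_freq`` with a shortcut above
the speedup threshold.  The sketch below uses a hypothetical function name,
takes the load and the threshold in percent and all frequencies in kHz, and
leaves out any further adjustment to the policy limits.

```python
def ondemand_target(load_pct, up_threshold_pct, cpuinfo_min_khz,
                    cpuinfo_max_khz, scaling_max_khz):
    """Select a frequency for the given load estimate (in percent)."""
    if load_pct > up_threshold_pct:
        return scaling_max_khz  # go straight for the highest allowed frequency
    # Load 0 maps to cpuinfo_min_freq and load 100 to cpuinfo_max_freq.
    span = cpuinfo_max_khz - cpuinfo_min_khz
    return cpuinfo_min_khz + load_pct * span // 100
```

With a hardware range of 1000000 to 3000000 kHz and a threshold of 80, a load
of 50 selects 2000000 kHz, while a load of 90 selects whatever
``scaling_max_freq`` currently is.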
| 455 | |
| 456 | This governor exposes the following tunables: |
| 457 | |
| 458 | ``sampling_rate`` |
| 459 | This is how often the governor's worker routine should run, in |
| 460 | microseconds. |
| 461 | |
| 462 | Typically, it is set to values of the order of 10000 (10 ms). Its |
| 463 | default value is equal to the value of ``cpuinfo_transition_latency`` |
| 464 | for each policy this governor is attached to (but since the unit here |
| 465 | is greater by 1000, this means that the time represented by |
| 466 | ``sampling_rate`` is 1000 times greater than the transition latency by |
| 467 | default). |
| 468 | |
| 469 | If this tunable is per-policy, the following shell command sets the time |
| 470 | represented by it to be 750 times as high as the transition latency:: |
| 471 | |
| 472 | # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate |
| 473 | |
| 474 | |
| 475 | ``min_sampling_rate`` |
| 476 | The minimum value of ``sampling_rate``. |
| 477 | |
| 478 | Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and |
| 479 | :c:data:`tick_nohz_active` are both set or to 20 times the value of |
| 480 | :c:data:`jiffies` in microseconds otherwise. |
| 481 | |
| 482 | ``up_threshold`` |
| 483 | If the estimated CPU load is above this value (in percent), the governor |
| 484 | will set the frequency to the maximum value allowed for the policy. |
| 485 | Otherwise, the selected frequency will be proportional to the estimated |
| 486 | CPU load. |
| 487 | |
| 488 | ``ignore_nice_load`` |
| 489 | If set to 1 (default 0), it will cause the CPU load estimation code to |
| 490 | treat the CPU time spent on executing tasks with "nice" levels greater |
| 491 | than 0 as CPU idle time. |
| 492 | |
| 493 | This may be useful if there are tasks in the system that should not be |
| 494 | taken into account when deciding what frequency to run the CPUs at. |
| 495 | Then, to make that happen it is sufficient to increase the "nice" level |
| 496 | of those tasks above 0 and set this attribute to 1. |
| 497 | |
| 498 | ``sampling_down_factor`` |
| 499 | Temporary multiplier, between 1 (default) and 100 inclusive, to apply to |
| 500 | the ``sampling_rate`` value if the CPU load goes above ``up_threshold``. |
| 501 | |
| 502 | This causes the next execution of the governor's worker routine (after |
| 503 | setting the frequency to the allowed maximum) to be delayed, so the |
| 504 | frequency stays at the maximum level for a longer time. |
| 505 | |
| 506 | Frequency fluctuations in some bursty workloads may be avoided this way |
| 507 | at the cost of additional energy spent on maintaining the maximum CPU |
| 508 | capacity. |
| 509 | |
| 510 | ``powersave_bias`` |
| 511 | Reduction factor to apply to the original frequency target of the |
| 512 | governor (including the maximum value used when the ``up_threshold`` |
| 513 | value is exceeded by the estimated CPU load) or sensitivity threshold |
| 514 | for the AMD frequency sensitivity powersave bias driver |
| 515 | (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000 |
| 516 | inclusive. |
| 517 | |
| 518 | If the AMD frequency sensitivity powersave bias driver is not loaded, |
| 519 | the effective frequency to apply is given by |
| 520 | |
| 521 | f * (1 - ``powersave_bias`` / 1000) |
| 522 | |
| 523 | where f is the governor's original frequency target. The default value |
| 524 | of this attribute is 0 in that case. |
| 525 | |
| 526 | If the AMD frequency sensitivity powersave bias driver is loaded, the |
| 527 | value of this attribute is 400 by default and it is used in a different |
| 528 | way. |
| 529 | |
| 530 | On Family 16h (and later) AMD processors there is a mechanism to get a |
| 531 | measured workload sensitivity, between 0 and 100% inclusive, from the |
| 532 | hardware. That value can be used to estimate how the performance of the |
| 533 | workload running on a CPU will change in response to frequency changes. |
| 534 | |
| 535 | The performance of a workload with the sensitivity of 0 (memory-bound or |
| 536 | IO-bound) is not expected to increase at all as a result of increasing |
| 537 | the CPU frequency, whereas workloads with the sensitivity of 100% |
| 538 | (CPU-bound) are expected to perform much better if the CPU frequency is |
| 539 | increased. |
| 540 | |
| 541 | If the workload sensitivity is less than the threshold represented by |
| 542 | the ``powersave_bias`` value, the sensitivity powersave bias driver |
| 543 | will cause the governor to select a frequency lower than its original |
| 544 | target, so as to avoid over-provisioning workloads that will not benefit |
| 545 | from running at higher CPU frequencies. |
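
The ``powersave_bias`` formula given above for the case in which the AMD
frequency sensitivity powersave bias driver is not loaded can be illustrated
with a short calculation. This is a minimal sketch of the arithmetic only: the
function name ``biased_target`` and the use of kHz are assumptions made for
illustration, and the sketch does not model how the resulting value is mapped
to a frequency actually supported by the hardware.

```python
def biased_target(freq_khz: int, powersave_bias: int) -> int:
    """Apply f * (1 - powersave_bias / 1000) to a frequency target.

    freq_khz is the governor's original frequency target (the unit is an
    assumption of this sketch) and powersave_bias is the tunable value,
    between 0 and 1000 inclusive.
    """
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000 inclusive")
    # Integer arithmetic avoids rounding surprises: 1 - bias/1000 is
    # rewritten as (1000 - bias) / 1000.
    return freq_khz * (1000 - powersave_bias) // 1000


if __name__ == "__main__":
    for bias in (0, 100, 400, 1000):
        print(bias, biased_target(2_000_000, bias))
```

With the default value of 0 the original target is left unchanged, and a value
of 100 reduces it by 10 percent.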
| 546 | |
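The effect of the ``ondemand`` governor's ``sampling_down_factor`` tunable
described above can be sketched in the same spirit. The function name
``next_sample_delay_us``, the microsecond unit and the example threshold
values are assumptions made for illustration. The sketch only shows how the
multiplier stretches the delay before the next run of the worker routine and
is not the in-kernel implementation.

```python
def next_sample_delay_us(sampling_rate_us: int,
                         sampling_down_factor: int,
                         load: int,
                         up_threshold: int) -> int:
    """Return the delay before the governor's next worker run.

    The multiplier is applied only while the estimated load (in percent)
    is above up_threshold, which is when the frequency has been set to
    the allowed maximum.
    """
    if not 1 <= sampling_down_factor <= 100:
        raise ValueError("sampling_down_factor must be between 1 and 100")
    if load > up_threshold:
        # Stay at the maximum frequency longer by delaying re-evaluation.
        return sampling_rate_us * sampling_down_factor
    return sampling_rate_us


if __name__ == "__main__":
    # Example values only: 10 ms sampling rate, threshold of 80 percent.
    print(next_sample_delay_us(10_000, 10, load=95, up_threshold=80))
    print(next_sample_delay_us(10_000, 10, load=50, up_threshold=80))
```

With the default factor of 1 the delay is always equal to ``sampling_rate``,
so the tunable has no effect.
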
| 547 | ``conservative`` |
| 548 | ---------------- |
| 549 | |
| 550 | This governor uses CPU load as a CPU frequency selection metric. |
| 551 | |
| 552 | It estimates the CPU load in the same way as the `ondemand`_ governor described |
| 553 | above, but the CPU frequency selection algorithm implemented by it is different. |
| 554 | |
| 555 | Namely, it avoids changing the frequency significantly over short time intervals, |
| 556 | as such changes may not be suitable for systems with limited power supply capacity |
| 557 | (e.g. battery-powered ones). To achieve that, it changes the frequency in relatively |
| 558 | small steps, one step at a time, up or down, depending on whether or not a |
| 559 | (configurable) threshold has been exceeded by the estimated CPU load. |
| 560 | |
| 561 | This governor exposes the following tunables: |
| 562 | |
| 563 | ``freq_step`` |
| 564 | Frequency step in percent of the maximum frequency the governor is |
| 565 | allowed to set (the ``scaling_max_freq`` policy limit), between 0 and |
| 566 | 100 (5 by default). |
| 567 | |
| 568 | This is how much the frequency is allowed to change in one go. Setting |
| 569 | it to 0 will cause the default frequency step (5 percent) to be used |
| 570 | and setting it to 100 effectively causes the governor to periodically |
| 571 | switch the frequency between the ``scaling_min_freq`` and |
| 572 | ``scaling_max_freq`` policy limits. |
| 573 | |
| 574 | ``down_threshold`` |
| 575 | Threshold value (in percent, 20 by default) used to determine the |
| 576 | frequency change direction. |
| 577 | |
| 578 | If the estimated CPU load is greater than this value, the frequency will |
| 579 | go up (by ``freq_step``). If the load is less than this value (and the |
| 580 | ``sampling_down_factor`` mechanism is not in effect), the frequency will |
| 581 | go down. Otherwise, the frequency will not be changed. |
| 582 | |
| 583 | ``sampling_down_factor`` |
| 584 | Frequency decrease deferral factor, between 1 (default) and 10 |
| 585 | inclusive. |
| 586 | |
| 587 | It effectively causes the frequency to go down ``sampling_down_factor`` |
| 588 | times slower than it ramps up. |
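
The way ``freq_step`` and ``down_threshold`` interact can be sketched as
follows. This is an illustration under stated assumptions, not the in-kernel
code: the function name ``conservative_next_freq`` and the kHz unit are
hypothetical, and the optional ``up_threshold`` parameter is an assumption of
this sketch that allows a separate threshold for increases (when it is left
unset, a single threshold is used for both directions, as in the description
above). The ``sampling_down_factor`` deferral is not modeled.

```python
from typing import Optional


def conservative_next_freq(cur_khz: int, load: int,
                           min_khz: int, max_khz: int,
                           freq_step: int = 5,
                           down_threshold: int = 20,
                           up_threshold: Optional[int] = None) -> int:
    """Return the next frequency after one governor step.

    min_khz and max_khz stand for the scaling_min_freq and
    scaling_max_freq policy limits; load is the estimated CPU load
    in percent.
    """
    if not 0 <= freq_step <= 100:
        raise ValueError("freq_step must be between 0 and 100")
    if freq_step == 0:
        freq_step = 5  # 0 selects the default step of 5 percent
    if up_threshold is None:
        up_threshold = down_threshold  # single-threshold behavior

    # The step is a percentage of the maximum allowed frequency.
    step_khz = max_khz * freq_step // 100

    if load > up_threshold:
        target = cur_khz + step_khz
    elif load < down_threshold:
        target = cur_khz - step_khz
    else:
        target = cur_khz

    # Never leave the range given by the policy limits.
    return max(min_khz, min(max_khz, target))


if __name__ == "__main__":
    print(conservative_next_freq(1_000_000, 90, 800_000, 2_000_000))
    print(conservative_next_freq(1_000_000, 10, 800_000, 2_000_000))
```

With ``freq_step`` set to 100, a single step in either direction always lands
on one of the policy limits, which matches the switching behavior described
for that setting.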
| 589 | |
| 590 | |
| 591 | Frequency Boost Support |
| 592 | ======================= |
| 593 | |
| 594 | Background |
| 595 | ---------- |
| 596 | |
| 597 | Some processors support a mechanism to raise the operating frequency of some |
| 598 | cores in a multicore package temporarily (and above the sustainable frequency |
| 599 | threshold for the whole package) under certain conditions, for example if the |
| 600 | whole chip is not fully utilized and below its intended thermal or power budget. |
| 601 | |
| 602 | Different names are used by different vendors to refer to this functionality. |
| 603 | For Intel processors it is referred to as "Turbo Boost", AMD calls it |
| 604 | "Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on. |
| 605 | As a rule, it is also implemented differently by different vendors. The simple |
| 606 | term "frequency boost" is used here for brevity to refer to all of those |
| 607 | implementations. |
| 608 | |
| 609 | The frequency boost mechanism may be either hardware-based or software-based. |
| 610 | If it is hardware-based (e.g. on x86), the decision to trigger the boosting is |
| 611 | made by the hardware (although in general it requires the hardware to be put |
| 612 | into a special state in which it can control the CPU frequency within certain |
| 613 | limits). If it is software-based (e.g. on ARM), the scaling driver decides |
| 614 | whether or not to trigger boosting and when to do that. |
| 615 | |
| 616 | The ``boost`` File in ``sysfs`` |
| 617 | ------------------------------- |
| 618 | |
| 619 | This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls |
| 620 | the "boost" setting for the whole system. It is not present if the underlying |
| 621 | scaling driver does not support the frequency boost mechanism (or supports it, |
| 622 | but provides a driver-specific interface for controlling it, like |
| 623 | |intel_pstate|). |
| 624 | |
| 625 | If the value in this file is 1, the frequency boost mechanism is enabled. This |
| 626 | means that either the hardware can be put into states in which it is able to |
| 627 | trigger boosting (in the hardware-based case), or the software is allowed to |
| 628 | trigger boosting (in the software-based case). It does not mean that boosting |
| 629 | is actually in use at the moment on any CPUs in the system. It only grants |
| 630 | permission to use the frequency boost mechanism (which still may never be used |
| 631 | for other reasons). |
| 632 | |
| 633 | If the value in this file is 0, the frequency boost mechanism is disabled and |
| 634 | cannot be used at all. |
| 635 | |
| 636 | The only values that can be written to this file are 0 and 1. |
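
Reading and writing the ``boost`` file can be sketched as shown below. The
helper names ``read_boost`` and ``write_boost`` are hypothetical and the path
is the one given above. Writing to the file generally requires sufficient
privileges, which this sketch does not check.

```python
from pathlib import Path
from typing import Optional

# Location of the global knob as given in the text.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def read_boost(path: Path = BOOST_PATH) -> Optional[bool]:
    """Return True/False for 1/0, or None if the file is not present.

    The file is absent when the scaling driver does not support frequency
    boost or exposes a driver-specific interface instead.
    """
    try:
        text = path.read_text().strip()
    except FileNotFoundError:
        return None
    return text == "1"


def write_boost(enabled: bool, path: Path = BOOST_PATH) -> None:
    """Write 1 or 0, the only values the file accepts."""
    path.write_text("1" if enabled else "0")


if __name__ == "__main__":
    state = read_boost()
    if state is None:
        print("no global boost knob on this system")
    else:
        print("boost permitted" if state else "boost disabled")
```

A value of ``True`` from ``read_boost`` only indicates that boosting is
permitted, not that any CPU is currently running at a boosted frequency.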
| 637 | |
| 638 | Rationale for Boost Control Knob |
| 639 | -------------------------------- |
| 640 | |
| 641 | The frequency boost mechanism is generally intended to help to achieve optimum |
| 642 | CPU performance on time scales below software resolution (e.g. below the |
| 643 | scheduler tick interval) and it is demonstrably suitable for many workloads, but |
| 644 | it may lead to problems in certain situations. |
| 645 | |
| 646 | For this reason, many systems make it possible to disable the frequency boost |
| 647 | mechanism in the platform firmware (BIOS) setup, but that requires the system to |
| 648 | be restarted for the setting to be adjusted as desired, which may not be |
| 649 | practical at least in some cases. For example: |
| 650 | |
| 651 | 1. Boosting means overclocking the processor, although under controlled |
| 652 | conditions. Generally, the processor's energy consumption increases |
| 653 | as a result of increasing its frequency and voltage, even temporarily. |
| 654 | That may not be desirable on systems that switch to power sources of |
| 655 | limited capacity, such as batteries, so the ability to disable the boost |
| 656 | mechanism while the system is running may help there (but that depends on |
| 657 | the workload too). |
| 658 | |
| 659 | 2. In some situations deterministic behavior is more important than |
| 660 | performance or energy consumption (or both) and the ability to disable |
| 661 | boosting while the system is running may be useful then. |
| 662 | |
| 663 | 3. To examine the impact of the frequency boost mechanism itself, it is useful |
| 664 | to be able to run tests with and without boosting, preferably without |
| 665 | restarting the system in the meantime. |
| 666 | |
| 667 | 4. Reproducible results are important when running benchmarks. Since |
| 668 | the boosting functionality depends on the load of the whole package, |
| 669 | single-thread performance may vary because of it, which may sometimes lead to |
| 670 | unreproducible results. That can be avoided by disabling the |
| 671 | frequency boost mechanism before running benchmarks sensitive to that |
| 672 | issue. |
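
For the testing and benchmarking cases listed above, boosting can be switched
off for the duration of a run and the previous setting restored afterwards.
The following is a minimal sketch, assuming the global ``boost`` file is
present and writable. The context manager name ``boost_disabled`` is
hypothetical.

```python
from contextlib import contextmanager
from pathlib import Path

BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


@contextmanager
def boost_disabled(path: Path = BOOST_PATH):
    """Disable frequency boost inside the with-block, then restore it."""
    previous = path.read_text().strip()  # remember the current setting
    path.write_text("0")
    try:
        yield
    finally:
        # Restore the saved value even if the benchmark raised an exception.
        path.write_text(previous)


if __name__ == "__main__":
    with boost_disabled():
        pass  # run the boost-sensitive benchmark here
```

Restoring the saved value rather than unconditionally writing 1 keeps the
system in whatever state the administrator had selected before the run.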
| 673 | |
| 674 | Legacy AMD ``cpb`` Knob |
| 675 | ----------------------- |
| 676 | |
| 677 | The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to |
| 678 | the global ``boost`` one. It is used for disabling/enabling the "Core |
| 679 | Performance Boost" feature of some AMD processors. |
| 680 | |
| 681 | If present, that knob is located in every ``CPUFreq`` policy directory in |
| 682 | ``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called |
| 683 | ``cpb``, which suggests a more fine-grained control interface. The actual |
| 684 | implementation, however, works on a system-wide basis and setting that knob |
| 685 | for one policy causes the same value of it to be set for all of the other |
| 686 | policies at the same time. |
| 687 | |
| 688 | That knob is still supported on AMD processors that support its underlying |
| 689 | hardware feature, but it may be configured out of the kernel (via the |
| 690 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global |
| 691 | ``boost`` knob is present regardless. Thus it is always possible to use the |
| 692 | ``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that |
| 693 | is more consistent with what all of the other systems do (and the ``cpb`` knob |
| 694 | may not be supported any more in the future). |
| 695 | |
| 696 | The ``cpb`` knob is never present for any processors without the underlying |
| 697 | hardware feature (e.g. all Intel ones), even if the |
| 698 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. |
| 699 | |
| 700 | |
| 701 | .. _Per-entity load tracking: https://lwn.net/Articles/531853/ |