Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing more details about the Intel P-State driver. Based on what callbacks
a cpufreq driver provides to the cpufreq core, the core can support two types of
drivers:
- with target_index() callback: In this mode, the drivers using cpufreq core
simply provide the minimum and maximum frequency limits and an additional
interface target_index() to set the current frequency. The cpufreq subsystem
has a number of scaling governors ("performance", "powersave", "ondemand",
etc.). Depending on which governor is in use, cpufreq core will call for
transitions to a specific frequency using the target_index() callback.
- with setpolicy() callback: In this mode, drivers do not provide the
target_index() callback, so cpufreq core can't request a transition to a
specific frequency. The driver provides minimum and maximum frequency limits
and callbacks to set a policy. The policy in cpufreq sysfs is referred to as
the "scaling governor". The cpufreq core can request the driver to operate in
either of two policies: "performance" and "powersave". The driver decides which
frequency to use based on the above policy selection considering minimum and
maximum frequency limits.

The Intel P-State driver falls under the latter category, which implements the
setpolicy() callback. This driver decides what P-State to use based on the
requested policy from the cpufreq core. If the processor is capable of
selecting its next P-State internally, then the driver will offload this
responsibility to the processor (aka HWP: Hardware P-States). If not, the
driver implements algorithms to select the next P-State.

Since these policies are implemented in the driver, they are not the same as the
cpufreq scaling governors implementation, even if they have the same name in
the cpufreq sysfs (scaling_governor). For example, the "performance" policy is
similar to cpufreq’s "performance" governor, but "powersave" is completely
different from the cpufreq "powersave" governor. The strategy here is similar
to cpufreq "ondemand", where the requested P-State is related to the system load.
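To check whether a system is actually running under this driver, the standard
cpufreq sysfs attributes can be read. The following is a minimal sketch, assuming
the usual per-CPU cpufreq directory for CPU 0; the function name
show_cpufreq_mode is made up for illustration.

```shell
# Print the cpufreq scaling driver and the policies it offers for one CPU.
# $1: cpufreq directory (optional; the default path is an assumption
#     about where cpufreq sysfs lives for CPU 0).
show_cpufreq_mode() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    printf 'driver: %s\n' "$(cat "$dir/scaling_driver")"
    printf 'policies: %s\n' "$(cat "$dir/scaling_available_governors")"
}
```

With the Intel P-State driver active, the driver line is expected to read
intel_pstate and only "performance" and "powersave" should be listed.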

Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system).

      max_perf_pct: Limits the maximum P-State that will be requested by
      the driver. It states it as a percentage of the available performance. The
      available (P-State) performance may be reduced by the no_turbo
      setting described below.

      min_perf_pct: Limits the minimum P-State that will be requested by
      the driver. It states it as a percentage of the max (non-turbo)
      performance level.

      no_turbo: Limits the driver to selecting P-States below the turbo
      frequency range.

      turbo_pct: Displays the percentage of the total performance that
      is supported by hardware that is in the turbo range. This number
      is independent of whether turbo has been disabled or not.

      num_pstates: Displays the number of P-States that are supported
      by hardware. This number is independent of whether turbo has
      been disabled or not.
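The attributes described above can be inspected in one pass. This is only a
sketch: the helper name show_pstate_attrs is hypothetical, and the directory
argument exists so the function can be pointed at a copy of the files.

```shell
# Print each intel_pstate sysfs attribute described above as "name: value".
# $1: directory holding the files (optional, defaults to the path given above).
show_pstate_attrs() {
    dir=${1:-/sys/devices/system/cpu/intel_pstate}
    for attr in max_perf_pct min_perf_pct no_turbo turbo_pct num_pstates; do
        # Skip attributes that this kernel does not expose.
        if [ -r "$dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dir/$attr")"
        fi
    done
}
```

A limit is changed by writing to the corresponding file as root, for example
by echoing 80 into max_perf_pct.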

For example, if a system has these parameters:
        Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
        Max non turbo ratio: 0x17
        Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show :
        max_perf_pct:100, which corresponds to 1 core ratio
        min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
        no_turbo:0, turbo is not disabled
        num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
        turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates
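The example values can be re-derived with integer arithmetic. The sketch below
is illustrative only (pstate_example is a made-up name). Note that the plain
quotient (0x21 - 0x17) / 26 is about 38.5%, so the sketch assumes that
turbo_pct is obtained as 100 minus the truncated non-turbo share, which
matches the 39 shown above.

```shell
# Reproduce the example values from the three hardware ratios.
# Arguments: max 1 core turbo ratio, max non-turbo ratio, minimum ratio.
pstate_example() {
    turbo=$(( $1 )); max=$(( $2 )); min=$(( $3 ))
    num=$(( turbo - min + 1 ))
    # Integer division truncates; the sysfs values are whole numbers.
    min_pct=$(( min * 100 / turbo ))
    # Assumption: the turbo share is 100 minus the truncated non-turbo share.
    non_turbo_pct=$(( (max - min + 1) * 100 / num ))
    printf 'min_perf_pct:%d num_pstates:%d turbo_pct:%d\n' \
        "$min_pct" "$num" "$(( 100 - non_turbo_pct ))"
}

pstate_example 0x21 0x17 0x08
# prints: min_perf_pct:24 num_pstates:26 turbo_pct:39
```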

Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

scaling_cur_freq: This displays the real frequency which was used during
the last sample period instead of what is requested. Some other cpufreq drivers,
like acpi-cpufreq, display what is requested (some changes are on the
way to fix this for the acpi-cpufreq driver). The same is true for frequencies
displayed at /proc/cpuinfo.

scaling_governor: This displays the currently active policy. Since each CPU has
a cpufreq sysfs, it is possible to set a scaling governor for each CPU. But this
is not possible with Intel P-States, as there is one common policy for all
CPUs. Here, the last requested policy will be applicable to all CPUs. It is
suggested that one use the cpupower utility to change the policy for all CPUs
at the same time.
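Because the last requested policy wins, it is simplest to request the same
policy on every CPU in one go. With cpupower this is
"cpupower frequency-set -g powersave". The sketch below does the equivalent by
hand, assuming the usual /sys/devices/system/cpu/cpuN/cpufreq layout;
set_policy_all is a hypothetical helper name, and writing to the real files
requires root.

```shell
# Write one policy name to scaling_governor of every CPU.
# $1: policy ("performance" or "powersave")
# $2: CPU sysfs root (optional; the default layout is an assumption)
set_policy_all() {
    root=${2:-/sys/devices/system/cpu}
    for gov in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # An unmatched glob stays literal, so check that the file is writable.
        [ -w "$gov" ] && echo "$1" > "$gov"
    done
    return 0
}
```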

scaling_setspeed: This attribute can never be used with Intel P-State.

scaling_max_freq/scaling_min_freq: This interface can be used similarly to
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since frequencies
are converted to the nearest possible P-State, this is prone to rounding errors.
This method is not preferred for limiting performance.
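The rounding effect can be illustrated with a small calculation. This is a
sketch under a stated assumption, not the driver's conversion code: it assumes
that one P-State (ratio) step corresponds to 100000 kHz, and khz_to_pstate is
a made-up name.

```shell
# Map a frequency limit in kHz to the nearest P-State (ratio).
# Assumption: one ratio step equals 100000 kHz (100 MHz).
khz_to_pstate() {
    echo $(( ($1 + 50000) / 100000 ))
}

khz_to_pstate 2450000   # prints 25, i.e. 2500000 kHz rather than 2450000
```

Under that assumption, a requested limit of 2450000 kHz ends up at P-State 25,
which is why the percentage-based files are the better choice.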

affected_cpus: Not used
related_cpus: Not used

For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels. The idea that frequency can be set to a single
frequency is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Tuning Intel P-State driver

When HWP mode is not used, debugfs files have also been added to allow the
tuning of the internal governor algorithm. These files are located at
/sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional
Integral Derivative) controller. The PID tunable parameters are:

      deadband
      d_gain_pct
      i_gain_pct
      p_gain_pct
      sample_rate_ms
      setpoint
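A tunable is changed by writing a number to its debugfs file. The helper below
is a sketch (set_pid_tunable is a hypothetical name) that only accepts the six
names listed above. It assumes debugfs is mounted at /sys/kernel/debug and
that the caller is root.

```shell
# Set one PID tunable, rejecting names that are not in the list above.
# $1: tunable name, $2: new value
# $3: debugfs directory (optional, defaults to the path given above)
set_pid_tunable() {
    dir=${3:-/sys/kernel/debug/pstate_snb}
    case $1 in
        deadband|d_gain_pct|i_gain_pct|p_gain_pct|sample_rate_ms|setpoint)
            echo "$2" > "$dir/$1" ;;
        *)
            echo "unknown tunable: $1" >&2
            return 1 ;;
    esac
}
```

For example, "set_pid_tunable setpoint 60" would request a setpoint of 60.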

To adjust these parameters, some understanding of the driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the power and performance
relationship. These tunables are only useful when the "powersave" policy is
active.

- To make the system more responsive to load changes, sample_rate_ms can
be lowered (current default is 10ms).
- To make the system use higher performance, even if the load is lower, setpoint
can be adjusted to a lower number. This will also lead to faster ramp up time
to reach the maximum P-State.
If there are no derivative and integral coefficients, the next P-State will be
equal to:
        current P-State - ((setpoint - current cpu load) * p_gain_pct)
where p_gain_pct is applied as a percentage (20 means a factor of 0.2).

For example, if the current PID parameters are (which are the defaults for Core
processors like SandyBridge):
        deadband = 0
        d_gain_pct = 0
        i_gain_pct = 0
        p_gain_pct = 20
        sample_rate_ms = 10
        setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process
will continue as long as the load is more than the setpoint until the maximum
P-State is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16
So by changing the setpoint from 97 to 60, there is an increase of the
next P-State from 9 to 16. This will make the processor execute at a higher
P-State for the same CPU load. If the load continues to be more than the
setpoint during the next sample intervals, then the P-State will go up again
until the maximum P-State is reached. But the ramp up time to reach the maximum
P-State will be much faster when the setpoint is 60 compared to 97.
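The two worked examples, and the difference in ramp up time, can be checked
with a short calculation. This sketch follows the simplified proportional-only
formula above rather than the driver's actual code. The function names are
made up, and 33 (0x21) is used as the maximum P-State purely for illustration.

```shell
# Next P-State from the proportional-only formula above, rounded to nearest.
# Arguments: current P-State, current load, setpoint, p_gain_pct.
next_pstate() {
    awk -v cur="$1" -v load="$2" -v sp="$3" -v gain="$4" \
        'BEGIN { printf "%d\n", int(cur - (sp - load) * gain / 100 + 0.5) }'
}

# Count the sample intervals needed to climb from P-State $1 to the maximum
# P-State $2 at a constant load of 100 ($3: setpoint, $4: p_gain_pct).
# The loop stops as soon as the maximum is reached or passed.
samples_to_max() {
    cur=$1; n=0
    while [ "$cur" -lt "$2" ]; do
        cur=$(next_pstate "$cur" 100 "$3" "$4")
        n=$((n + 1))
    done
    echo "$n"
}

next_pstate 8 100 97 20      # prints 9
next_pstate 8 100 60 20      # prints 16
samples_to_max 8 33 97 20    # prints 25
samples_to_max 8 33 60 20    # prints 4
```

With the default sample_rate_ms of 10, that is roughly 250 ms versus 40 ms to
reach the maximum P-State in this simplified model.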

Debugging Intel P-State driver

Event tracing
To debug P-State transitions, the Linux event tracing interface can be used.
There are two specific events, which can be enabled (provided the kernel
configs related to event tracing are enabled).

# cd /sys/kernel/debug/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107
    scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
    freq=2474476
cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
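The pstate_sample events are easier to scan once reduced to the fields of
interest. The filter below is a sketch, assuming each event occupies a single
line in the real trace file (the sample above is wrapped for width) and that
fields use the key=value form shown. pstate_transitions is a made-up name.

```shell
# Summarize pstate_sample events read from stdin as "from -> to freq=N".
# Assumption: one event per line, with key=value fields as shown above.
pstate_transitions() {
    awk '/pstate_sample:/ {
        from = to = freq = "?"
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "from") from = kv[2]
            if (kv[1] == "to")   to = kv[2]
            if (kv[1] == "freq") freq = kv[2]
        }
        printf "%s -> %s freq=%s\n", from, to, freq
    }'
}
```

It can be fed from the trace file, for example
"pstate_transitions < /sys/kernel/debug/tracing/trace".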

Using ftrace

If function level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set the ftrace filter to intel_pstate_set_pstate.

# cd /sys/kernel/debug/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
...

# echo intel_pstate_set_pstate > set_ftrace_filter
# echo function > current_tracer
# cat trace | head -15
# tracer: function
#
# entries-in-buffer/entries-written: 80/80   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
 gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
     gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
          <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
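To see how often the function is called on each CPU, the function tracer
output can be tallied. This is a sketch, assuming the CPU number is the
bracketed field (for example [002]) as in the output above; calls_per_cpu is
a hypothetical name.

```shell
# Count traced intel_pstate_set_pstate calls per CPU from ftrace output on stdin.
# Assumption: the CPU number is the bracketed field, e.g. [002], as shown above.
calls_per_cpu() {
    awk '!/^#/ && /intel_pstate_set_pstate/ {
        if (match($0, /\[[0-9]+\]/))
            count[substr($0, RSTART + 1, RLENGTH - 2)]++
    }
    END { for (cpu in count) printf "cpu %s: %d\n", cpu, count[cpu] }' | sort
}
```

For the four calls shown above this reports two calls on CPU 000 and one each
on CPUs 001 and 002.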