.. |struct| replace:: :c:type:`struct`

==============================
Device Power Management Basics
==============================

::

 Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
 Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu>
 Copyright (c) 2016 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Most of the code in Linux is device drivers, so most of the Linux power
management (PM) code is also driver-specific.  Most drivers will do very
little; others, especially for platforms with small batteries (like cell
phones), will do a lot.

This writeup gives an overview of how drivers interact with system-wide
power management goals, emphasizing the models and interfaces that are
shared by everything that hooks up to the driver model core.  Read it as
background for the domain-specific work you'd do with any specific driver.


Two Models for Device Power Management
======================================

Drivers will use one or both of these models to put devices into low-power
states:

    System Sleep model:

        Drivers can enter low-power states as part of entering system-wide
        low-power states like "suspend" (also known as "suspend-to-RAM"), or
        (mostly for systems with disks) "hibernation" (also known as
        "suspend-to-disk").

        This is something that device, bus, and class drivers collaborate on
        by implementing various role-specific suspend and resume methods to
        cleanly power down hardware and software subsystems, then reactivate
        them without loss of data.

        Some drivers can manage hardware wakeup events, which make the system
        leave the low-power state.  This feature may be enabled or disabled
        using the relevant :file:`/sys/devices/.../power/wakeup` file (for
        Ethernet drivers the ioctl interface used by ethtool may also be used
        for this purpose); enabling it may cost some power, but it lets the
        whole system enter low-power states more often.

    Runtime Power Management model:

        Devices may also be put into low-power states while the system is
        running, independently of other power management activity in principle.
        However, devices are not generally independent of each other (for
        example, a parent device cannot be suspended unless all of its child
        devices have been suspended).  Moreover, depending on the bus type the
        device is on, it may be necessary to carry out some bus-specific
        operations on the device for this purpose.  Devices put into low power
        states at run time may require special handling during system-wide power
        transitions (suspend or hibernation).

        For these reasons not only the device driver itself, but also the
        appropriate subsystem (bus type, device type or device class) driver and
        the PM core are involved in runtime power management.  As in the system
        sleep power management case, they need to collaborate by implementing
        various role-specific suspend and resume methods, so that the hardware
        is cleanly powered down and reactivated without data or service loss.

There's not a lot to be said about those low-power states except that they are
very system-specific, and often device-specific.  Also, if enough devices have
been put into low-power states (at runtime), the effect may be very similar to
entering some system-wide low-power state (system sleep).  Synergies exist too:
several drivers using runtime PM might put the system into a state where even
deeper power saving options are available.

Most suspended devices will have quiesced all I/O: no more DMA or IRQs (except
for wakeup events), no more data read or written, and requests from upstream
drivers are no longer accepted.  A given bus or platform may have different
requirements though.

Examples of hardware wakeup events include an alarm from a real time clock,
network wake-on-LAN packets, keyboard or mouse activity, and media insertion
or removal (for PCMCIA, MMC/SD, USB, and so on).

Interfaces for Entering System Sleep States
===========================================

There are programming interfaces provided for subsystems (bus type, device type,
device class) and device drivers to allow them to participate in the power
management of devices they are concerned with.  These interfaces cover both
system sleep and runtime power management.


Device Power Management Operations
----------------------------------

Device power management operations, at the subsystem level as well as at the
device driver level, are implemented by defining and populating objects of type
|struct| :c:type:`dev_pm_ops` defined in :file:`include/linux/pm.h`.
The roles of the methods included in it will be explained in what follows.  For
now, it should be sufficient to remember that the last three methods are
specific to runtime power management while the remaining ones are used during
system-wide power transitions.

There also is a deprecated "old" or "legacy" interface for power management
operations available at least for some subsystems.  This approach does not use
|struct| :c:type:`dev_pm_ops` objects and it is suitable only for implementing
system sleep power management methods in a limited way.  Therefore it is not
described in this document, so please refer directly to the source code for more
information about it.


Subsystem-Level Methods
-----------------------

The core methods to suspend and resume devices reside in
|struct| :c:type:`dev_pm_ops` pointed to by the :c:member:`ops`
member of |struct| :c:type:`dev_pm_domain`, or by the :c:member:`pm`
member of |struct| :c:type:`bus_type`, |struct| :c:type:`device_type` and
|struct| :c:type:`class`.  They are mostly of interest to the people writing
infrastructure for platforms and buses, like PCI or USB, or device type and
device class drivers.  They also are relevant to the writers of device drivers
whose subsystems (PM domains, device types, device classes and bus types) don't
provide all power management methods.

Bus drivers implement these methods as appropriate for the hardware and the
drivers using it; PCI works differently from USB, and so on.  Not many people
write subsystem-level drivers; most driver code is a "device driver" that builds
on top of bus-specific framework code.

For more information on these driver calls, see the description later;
they are called in phases for every device, respecting the parent-child
sequencing in the driver model tree.


:file:`/sys/devices/.../power/wakeup` files
-------------------------------------------

All device objects in the driver model contain fields that control the handling
of system wakeup events (hardware signals that can force the system out of a
sleep state).  These fields are initialized by bus or device driver code using
:c:func:`device_set_wakeup_capable()` and :c:func:`device_set_wakeup_enable()`,
defined in :file:`include/linux/pm_wakeup.h`.

The :c:member:`power.can_wakeup` flag just records whether the device (and its
driver) can physically support wakeup events.  The
:c:func:`device_set_wakeup_capable()` routine affects this flag.  The
:c:member:`power.wakeup` field is a pointer to an object of type
|struct| :c:type:`wakeup_source` used for controlling whether or not
the device should use its system wakeup mechanism and for notifying the
PM core of system wakeup events signaled by the device.  This object is only
present for wakeup-capable devices (i.e. devices whose
:c:member:`can_wakeup` flags are set) and is created (or removed) by
:c:func:`device_set_wakeup_capable()`.

Whether or not a device is capable of issuing wakeup events is a hardware
matter, and the kernel is responsible for keeping track of it.  By contrast,
whether or not a wakeup-capable device should issue wakeup events is a policy
decision, and it is managed by user space through a sysfs attribute: the
:file:`power/wakeup` file.  User space can write the "enabled" or "disabled"
strings to it to indicate whether or not, respectively, the device is supposed
to signal system wakeup.  This file is only present if the
:c:member:`power.wakeup` object exists for the given device and is created (or
removed) along with that object, by :c:func:`device_set_wakeup_capable()`.
Reads from the file will return the corresponding string.

The initial value in the :file:`power/wakeup` file is "disabled" for the
majority of devices; the major exceptions are power buttons, keyboards, and
Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with ethtool.
It should also default to "enabled" for devices that don't generate wakeup
requests on their own but merely forward wakeup requests from one bus to another
(like PCI Express ports).

The :c:func:`device_may_wakeup()` routine returns true only if the
:c:member:`power.wakeup` object exists and the corresponding :file:`power/wakeup`
file contains the "enabled" string.  This information is used by subsystems,
like the PCI bus type code, to see whether or not to enable the devices' wakeup
mechanisms.  If device wakeup mechanisms are enabled or disabled directly by
drivers, they also should use :c:func:`device_may_wakeup()` to decide what to do
during a system sleep transition.  Device drivers, however, are not expected to
call :c:func:`device_set_wakeup_enable()` directly in any case.

It ought to be noted that system wakeup is conceptually different from "remote
wakeup" used by runtime power management, although it may be supported by the
same physical mechanism.  Remote wakeup is a feature allowing devices in
low-power states to trigger specific interrupts to signal conditions in which
they should be put into the full-power state.  Those interrupts may or may not
be used to signal system wakeup events, depending on the hardware design.  On
some systems it is impossible to trigger them from system sleep states.  In any
case, remote wakeup should always be enabled for runtime power management for
all devices and drivers that support it.
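
As an illustration of the user-space side of the :file:`power/wakeup`
attribute, the following is a minimal sketch of a helper that reads the file
and reports the current policy.  The function name ``wakeup_attr_enabled()``
is hypothetical, and the caller is assumed to supply the full path of the
attribute for the device of interest.  A missing file is reported as an error,
which is also what a device that is not wakeup-capable would produce.

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical user-space helper (not a kernel API): read a sysfs
     * "power/wakeup" attribute whose full path is supplied by the caller.
     *
     * Returns 1 if the file contains "enabled", 0 if it contains
     * "disabled", and -1 if the file cannot be read or holds anything else.
     */
    int wakeup_attr_enabled(const char *path)
    {
        char buf[16] = "";
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (!fgets(buf, sizeof(buf), f)) {
            fclose(f);
            return -1;
        }
        fclose(f);

        /* Strip the trailing newline, if there is one. */
        buf[strcspn(buf, "\n")] = '\0';

        if (strcmp(buf, "enabled") == 0)
            return 1;
        if (strcmp(buf, "disabled") == 0)
            return 0;
        return -1;
    }

A caller would typically treat the ``-1`` result as "no wakeup policy to
report" rather than as a fatal error.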


:file:`/sys/devices/.../power/control` files
--------------------------------------------

Each device in the driver model has a flag to control whether it is subject to
runtime power management.  This flag, :c:member:`runtime_auto`, is initialized
by the bus type (or generally subsystem) code using :c:func:`pm_runtime_allow()`
or :c:func:`pm_runtime_forbid()`; the default is to allow runtime power
management.

The setting can be adjusted by user space by writing either "on" or "auto" to
the device's :file:`power/control` sysfs file.  Writing "auto" calls
:c:func:`pm_runtime_allow()`, setting the flag and allowing the device to be
runtime power-managed by its driver.  Writing "on" calls
:c:func:`pm_runtime_forbid()`, clearing the flag, returning the device to full
power if it was in a low-power state, and preventing the
device from being runtime power-managed.  User space can check the current value
of the :c:member:`runtime_auto` flag by reading that file.

The device's :c:member:`runtime_auto` flag has no effect on the handling of
system-wide power transitions.  In particular, the device can (and in the
majority of cases should and will) be put into a low-power state during a
system-wide transition to a sleep state even though its :c:member:`runtime_auto`
flag is clear.

For more information about the runtime power management framework, refer to
:file:`Documentation/power/runtime_pm.txt`.
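
The user-space write described above can be sketched as follows.  This is only
an illustration: ``set_runtime_pm_control()`` is a hypothetical helper name,
the attribute path is supplied by the caller, and writing to sysfs normally
requires sufficient privileges.  The helper rejects anything other than the
two strings named in this section.

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical user-space helper (not a kernel API): write "on" or
     * "auto" to a sysfs "power/control" attribute at the given path.
     *
     * Returns 0 on success and -1 if the mode is not one of the two
     * accepted strings or the file cannot be written.
     */
    int set_runtime_pm_control(const char *path, const char *mode)
    {
        FILE *f;
        int ok;

        if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
            return -1;

        f = fopen(path, "w");
        if (!f)
            return -1;

        ok = fputs(mode, f) >= 0;
        /* Write errors may only be reported when the stream is closed. */
        if (fclose(f) != 0)
            ok = 0;

        return ok ? 0 : -1;
    }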


Calling Drivers to Enter and Leave System Sleep States
======================================================

When the system goes into a sleep state, each device's driver is asked to
suspend the device by putting it into a state compatible with the target
system state.  That's usually some version of "off", but the details are
system-specific.  Also, wakeup-enabled devices will usually stay partly
functional in order to wake the system.

When the system leaves that low-power state, the device's driver is asked to
resume it by returning it to full power.  The suspend and resume operations
always go together, and both are multi-phase operations.

For simple drivers, suspend might quiesce the device using class code
and then turn its hardware as "off" as possible during suspend_noirq.  The
matching resume calls would then completely reinitialize the hardware
before reactivating its class I/O queues.

More power-aware drivers might prepare the devices for triggering system wakeup
events.


Call Sequence Guarantees
------------------------

To ensure that bridges and similar links needing to talk to a device are
available when the device is suspended or resumed, the device hierarchy is
walked in a bottom-up order to suspend devices.  A top-down order is
used to resume those devices.

The ordering of the device hierarchy is defined by the order in which devices
get registered: a child can never be registered, probed or resumed before
its parent; and can't be removed or suspended after that parent.

The policy is that the device hierarchy should match hardware bus topology.
[Or at least the control bus, for devices which use multiple busses.]
In particular, this means that a device registration may fail if the parent of
the device is suspending (i.e. has been chosen by the PM core as the next
device to suspend) or has already suspended, as well as after all of the other
devices have been suspended.  Device drivers must be prepared to cope with such
situations.
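
The ordering guarantee can be illustrated with a small user-space model.  It
is a sketch only and does not reflect how the PM core stores its device list:
the ``struct order_dev`` type and the function names are hypothetical, and
devices are assumed to sit in an array in registration order, so that every
parent has a lower index than its children.  Under that assumption, walking
the array backwards suspends children before parents, and walking it forwards
resumes parents before children.

.. code-block:: c

    #include <stddef.h>

    /* Hypothetical device record; parent is an array index, or -1 for none. */
    struct order_dev {
        const char *name;
        int parent;
    };

    /* Return 1 if every parent is registered before all of its children. */
    int registration_order_valid(const struct order_dev *devs, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++)
            if (devs[i].parent >= 0 && (size_t)devs[i].parent >= i)
                return 0;
        return 1;
    }

    /* Suspend order: reverse registration order (children first). */
    void fill_suspend_order(size_t n, size_t *order)
    {
        size_t i;

        for (i = 0; i < n; i++)
            order[i] = n - 1 - i;
    }

    /* Resume order: registration order (parents first). */
    void fill_resume_order(size_t n, size_t *order)
    {
        size_t i;

        for (i = 0; i < n; i++)
            order[i] = i;
    }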


System Power Management Phases
------------------------------

Suspending or resuming the system is done in several phases.  Different phases
are used for suspend-to-idle, shallow (standby), and deep ("suspend-to-RAM")
sleep states and the hibernation state ("suspend-to-disk").  Each phase involves
executing callbacks for every device before the next phase begins.  Not all
buses or classes support all these callbacks and not all drivers use all the
callbacks.  The various phases always run after tasks have been frozen and
before they are unfrozen.  Furthermore, the ``*_noirq`` phases run at a time
when IRQ handlers have been disabled (except for those marked with the
IRQF_NO_SUSPEND flag).

All phases use PM domain, bus, type, class or driver callbacks (that is, methods
defined in ``dev->pm_domain->ops``, ``dev->bus->pm``, ``dev->type->pm``,
``dev->class->pm`` or ``dev->driver->pm``).  These callbacks are regarded by the
PM core as mutually exclusive.  Moreover, PM domain callbacks always take
precedence over all of the other callbacks and, for example, type callbacks take
precedence over bus, class and driver callbacks.  To be precise, the following
rules are used to determine which callback to execute in the given phase:

    1.  If ``dev->pm_domain`` is present, the PM core will choose the callback
        provided by ``dev->pm_domain->ops`` for execution.

    2.  Otherwise, if both ``dev->type`` and ``dev->type->pm`` are present, the
        callback provided by ``dev->type->pm`` will be chosen for execution.

    3.  Otherwise, if both ``dev->class`` and ``dev->class->pm`` are present,
        the callback provided by ``dev->class->pm`` will be chosen for
        execution.

    4.  Otherwise, if both ``dev->bus`` and ``dev->bus->pm`` are present, the
        callback provided by ``dev->bus->pm`` will be chosen for execution.

This allows PM domains and device types to override callbacks provided by bus
types or device classes if necessary.

The PM domain, type, class and bus callbacks may in turn invoke device- or
driver-specific methods stored in ``dev->driver->pm``, but they don't have to do
that.

If the subsystem callback chosen for execution is not present, the PM core will
execute the corresponding method from the ``dev->driver->pm`` set instead if
there is one.
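
The selection rules above can be modelled in a few lines of user-space C.
This is a simplified sketch and not the PM core's code: all of the
``model_*`` types, the ``demo_*`` callbacks and ``choose_suspend_cb()`` are
hypothetical names, only a single ``suspend`` callback is modelled, and the
PM domain is reduced to a pointer to its operations.

.. code-block:: c

    #include <stddef.h>

    typedef int (*model_pm_cb)(void);

    /* Stand-in for struct dev_pm_ops, reduced to one callback. */
    struct model_ops {
        model_pm_cb suspend;
    };

    /* Stand-in for a device type, class or bus type. */
    struct model_subsys {
        const struct model_ops *pm;
    };

    struct model_device {
        const struct model_ops *pm_domain;   /* models dev->pm_domain->ops */
        const struct model_subsys *type;
        const struct model_subsys *cls;      /* models dev->class */
        const struct model_subsys *bus;
        const struct model_ops *driver_pm;   /* models dev->driver->pm */
    };

    /* Example callbacks, used only to tell the candidates apart. */
    int demo_bus_suspend(void) { return 1; }
    int demo_driver_suspend(void) { return 2; }

    model_pm_cb choose_suspend_cb(const struct model_device *dev)
    {
        model_pm_cb cb = NULL;

        /* Rules 1 to 4: the first subsystem level that is present wins. */
        if (dev->pm_domain)
            cb = dev->pm_domain->suspend;
        else if (dev->type && dev->type->pm)
            cb = dev->type->pm->suspend;
        else if (dev->cls && dev->cls->pm)
            cb = dev->cls->pm->suspend;
        else if (dev->bus && dev->bus->pm)
            cb = dev->bus->pm->suspend;

        /* Fall back to the driver if the chosen callback is missing. */
        if (!cb && dev->driver_pm)
            cb = dev->driver_pm->suspend;

        return cb;
    }

Note that in this model a device type with an empty set of operations still
shadows the bus type: the fallback goes to the driver's callback, not to the
next subsystem level.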


Entering System Suspend
-----------------------

When the system goes into the freeze, standby or memory sleep state,
the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``.

    1.  The ``prepare`` phase is meant to prevent races by preventing new
        devices from being registered; the PM core would never know that all the
        children of a device had been suspended if new children could be
        registered at will.  [By contrast, from the PM core's perspective,
        devices may be unregistered at any time.]  Unlike the other
        suspend-related phases, during the ``prepare`` phase the device
        hierarchy is traversed top-down.

        After the ``->prepare`` callback method returns, no new children may be
        registered below the device.  The method may also prepare the device or
        driver in some way for the upcoming system power transition, but it
        should not put the device into a low-power state.

        For devices supporting runtime power management, the return value of the
        prepare callback can be used to indicate to the PM core that it may
        safely leave the device in runtime suspend (if runtime-suspended
        already), provided that all of the device's descendants are also left in
        runtime suspend.  Namely, if the prepare callback returns a positive
        number and that happens for all of the descendants of the device too,
        and all of them (including the device itself) are runtime-suspended, the
        PM core will skip the ``suspend``, ``suspend_late`` and
        ``suspend_noirq`` phases as well as all of the corresponding phases of
        the subsequent device resume for all of these devices.  In that case,
        the ``->complete`` callback will be invoked directly after the
        ``->prepare`` callback and is entirely responsible for putting the
        device into a consistent state as appropriate.

        Note that this direct-complete procedure applies even if the device is
        disabled for runtime PM; only the runtime-PM status matters.  It follows
        that if a device has system-sleep callbacks but does not support runtime
        PM, then its prepare callback must never return a positive value.  This
        is because all such devices are initially set to runtime-suspended with
        runtime PM disabled.

    2.  The ``->suspend`` methods should quiesce the device to stop it from
        performing I/O.  They also may save the device registers and put it into
        the appropriate low-power state, depending on the bus type the device is
        on, and they may enable wakeup events.

    3.  For a number of devices it is convenient to split suspend into the
        "quiesce device" and "save device state" phases, in which cases
        ``suspend_late`` is meant to do the latter.  It is always executed after
        runtime power management has been disabled for the device in question.

    4.  The ``suspend_noirq`` phase occurs after IRQ handlers have been disabled,
        which means that the driver's interrupt handler will not be called while
        the callback method is running.  The ``->suspend_noirq`` methods should
        save the values of the device's registers that weren't saved previously
        and finally put the device into the appropriate low-power state.

        The majority of subsystems and device drivers need not implement this
        callback.  However, bus types allowing devices to share interrupt
        vectors, like PCI, generally need it; otherwise a driver might encounter
        an error during the suspend phase by fielding a shared interrupt
        generated by some other device after its own device had been set to low
        power.

At the end of these phases, drivers should have stopped all I/O transactions
(DMA, IRQs), saved enough state that they can re-initialize or restore previous
state (as needed by the hardware), and placed the device into a low-power state.
On many platforms they will gate off one or more clock sources; sometimes they
will also switch off power supplies or reduce voltages.  [Drivers supporting
runtime PM may already have performed some or all of these steps.]

If :c:func:`device_may_wakeup(dev)` returns ``true``, the device should be
prepared for generating hardware wakeup signals to trigger a system wakeup event
when the system is in the sleep state.  For example, :c:func:`enable_irq_wake()`
might identify GPIO signals hooked up to a switch or other external hardware,
and :c:func:`pci_enable_wake()` does something similar for the PCI PME signal.

If any of these callbacks returns an error, the system won't enter the desired
low-power state.  Instead, the PM core will unwind its actions by resuming all
the devices that were suspended.
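
The direct-complete condition described for the ``prepare`` phase can be
expressed as a small user-space model.  This is a sketch under stated
assumptions, not the PM core's implementation: ``struct dc_dev`` and
``mark_direct_complete()`` are hypothetical names, and devices are assumed to
be stored in registration order, so that every parent has a lower array index
than its children.  A device qualifies only if its own ``->prepare`` returned
a positive number, it is runtime-suspended, and the same holds for all of its
descendants.

.. code-block:: c

    #include <stddef.h>

    /* Hypothetical per-device record for the model. */
    struct dc_dev {
        int parent;             /* array index of the parent, -1 for none */
        int prepare_ret;        /* value returned by the ->prepare callback */
        int runtime_suspended;  /* nonzero if the runtime-PM status is suspended */
    };

    /*
     * Set direct[i] to nonzero for each device whose suspend, suspend_late
     * and suspend_noirq phases (and the matching resume phases) may be
     * skipped.  devs[] must list parents before their children.
     */
    void mark_direct_complete(const struct dc_dev *devs, size_t n, int *direct)
    {
        size_t i;

        /* A device must qualify on its own first. */
        for (i = 0; i < n; i++)
            direct[i] = devs[i].prepare_ret > 0 && devs[i].runtime_suspended;

        /*
         * Walk children before parents: a device that does not qualify
         * disqualifies its parent, which then disqualifies its own parent
         * when the walk reaches it.
         */
        for (i = n; i-- > 0; ) {
            int p = devs[i].parent;

            if (p >= 0 && !direct[i])
                direct[p] = 0;
        }
    }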


Leaving System Suspend
----------------------

When resuming from freeze, standby or memory sleep, the phases are:
``resume_noirq``, ``resume_early``, ``resume``, ``complete``.

    1.  The ``->resume_noirq`` callback methods should perform any actions
        needed before the driver's interrupt handlers are invoked.  This
        generally means undoing the actions of the ``suspend_noirq`` phase.  If
        the bus type permits devices to share interrupt vectors, like PCI, the
        method should bring the device and its driver into a state in which the
        driver can recognize if the device is the source of incoming interrupts,
        if any, and handle them correctly.

        For example, the PCI bus type's ``->pm.resume_noirq()`` puts the device
        into the full-power state (D0 in the PCI terminology) and restores the
        standard configuration registers of the device.  Then it calls the
        device driver's ``->pm.resume_noirq()`` method to perform device-specific
        actions.

    2.  The ``->resume_early`` methods should prepare devices for the execution
        of the resume methods.  This generally involves undoing the actions of
        the preceding ``suspend_late`` phase.

    3.  The ``->resume`` methods should bring the device back to its operating
        state, so that it can perform normal I/O.  This generally involves
        undoing the actions of the ``suspend`` phase.

    4.  The ``complete`` phase should undo the actions of the ``prepare`` phase.
        For this reason, unlike the other resume-related phases, during the
        ``complete`` phase the device hierarchy is traversed bottom-up.

        Note, however, that new children may be registered below the device as
        soon as the ``->resume`` callbacks occur; it's not necessary to wait
        until the ``complete`` phase with that.

        Moreover, if the preceding ``->prepare`` callback returned a positive
        number, the device may have been left in runtime suspend throughout the
        whole system suspend and resume (the ``suspend``, ``suspend_late``,
        ``suspend_noirq`` phases of system suspend and the ``resume_noirq``,
        ``resume_early``, ``resume`` phases of system resume may have been
        skipped for it).  In that case, the ``->complete`` callback is entirely
        responsible for putting the device into a consistent state after system
        suspend if necessary.  [For example, it may need to queue up a runtime
        resume request for the device for this purpose.]  To check if that is
        the case, the ``->complete`` callback can consult the device's
        ``power.direct_complete`` flag.  Namely, if that flag is set when the
        ``->complete`` callback is being run, it has been called directly after
        the preceding ``->prepare`` and special actions may be required
        to make the device work correctly afterward.

At the end of these phases, drivers should be as functional as they were before
suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
gated on.

However, the details here may again be platform-specific.  For example,
some systems support multiple "run" states, and the mode in effect at
the end of resume might not be the one which preceded suspension.
That means availability of certain clocks or power supplies changed,
which could easily affect how a driver works.

Drivers need to be able to handle hardware which has been reset since all of the
suspend methods were called, for example by complete reinitialization.
This may be the hardest part, and the one most protected by NDA'd documents
and chip errata.  It's simplest if the hardware state hasn't changed since
the suspend was carried out, but that can only be guaranteed if the target
system sleep entered was suspend-to-idle.  For the other system sleep states
that may not be the case (and usually isn't for ACPI-defined system sleep
states, like S3).

Drivers must also be prepared to notice that the device has been removed
while the system was powered down, whenever that's physically possible.
PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
where common Linux platforms will see such removal.  Details of how drivers
will notice and handle such removals are currently bus-specific, and often
involve a separate thread.

These callbacks may return an error value, but the PM core will ignore such
errors since there's nothing it can do about them other than printing them in
the system log.


Entering Hibernation
--------------------

Hibernating the system is more complicated than putting it into sleep states,
because it involves creating and saving a system image.  Therefore there are
more phases for hibernation, with a different set of callbacks.  These phases
always run after tasks have been frozen and enough memory has been freed.

The general procedure for hibernation is to quiesce all devices ("freeze"),
create an image of the system memory while everything is stable, reactivate all
devices ("thaw"), write the image to permanent storage, and finally shut down
the system ("power off").  The phases used to accomplish this are: ``prepare``,
``freeze``, ``freeze_late``, ``freeze_noirq``, ``thaw_noirq``, ``thaw_early``,
``thaw``, ``complete``, ``prepare``, ``poweroff``, ``poweroff_late``,
``poweroff_noirq``.

    1.  The ``prepare`` phase is discussed in the "Entering System Suspend"
        section above.

    2.  The ``->freeze`` methods should quiesce the device so that it doesn't
        generate IRQs or DMA, and they may need to save the values of device
        registers.  However the device does not have to be put in a low-power
        state, and to save time it's best not to do so.  Also, the device should
        not be prepared to generate wakeup events.

    3.  The ``freeze_late`` phase is analogous to the ``suspend_late`` phase
        described earlier, except that the device should not be put into a
        low-power state and should not be allowed to generate wakeup events.

    4.  The ``freeze_noirq`` phase is analogous to the ``suspend_noirq`` phase
        discussed earlier, except again that the device should not be put into
        a low-power state and should not be allowed to generate wakeup events.

At this point the system image is created.  All devices should be inactive and
the contents of memory should remain undisturbed while this happens, so that the
image forms an atomic snapshot of the system state.

    5.  The ``thaw_noirq`` phase is analogous to the ``resume_noirq`` phase
        discussed earlier.  The main difference is that its methods can assume
        the device is in the same state as at the end of the ``freeze_noirq``
        phase.

    6.  The ``thaw_early`` phase is analogous to the ``resume_early`` phase
        described above.  Its methods should undo the actions of the preceding
        ``freeze_late``, if necessary.

    7.  The ``thaw`` phase is analogous to the ``resume`` phase discussed
        earlier.  Its methods should bring the device back to an operating
        state, so that it can be used for saving the image if necessary.

    8.  The ``complete`` phase is discussed in the "Leaving System Suspend"
        section above.

At this point the system image is saved, and the devices then need to be
prepared for the upcoming system shutdown.  This is much like suspending them
before putting the system into the suspend-to-idle, shallow or deep sleep state,
and the phases are similar.

    9.  The ``prepare`` phase is discussed above.

    10. The ``poweroff`` phase is analogous to the ``suspend`` phase.

    11. The ``poweroff_late`` phase is analogous to the ``suspend_late`` phase.

    12. The ``poweroff_noirq`` phase is analogous to the ``suspend_noirq`` phase.

The ``->poweroff``, ``->poweroff_late`` and ``->poweroff_noirq`` callbacks
should do essentially the same things as the ``->suspend``, ``->suspend_late``
and ``->suspend_noirq`` callbacks, respectively.  The only notable difference is
that they need not store the device register values, because the registers
should already have been stored during the ``freeze``, ``freeze_late`` or
``freeze_noirq`` phases.
545
546
547Leaving Hibernation
548-------------------
549
550Resuming from hibernation is, again, more complicated than resuming from a sleep
551state in which the contents of main memory are preserved, because it requires
552a system image to be loaded into memory and the pre-hibernation memory contents
553to be restored before control can be passed back to the image kernel.
554
555Although in principle the image might be loaded into memory and the
556pre-hibernation memory contents restored by the boot loader, in practice this
557can't be done because boot loaders aren't smart enough and there is no
558established protocol for passing the necessary information. So instead, the
559boot loader loads a fresh instance of the kernel, called "the restore kernel",
560into memory and passes control to it in the usual way. Then the restore kernel
561reads the system image, restores the pre-hibernation memory contents, and passes
562control to the image kernel. Thus two different kernel instances are involved
563in resuming from hibernation. In fact, the restore kernel may be completely
564different from the image kernel: a different configuration and even a different
565version. This has important consequences for device drivers and their
566subsystems.
567
568To be able to load the system image into memory, the restore kernel needs to
569include at least a subset of device drivers allowing it to access the storage
570medium containing the image, although it doesn't need to include all of the
571drivers present in the image kernel. After the image has been loaded, the
572devices managed by the boot kernel need to be prepared for passing control back
573to the image kernel. This is very similar to the initial steps involved in
574creating a system image, and it is accomplished in the same way, using
575``prepare``, ``freeze``, and ``freeze_noirq`` phases. However, the devices
576affected by these phases are only those having drivers in the restore kernel;
577other devices will still be in whatever state the boot loader left them.
578
579Should the restoration of the pre-hibernation memory contents fail, the restore
580kernel would go through the "thawing" procedure described above, using the
581``thaw_noirq``, ``thaw_early``, ``thaw``, and ``complete`` phases, and then
582continue running normally. This happens only rarely. Most often the
583pre-hibernation memory contents are restored successfully and control is passed
584to the image kernel, which then becomes responsible for bringing the system back
585to the working state.

To achieve this, the image kernel must restore the devices' pre-hibernation
functionality. The operation is much like waking up from a sleep state (with
the memory contents preserved), although it involves different phases:
``restore_noirq``, ``restore_early``, ``restore``, ``complete``.

    1. The ``restore_noirq`` phase is analogous to the ``resume_noirq`` phase.

    2. The ``restore_early`` phase is analogous to the ``resume_early`` phase.

    3. The ``restore`` phase is analogous to the ``resume`` phase.

    4. The ``complete`` phase is discussed above.

The main difference from ``resume[_early|_noirq]`` is that
``restore[_early|_noirq]`` must assume the device has been accessed and
reconfigured by the boot loader or the restore kernel. Consequently, the state
of the device may be different from the state remembered from the ``freeze``,
``freeze_late`` and ``freeze_noirq`` phases. The device may even need to be
reset and completely re-initialized. In many cases this difference doesn't
matter, so the ``->resume[_early|_noirq]`` and ``->restore[_early|_noirq]``
method pointers can be set to the same routines. Nevertheless, different
callback pointers are used in case there is a situation where it actually does
matter.
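
As an illustration of the last point, the following is a minimal userspace
sketch, not kernel code. ``struct dev_model``, ``struct pm_ops_model`` and the
``foo_*`` routines are hypothetical stand-ins; only the ``resume`` and
``restore`` method names come from the text above. It shows one table that
points both methods at the same routine and another whose ``restore`` routine
resets the device first, on the assumption that the boot loader or the restore
kernel may have reconfigured it.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical device model: one live register plus its saved value. */
   struct dev_model {
       uint32_t ctrl_reg;    /* live register, may have been clobbered */
       uint32_t saved_ctrl;  /* value remembered from the freeze phases */
       int reset_count;      /* number of resets performed so far */
   };

   /* Reprogram the device from the saved state, without resetting it. */
   static int foo_resume(struct dev_model *dev)
   {
       dev->ctrl_reg = dev->saved_ctrl;
       return 0;
   }

   /* Do not trust the current state: reset, then reuse the resume path. */
   static int foo_restore(struct dev_model *dev)
   {
       dev->ctrl_reg = 0;
       dev->reset_count++;
       return foo_resume(dev);
   }

   /* Stand-in for the two method pointers discussed above. */
   struct pm_ops_model {
       int (*resume)(struct dev_model *dev);
       int (*restore)(struct dev_model *dev);
   };

   /* Common case: the difference doesn't matter, one routine serves both. */
   static const struct pm_ops_model simple_ops = {
       .resume = foo_resume,
       .restore = foo_resume,
   };

   /* Case where it matters: restore resets the device before reprogramming. */
   static const struct pm_ops_model careful_ops = {
       .resume = foo_resume,
       .restore = foo_restore,
   };

A driver for which the difference doesn't matter would use a table like
``simple_ops``; one whose hardware can be left in an unknown state would use
something closer to ``careful_ops``.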


Power Management Notifiers
==========================

There are some operations that cannot be carried out by the power management
callbacks discussed above, because the callbacks occur too late or too early.
To handle these cases, subsystems and device drivers may register power
management notifiers that are called before tasks are frozen and after they have
been thawed. Generally speaking, the PM notifiers are suitable for performing
actions that either require user space to be available, or at least won't
interfere with user space.

For details refer to :doc:`notifiers`.


Device Low-Power (suspend) States
=================================

Device low-power states aren't standard. One device might only handle
"on" and "off", while another might support a dozen different versions of
"on" (how many engines are active?), plus a state that gets back to "on"
faster than from a full "off".

Some buses define rules about what different suspend states mean. PCI
gives one example: after the suspend sequence completes, a non-legacy
PCI device may not perform DMA or issue IRQs, and any wakeup events it
issues would be issued through the PME# bus signal. Plus, there are
several PCI-standard device states, some of which are optional.

In contrast, integrated system-on-chip processors often use IRQs as the
wakeup event sources (so drivers would call :c:func:`enable_irq_wake`) and
might be able to treat DMA completion as a wakeup event (sometimes DMA can stay
active too, so that only the CPU and some peripherals sleep).

Some details here may be platform-specific. Systems may have devices that
can be fully active in certain sleep states, such as an LCD display that's
refreshed using DMA while most of the system is sleeping lightly ... and
its frame buffer might even be updated by a DSP or other non-Linux CPU while
the Linux control processor stays idle.

Moreover, the specific actions taken may depend on the target system state.
One target system state might allow a given device to be very operational;
another might require a hard shutdown with re-initialization on resume.
And two different target systems might use the same device in different
ways; the aforementioned LCD might be active in one product's "standby",
but a different product using the same SoC might work differently.


Device Power Management Domains
===============================

Sometimes devices share reference clocks or other power resources. In those
cases it generally is not possible to put devices into low-power states
individually. Instead, a set of devices sharing a power resource can be put
into a low-power state together at the same time by turning off the shared
power resource. Of course, they also need to be put into the full-power state
together, by turning the shared power resource on. A set of devices with this
property is often referred to as a power domain. A power domain may also be
nested inside another power domain. The nested domain is referred to as the
sub-domain of the parent domain.

Support for power domains is provided through the :c:member:`pm_domain` field of
|struct| :c:type:`device`. This field is a pointer to an object of
type |struct| :c:type:`dev_pm_domain`, defined in :file:`include/linux/pm.h`,
providing a set of power management callbacks analogous to the subsystem-level
and device driver callbacks that are executed for the given device during all
power transitions, instead of the respective subsystem-level callbacks.
Specifically, if a device's :c:member:`pm_domain` pointer is not NULL, the
``->suspend()`` callback from the object pointed to by it will be executed
instead of its subsystem's (e.g. bus type's) ``->suspend()`` callback and
analogously for all of the remaining callbacks. In other words, power
management domain callbacks, if defined for the given device, always take
precedence over the callbacks provided by the device's subsystem (e.g. bus type).
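
The precedence rule can be sketched as a small selection function. This is a
simplified userspace model, not the PM core's code: ``struct dev_model``,
``struct pm_domain_model``, ``struct bus_model`` and ``pick_suspend()`` are
hypothetical names, and only the ``->suspend()`` callback is modeled. It
assumes the domain's callback is chosen whenever the ``pm_domain`` pointer is
set, and it does not model what happens if the domain leaves that callback
unset.

.. code-block:: c

   #include <stddef.h>

   struct dev_model;
   typedef int (*pm_cb_t)(struct dev_model *dev);

   /* Hypothetical stand-ins for a PM domain and a subsystem (e.g. bus type). */
   struct pm_domain_model { pm_cb_t suspend; };
   struct bus_model { pm_cb_t suspend; };

   struct dev_model {
       const struct pm_domain_model *pm_domain;  /* NULL if in no domain */
       const struct bus_model *bus;              /* the device's subsystem */
   };

   /* Return the ->suspend() routine that would run for dev in this model. */
   static pm_cb_t pick_suspend(const struct dev_model *dev)
   {
       if (dev->pm_domain)
           return dev->pm_domain->suspend;  /* the domain takes precedence */
       if (dev->bus)
           return dev->bus->suspend;
       return NULL;
   }

   /* Example callbacks; the return values only identify which one ran. */
   static int domain_suspend(struct dev_model *dev) { (void)dev; return 1; }
   static int bus_suspend(struct dev_model *dev) { (void)dev; return 2; }

For a device whose ``pm_domain`` is NULL, ``pick_suspend()`` returns the bus
callback; once ``pm_domain`` points to a domain object, the bus callback is no
longer selected.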

The support for device power management domains is only relevant to platforms
needing to use the same device driver power management callbacks in many
different power domain configurations and wanting to avoid incorporating the
support for power domains into subsystem-level callbacks, for example by
modifying the platform bus type. Other platforms need not implement it or take
it into account in any way.

Devices may be defined as IRQ-safe, which indicates to the PM core that their
runtime PM callbacks may be invoked with disabled interrupts (see
:file:`Documentation/power/runtime_pm.txt` for more information). If an
IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be
disallowed, unless the domain itself is defined as IRQ-safe. However, it
makes sense to define a PM domain as IRQ-safe only if all the devices in it
are IRQ-safe. Moreover, if an IRQ-safe domain has a parent domain, the runtime
PM of the parent is only allowed if the parent itself is IRQ-safe too, with the
additional restriction that all child domains of an IRQ-safe parent must also
be IRQ-safe.
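
The IRQ-safe rules above can be expressed as two predicates. This is only a
userspace sketch with hypothetical types and function names, not the PM core's
implementation. ``domain_runtime_pm_allowed()`` models the conditions under
which runtime PM of a domain is disallowed, and ``domain_config_valid()``
models the restriction on the direct child domains of an IRQ-safe parent. The
recommendation that an IRQ-safe domain should contain only IRQ-safe devices is
not enforced here.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   struct dev_model { bool irq_safe; };

   struct domain_model {
       bool irq_safe;
       const struct dev_model *devs;                  /* devices in the domain */
       size_t ndevs;
       const struct domain_model *const *subdomains;  /* direct child domains */
       size_t nsub;
   };

   /*
    * A domain that is not IRQ-safe loses runtime PM if it contains an
    * IRQ-safe device or has an IRQ-safe child domain.
    */
   static bool domain_runtime_pm_allowed(const struct domain_model *d)
   {
       size_t i;

       if (d->irq_safe)
           return true;
       for (i = 0; i < d->ndevs; i++)
           if (d->devs[i].irq_safe)
               return false;
       for (i = 0; i < d->nsub; i++)
           if (d->subdomains[i]->irq_safe)
               return false;
       return true;
   }

   /* All direct child domains of an IRQ-safe parent must be IRQ-safe too. */
   static bool domain_config_valid(const struct domain_model *d)
   {
       size_t i;

       if (!d->irq_safe)
           return true;
       for (i = 0; i < d->nsub; i++)
           if (!d->subdomains[i]->irq_safe)
               return false;
       return true;
   }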


Runtime Power Management
========================

Many devices are able to dynamically power down while the system is still
running. This feature is useful for devices that are not being used, and
can offer significant power savings on a running system. These devices
often support a range of runtime power states, which might use names such
as "off", "sleep", "idle", "active", and so on. Those states will in some
cases (like PCI) be partially constrained by the bus the device uses, and will
usually include hardware states that are also used in system sleep states.

A system-wide power transition can be started while some devices are in low
power states due to runtime power management. The system sleep PM callbacks
should recognize such situations and react to them appropriately, but the
necessary actions are subsystem-specific.

In some cases the decision may be made at the subsystem level while in other
cases the device driver may be left to decide. In some cases it may be
desirable to leave a suspended device in that state during a system-wide power
transition, but in other cases the device must be put back into the full-power
state temporarily, for example so that its system wakeup capability can be
disabled. This all depends on the hardware and the design of the subsystem and
device driver in question.

During system-wide resume from a sleep state it's easiest to put devices into
the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
Refer to that document for more information regarding this particular issue as
well as for information on the device runtime power management framework in
general.