Andrea Arcangeli | 25edd8b | 2015-09-04 15:46:00 -0700 | [diff] [blame] | 1 | = Userfaultfd = |
| 2 | |
| 3 | == Objective == |
| 4 | |
| 5 | Userfaults allow the implementation of on-demand paging from userland |
| 6 | and more generally they allow userland to take control of various |
| 7 | memory page faults, something otherwise only the kernel code could do. |
| 8 | |
| 9 | For example, userfaults allow a proper and more efficient implementation |
| 10 | of the PROT_NONE+SIGSEGV trick. |
| 11 | |
| 12 | == Design == |
| 13 | |
| 14 | Userfaults are delivered and resolved through the userfaultfd syscall. |
| 15 | |
| 16 | The userfaultfd (aside from registering and unregistering virtual |
| 17 | memory ranges) provides two primary functionalities: |
| 18 | |
| 19 | 1) read/POLLIN protocol to notify a userland thread of the faults |
| 20 | happening |
| 21 | |
| 22 | 2) various UFFDIO_* ioctls that manage the virtual memory regions |
| 23 | registered in the userfaultfd and allow userland to efficiently |
| 24 | resolve the userfaults it receives via 1) or to manage the virtual |
| 25 | memory in the background |
| 26 | |
| 27 | The real advantage of userfaults over regular virtual memory |
| 28 | management with mremap/mprotect is that userfault operations |
| 29 | never involve heavyweight structures like vmas (in fact the |
| 30 | userfaultfd runtime load never takes the mmap_sem for writing). |
| 31 | |
| 32 | Vmas are not suitable for page- (or hugepage-) granular fault tracking |
| 33 | when dealing with virtual address spaces that could span |
| 34 | terabytes. Too many vmas would be needed for that. |
| 35 | |
| 36 | Once opened by invoking the syscall, the userfaultfd can also be |
| 37 | passed over unix domain sockets to a manager process, so the same |
| 38 | manager process can handle the userfaults of a multitude of |
| 39 | different processes without them being aware of what is going on |
| 40 | (unless they later try to use the userfaultfd themselves on the |
| 41 | same region the manager is already tracking, which is a corner |
| 42 | case that would currently return -EBUSY). |
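The fd passing mentioned above can be done with an SCM_RIGHTS control
message. The following is only a sketch: send_fd() and recv_fd() are
hypothetical helper names, not part of the userfaultfd API, and the
single data byte sent along with the descriptor is an arbitrary choice.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control-message buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over a connected AF_UNIX socket. Returns 0, or -1 on error. */
static int send_fd(int sock, int fd)
{
	char byte = 'u';	/* stream sockets need real data to carry the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The monitored process would call send_fd() with its userfaultfd, and the
manager would call recv_fd() on the other end of the socket and then
poll() the descriptor it gets back.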
| 43 | |
| 44 | == API == |
| 45 | |
| 46 | When first opened, the userfaultfd must be enabled by invoking the |
| 47 | UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later |
| 48 | API version). This specifies the read/POLLIN protocol that |
Andrea Arcangeli | a9b85f9 | 2015-09-04 15:46:37 -0700 | [diff] [blame] | 49 | userland intends to speak on the UFFD and the uffdio_api.features |
| 50 | userland requires. If the UFFDIO_API ioctl succeeds (i.e. if the |
| 51 | requested uffdio_api.api is also spoken by the running kernel and the |
| 52 | requested features are going to be enabled), it returns in |
| 53 | uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of, |
| 54 | respectively, all the available features of the read(2) protocol and |
| 55 | the generic ioctls available. |
Andrea Arcangeli | 25edd8b | 2015-09-04 15:46:00 -0700 | [diff] [blame] | 56 | |
Mike Rapoport | 5a02026 | 2017-02-24 14:58:34 -0800 | [diff] [blame] | 57 | The uffdio_api.features bitmask returned by the UFFDIO_API ioctl |
| 58 | defines what memory types are supported by the userfaultfd and what |
| 59 | events, except page fault notifications, may be generated. |
| 60 | |
| 61 | If the kernel supports registering userfaultfd ranges on hugetlbfs |
| 62 | virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in |
| 63 | uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be |
| 64 | set if the kernel supports registering userfaultfd ranges on shared |
| 65 | memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero |
| 66 | MAP_SHARED, memfd_create, etc). |
| 67 | |
| 68 | A userland application that wants to use userfaultfd with hugetlbfs |
| 69 | or shared memory needs to set the corresponding flag in |
| 70 | uffdio_api.features to enable those features. |
| 71 | |
| 72 | If userland wants to receive notifications for events other than |
| 73 | page faults, it has to verify that uffdio_api.features has the |
| 74 | appropriate UFFD_FEATURE_EVENT_* bits set. These events are described |
| 75 | in more detail below in the "Non-cooperative userfaultfd" section. |
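As a rough illustration of the UFFDIO_API handshake, the sketch below
opens a userfaultfd and negotiates the API. It assumes the kernel headers
provide <linux/userfaultfd.h> and __NR_userfaultfd; uffd_open() and
uffd_has_ioctl() are hypothetical helper names. The syscall can fail
(for example when unprivileged use is disabled), so callers have to
handle a -1 return.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and enable it with UFFDIO_API.
 * want_features: UFFD_FEATURE_* bits userland requires (0 for none).
 * On success returns the fd and fills *api with the kernel's answer
 * (api->features and api->ioctls bitmasks). Returns -1 on error.
 */
static int uffd_open(__u64 want_features, struct uffdio_api *api)
{
	int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;

	memset(api, 0, sizeof(*api));
	api->api = UFFD_API;
	api->features = want_features;
	if (ioctl(fd, UFFDIO_API, api) < 0) {
		/* API version or requested features not supported */
		close(fd);
		return -1;
	}
	return fd;
}

/* Check one bit of the returned uffdio_api.ioctls bitmask. */
static int uffd_has_ioctl(const struct uffdio_api *api, unsigned int nr)
{
	return (api->ioctls & ((__u64)1 << nr)) != 0;
}
```

A caller would typically check uffd_has_ioctl(&api, _UFFDIO_REGISTER)
before trying to register any range.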
| 76 | |
Andrea Arcangeli | 25edd8b | 2015-09-04 15:46:00 -0700 | [diff] [blame] | 77 | Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should |
| 78 | be invoked (if present in the returned uffdio_api.ioctls bitmask) to |
| 79 | register a memory range in the userfaultfd by setting the |
| 80 | uffdio_register structure accordingly. The uffdio_register.mode |
| 81 | bitmask will specify to the kernel which kind of faults to track for |
| 82 | the range (UFFDIO_REGISTER_MODE_MISSING would track missing |
| 83 | pages). The UFFDIO_REGISTER ioctl will return the |
| 84 | uffdio_register.ioctls bitmask of ioctls that are suitable to resolve |
| 85 | userfaults on the range registered. Not all ioctls will necessarily be |
| 86 | supported for all memory types depending on the underlying virtual |
| 87 | memory backend (anonymous memory vs tmpfs vs real filebacked |
| 88 | mappings). |
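A registration for missing-page tracking could look roughly like the
sketch below. uffd_register_missing() and range_supports_copy() are
hypothetical helper names; the range is assumed to be page aligned and
the descriptor is assumed to have already been enabled with UFFDIO_API.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Register [addr, addr + len) for missing-page faults.
 * On success returns 0 and stores the uffdio_register.ioctls bitmask
 * for this range in *ioctls. Returns -1 on error.
 */
static int uffd_register_missing(int uffd, void *addr, __u64 len,
				 __u64 *ioctls)
{
	struct uffdio_register reg;

	assert(ioctls != NULL);
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (__u64)(unsigned long)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	*ioctls = reg.ioctls;
	return 0;
}

/* Not every memory type supports every ioctl, so test before use. */
static int range_supports_copy(__u64 ioctls)
{
	return (ioctls & ((__u64)1 << _UFFDIO_COPY)) != 0;
}
```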
| 89 | |
| 90 | Userland can use the uffdio_register.ioctls to manage the virtual |
| 91 | address space in the background (to add or potentially also remove |
| 92 | memory from the userfaultfd registered range). This means a userfault |
| 93 | could trigger just before userland maps the faulting page in the |
| 94 | background. |
| 95 | |
| 96 | The primary ioctl to resolve userfaults is UFFDIO_COPY. It |
| 97 | atomically copies a page into the userfault registered range and wakes |
| 98 | up the blocked userfaults (unless uffdio_copy.mode & |
| 99 | UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to |
| 100 | UFFDIO_COPY. They're atomic in the sense that nothing can see a |
| 101 | half-copied page, since accesses keep userfaulting until the copy has |
| 102 | finished. |
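Resolving a single fault with UFFDIO_COPY might be sketched as follows.
page_base() and uffd_copy_page() are hypothetical helper names, and
page_size is assumed to be a power of two matching the page size of the
registered range.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Round a faulting address down to the start of its page. */
static __u64 page_base(__u64 addr, __u64 page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Copy one page from src into the registered range at the page that
 * contains fault_addr. If wake is zero, UFFDIO_COPY_MODE_DONTWAKE is
 * set and the blocked faulting threads are not woken by this call.
 * Returns 0 on success, -1 on error.
 */
static int uffd_copy_page(int uffd, __u64 fault_addr, const void *src,
			  __u64 page_size, int wake)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = page_base(fault_addr, page_size);
	copy.src = (__u64)(unsigned long)src;
	copy.len = page_size;
	copy.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;

	if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
		return -1;

	/* uffdio_copy.copy reports the number of bytes actually copied */
	return copy.copy == (__s64)page_size ? 0 : -1;
}
```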
| 103 | |
| 104 | == QEMU/KVM == |
| 105 | |
| 106 | QEMU/KVM is using the userfaultfd syscall to implement postcopy live |
| 107 | migration. Postcopy live migration is one form of memory |
| 108 | externalization consisting of a virtual machine running with part or |
| 109 | all of its memory residing on a different node in the cloud. The |
| 110 | userfaultfd abstraction is generic enough that not a single line of |
| 111 | KVM kernel code had to be modified in order to add postcopy live |
| 112 | migration to QEMU. |
| 113 | |
| 114 | Guest async page faults, FOLL_NOWAIT and all other GUP features work |
| 115 | just fine in combination with userfaults. Userfaults trigger async |
| 116 | page faults in the guest scheduler so those guest processes that |
| 117 | aren't waiting for userfaults (e.g. network bound) can keep running in |
| 118 | the guest vcpus. |
| 119 | |
| 120 | It is generally beneficial to run one pass of precopy live migration |
| 121 | just before starting postcopy live migration, in order to avoid |
| 122 | generating userfaults for readonly guest regions. |
| 123 | |
| 124 | The implementation of postcopy live migration currently uses one |
| 125 | single bidirectional socket but in the future two different sockets |
| 126 | will be used (to reduce the latency of the userfaults to the minimum |
| 127 | possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). |
| 128 | |
| 129 | The QEMU in the source node writes all pages that it knows are missing |
| 130 | in the destination node, into the socket, and the migration thread of |
| 131 | the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE |
| 132 | ioctls on the userfaultfd in order to map the received pages into the |
| 133 | guest (UFFDIO_ZEROPAGE is used if the source page was a zero page). |
| 134 | |
| 135 | A different postcopy thread in the destination node listens with |
| 136 | poll() to the userfaultfd in parallel. When a POLLIN event is |
| 137 | generated after a userfault triggers, the postcopy thread read()s from |
| 138 | the userfaultfd and receives the fault address (or -EAGAIN in case the |
| 139 | userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run |
| 140 | by the parallel QEMU migration thread). |
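The poll()/read() step of such a fault-handling thread could be sketched
like this. It is not QEMU's code: uffd_wait_fault() is a hypothetical
helper name, and the descriptor is assumed to have been opened with
O_NONBLOCK so that read() can return EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for one message on the userfaultfd.
 * Returns 1 and stores the fault address in *addr for a page fault,
 * 0 if there is nothing to resolve (timeout, EAGAIN because the fault
 * was already resolved, or a non-pagefault event), -1 on error.
 */
static int uffd_wait_fault(int uffd, int timeout_ms, __u64 *addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;
	int ready;

	ready = poll(&pfd, 1, timeout_ms);
	if (ready < 0)
		return -1;
	if (ready == 0)
		return 0;
	if (!(pfd.revents & POLLIN))
		return -1;

	n = read(uffd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg))
		return -1;
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

The caller would loop on this function and forward each returned
address to whatever component supplies the missing page.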
| 141 | |
| 142 | After the QEMU postcopy thread (running in the destination node) gets |
| 143 | the userfault address it writes the information about the missing page |
| 144 | into the socket. The QEMU source node receives the information and |
| 145 | roughly "seeks" to that page address and continues sending all |
| 146 | remaining missing pages from that new page offset. Soon after that |
| 147 | (just the time to flush the tcp_wmem queue through the network) the |
| 148 | migration thread in the QEMU running in the destination node will |
| 149 | receive the page that triggered the userfault and it'll map it as |
| 150 | usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it |
| 151 | was spontaneously sent by the source or if it was an urgent page |
Masahiro Yamada | 9332ef9 | 2017-02-27 14:28:47 -0800 | [diff] [blame] | 152 | requested through a userfault). |
Andrea Arcangeli | 25edd8b | 2015-09-04 15:46:00 -0700 | [diff] [blame] | 153 | |
| 154 | By the time the userfaults start, the QEMU in the destination node |
| 155 | doesn't need to keep any per-page state bitmap relative to the live |
| 156 | migration around and a single per-page bitmap has to be maintained in |
| 157 | the QEMU running in the source node to know which pages are still |
| 158 | missing in the destination node. The bitmap in the source node is |
| 159 | checked to find which missing pages to send in round robin and we seek |
| 160 | over it when receiving incoming userfaults. After each page is sent |
| 161 | the bitmap is updated accordingly. The bitmap also helps to avoid |
| 162 | sending the same page twice (in case the userfault is read by the |
| 163 | postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration |
| 164 | thread). |
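The source-side bookkeeping described above can be modelled with a
small bitmap and a cursor. This is a simplified sketch, not QEMU's data
structure: struct missing_map and all function names are hypothetical,
and one bit stands for one guest page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per page; a set bit means "still missing in the destination". */
struct missing_map {
	uint64_t *bits;
	size_t npages;
	size_t cursor;	/* where the round-robin scan resumes */
};

static int page_missing(const struct missing_map *m, size_t i)
{
	return (int)((m->bits[i / 64] >> (i % 64)) & 1);
}

static void page_sent(struct missing_map *m, size_t i)
{
	assert(i < m->npages);
	m->bits[i / 64] &= ~(UINT64_C(1) << (i % 64));
}

/* An incoming userfault makes the scan "seek" to the faulting page. */
static void seek_to(struct missing_map *m, size_t page)
{
	if (page < m->npages)
		m->cursor = page;
}

/*
 * Pick the next missing page at or after the cursor, wrapping around.
 * The page is marked as sent, so it is never picked twice.
 * Returns npages when nothing is missing any more.
 */
static size_t next_to_send(struct missing_map *m)
{
	for (size_t n = 0; n < m->npages; n++) {
		size_t i = (m->cursor + n) % m->npages;

		if (page_missing(m, i)) {
			page_sent(m, i);
			m->cursor = (i + 1) % m->npages;
			return i;
		}
	}
	return m->npages;
}
```

Because a page is cleared from the bitmap as soon as it is picked, a
userfault that arrives for an already sent page simply moves the cursor
and does not cause a second transfer.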
Mike Rapoport | 5a02026 | 2017-02-24 14:58:34 -0800 | [diff] [blame] | 165 | |
| 166 | == Non-cooperative userfaultfd == |
| 167 | |
| 168 | When the userfaultfd is monitored by an external manager, the manager |
| 169 | must be able to track changes in the process virtual memory |
| 170 | layout. Userfaultfd can notify the manager about such changes using |
| 171 | the same read(2) protocol as for the page fault notifications. The |
| 172 | manager has to explicitly enable these events by setting appropriate |
| 173 | bits in uffdio_api.features passed to UFFDIO_API ioctl: |
| 174 | |
Mike Rapoport | 5a02026 | 2017-02-24 14:58:34 -0800 | [diff] [blame] | 175 | UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When |
| 176 | this feature is enabled, the userfaultfd context of the parent process |
| 177 | is duplicated into the newly created process. The manager receives |
| 178 | UFFD_EVENT_FORK with the file descriptor of the new userfaultfd |
| 179 | context in uffd_msg.fork. |
| 180 | |
| 181 | UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() |
| 182 | calls. When the non-cooperative process moves a virtual memory area to |
| 183 | a different location, the manager will receive UFFD_EVENT_REMAP. The |
| 184 | uffd_msg.remap will contain the old and new addresses of the area and |
| 185 | its original length. |
| 186 | |
| 187 | UFFD_FEATURE_EVENT_REMOVE - enable notifications about |
| 188 | madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event |
| 189 | UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The |
| 190 | uffd_msg.remove will contain start and end addresses of the removed |
| 191 | area. |
| 192 | |
| 193 | UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory |
| 194 | unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove |
| 195 | containing start and end addresses of the unmapped area. |
| 196 | |
| 197 | Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP |
| 198 | are similar, they differ in the action expected from the |
| 199 | userfaultfd manager. In the former case, the virtual memory is |
| 200 | removed but the area is not: the area remains monitored by the |
| 201 | userfaultfd, and if a page fault occurs in that area it will be |
| 202 | delivered to the manager. The proper resolution for such a page fault |
| 203 | is to zeromap the faulting address. In the latter case, when an |
| 204 | area is unmapped, either explicitly (with the munmap() system call) or |
| 205 | implicitly (e.g. during mremap()), the area is removed and in turn the |
| 206 | userfaultfd context for that area disappears too, so the manager will |
| 207 | not get further userland page faults from the removed area. Still, the |
| 208 | notification is required in order to prevent the manager from using |
| 209 | UFFDIO_COPY on the unmapped area. |
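The different reactions a manager needs for each event can be
summarized in a small dispatcher. The enum values and classify_event()
are hypothetical names chosen for this sketch; only the UFFD_EVENT_*
constants and the uffd_msg fields come from <linux/userfaultfd.h>
(headers that define UFFD_EVENT_UNMAP are assumed).

```c
#include <assert.h>
#include <linux/userfaultfd.h>

enum manager_action {
	ACT_RESOLVE_FAULT,	/* page fault: supply the page */
	ACT_ADOPT_CHILD_UFFD,	/* fork: start monitoring the new fd */
	ACT_MOVE_TRACKING,	/* mremap: area now lives elsewhere */
	ACT_EXPECT_ZEROMAP,	/* removed: still monitored, zeromap later faults */
	ACT_FORGET_RANGE,	/* unmapped: stop issuing UFFDIO_COPY there */
	ACT_IGNORE,
};

/*
 * Decide what the manager should do with one message. For events that
 * describe a range, *start and *end are set to that range (for a remap
 * it is the new location); otherwise they are set to 0.
 */
static enum manager_action classify_event(const struct uffd_msg *msg,
					  __u64 *start, __u64 *end)
{
	*start = *end = 0;

	switch (msg->event) {
	case UFFD_EVENT_PAGEFAULT:
		*start = msg->arg.pagefault.address;
		return ACT_RESOLVE_FAULT;
	case UFFD_EVENT_FORK:
		/* the new descriptor is in msg->arg.fork.ufd */
		return ACT_ADOPT_CHILD_UFFD;
	case UFFD_EVENT_REMAP:
		*start = msg->arg.remap.to;
		*end = msg->arg.remap.to + msg->arg.remap.len;
		return ACT_MOVE_TRACKING;
	case UFFD_EVENT_REMOVE:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_EXPECT_ZEROMAP;
	case UFFD_EVENT_UNMAP:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_FORGET_RANGE;
	default:
		return ACT_IGNORE;
	}
}
```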
| 210 | |
| 211 | Unlike userland page faults, which have to be synchronous and require |
| 212 | explicit or implicit wakeup, all the events are delivered |
| 213 | asynchronously and the non-cooperative process resumes execution as |
| 214 | soon as the manager executes read(). The userfaultfd manager should |
| 215 | carefully synchronize calls to UFFDIO_COPY with the event |
| 216 | processing. To aid the synchronization, the UFFDIO_COPY ioctl will |
| 217 | return -ENOSPC when the monitored process exits at the time of |
| 218 | UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed |
| 219 | its virtual memory layout simultaneously with an outstanding |
| 220 | UFFDIO_COPY operation. |
| 221 | |
| 222 | The current asynchronous model of event delivery is optimal for |
| 223 | single threaded non-cooperative userfaultfd manager implementations. A |
| 224 | synchronous event delivery model can be added later as a new |
| 225 | userfaultfd feature to facilitate multithreading enhancements of the |
| 226 | non-cooperative manager, for example to allow UFFDIO_COPY ioctls to |
| 227 | run in parallel to the event reception. Single threaded |
| 228 | implementations should continue to use the current async event |
| 229 | delivery model instead. |