Heterogeneous Memory Management (HMM)

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into the regular kernel path, with the
cornerstone of this being specialized struct page for such memory (see sections
5 to 7 of this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access a program's address space coherently
with the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPUs, DSPs, or FPGAs are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
The sixth section presents a new migration helper that allows leveraging the
device DMA engine. Finally, the last section covers memory cgroup (memcg) and
rss accounting.


1) Problems of using a device specific memory allocator
2) I/O bus, device memory characteristics
3) Shared address space and migration
4) Address space mirroring implementation and API
5) Represent and manage device memory from core kernel point of view
6) Migration to and from device memory
7) Memory cgroup (memcg) and rss accounting


-------------------------------------------------------------------------------

1) Problems of using a device specific memory allocator

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because a device can only access memory allocated
through a device specific API. This implies that not all memory objects in a
program are equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap shared) and memory allocated through the device driver API (this still
ends up as an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
but complex data sets (list, tree, ...) are hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone and the program gets harder to debug because
of the duplicate data set and addresses.
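To make the pointer re-mapping problem concrete, here is a minimal user space
sketch. It is not driver code: the arena array merely stands in for memory
obtained from some device specific allocator, and struct node and
copy_list_to_arena() are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/*
 * Copy a singly linked list into "arena", a caller-provided array standing in
 * for memory from a device specific allocator. Copying the bytes is not
 * enough: every next pointer has to be rewritten to point into the copy.
 * Returns the number of nodes copied, or -1 if the arena is too small.
 */
int copy_list_to_arena(const struct node *head,
                       struct node *arena, size_t capacity)
{
    size_t n = 0;

    for (const struct node *cur = head; cur; cur = cur->next) {
        if (n == capacity)
            return -1;
        arena[n].value = cur->value;
        /* Pointer fix-up: the successor lives in the next arena slot. */
        arena[n].next = cur->next ? &arena[n + 1] : NULL;
        n++;
    }
    return (int)n;
}
```

Even for this trivial structure the copy has to know the layout of every
element. With a shared address space the device could simply follow the
original next pointers.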

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library, and thus each
library might have to duplicate its input data set using the device specific
memory allocator. Large projects suffer from this and waste resources because
of the various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only doable with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


-------------------------------------------------------------------------------

2) I/O bus, device memory characteristics

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from the CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPU can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is roughly 31 times less than the fastest GPU memory
(1 TBytes/s). The final limitation is latency. Access to main memory from the
device has an order of magnitude higher latency than when the device accesses
its own memory.
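The bandwidth gap can be worked out directly from the two figures quoted above.
The sketch below assumes decimal units (1 TBytes/s = 1000 GBytes/s); the macro
and function names are made up for this illustration.

```c
/* Bandwidth figures quoted in the text, in GBytes/s (decimal units assumed). */
#define PCIE4_X16_GBPS   32.0
#define GPU_LOCAL_GBPS 1000.0

/* Seconds needed to stream "gbytes" of data at "gbps" GBytes/s. */
double stream_seconds(double gbytes, double gbps)
{
    return gbytes / gbps;
}

/* How many times slower the bus is than device local memory. */
double bus_slowdown(void)
{
    return GPU_LOCAL_GBPS / PCIE4_X16_GBPS;
}
```

With these numbers an 8 GByte working set takes about a quarter of a second to
cross the bus once, but only about 8 milliseconds to stream from device memory.
This is why migrating actively used data to the device pays off.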

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend
and some major architectures are left without hardware solutions to these
problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


-------------------------------------------------------------------------------

3) Shared address space and migration

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. This is
why HMM provides helpers to factor out everything that can be while leaving the
hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of the device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms, and
everything looks like a page is swapped out to disk from the CPU point of view.
Using a struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror a process
address space and keep both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


-------------------------------------------------------------------------------

4) Address space mirroring implementation and API

Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct:

 int hmm_mirror_register(struct hmm_mirror *mirror,
                         struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
                                struct mm_struct *mm);

The locked variant is to be used when the driver is already holding mmap_sem
of the mm in write mode. The mirror struct has a set of callbacks that are used
to propagate CPU page table updates:

 struct hmm_mirror_ops {
     /* update() - synchronize page tables
      *
      * @mirror: pointer to struct hmm_mirror
      * @action: type of update that occurred to the CPU page table
      * @start: virtual start address of the range to update
      * @end: virtual end address of the range to update
      *
      * This callback ultimately originates from mmu_notifiers when the CPU
      * page table is updated. The device driver must update its page table
      * in response to this callback. The action argument tells what action
      * to perform.
      *
      * The device driver must not return from this callback until the device
      * page tables are completely updated (TLBs flushed, etc); this is a
      * synchronous call.
      */
      void (*update)(struct hmm_mirror *mirror,
                     enum hmm_update action,
                     unsigned long start,
                     unsigned long end);
 };

The device driver must perform the update action on the range (mark range
read only, or fully unmap, ...). The device must be done with the update before
the driver callback returns.

When the device driver wants to populate a range of virtual addresses, it can
use either:

 int hmm_vma_get_pfns(struct vm_area_struct *vma,
                      struct hmm_range *range,
                      unsigned long start,
                      unsigned long end,
                      hmm_pfn_t *pfns);
 int hmm_vma_fault(struct vm_area_struct *vma,
                   struct hmm_range *range,
                   unsigned long start,
                   unsigned long end,
                   hmm_pfn_t *pfns,
                   bool write,
                   bool block);

The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
entries and will not trigger a page fault on missing or non-present entries.
The second one (hmm_vma_fault()) does trigger a page fault on missing entries
or, if the write parameter is true, on read-only entries. Page faults use the
generic mm page fault code path just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument. Each
entry in that array corresponds to an address in the virtual range. HMM
provides a set of flags to help the driver identify special CPU page table
entries.

Locking with the update() callback is the most important aspect the driver must
respect in order to keep things properly synchronized. The usage pattern is:

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...
 again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      take_lock(driver->update);
      if (!hmm_vma_range_done(vma, &range)) {
          release_lock(driver->update);
          goto again;
      }

      // Use pfns array content to update device page table

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
update() callback. That lock must be held before hmm_vma_range_done() to avoid
any race with a concurrent CPU page table update.
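The snapshot, lock, validate, retry pattern above can be exercised in user
space with a toy model. Everything below is an assumption made for
illustration: the toy_* names are invented, the lock is a plain flag, and a
sequence counter stands in for whatever hmm_vma_range_done() checks
internally, which this document does not describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 4

struct toy_mirror {
    unsigned long cpu_pfn[TOY_NPAGES]; /* stands in for the CPU page table */
    unsigned long dev_pfn[TOY_NPAGES]; /* stands in for the device page table */
    unsigned long seq;                 /* bumped by every invalidation */
    bool update_locked;                /* stands in for driver->update */
    int retries;                       /* lets callers observe the retry path */
};

/* Hypothetical hook used to simulate a concurrent CPU page table update. */
typedef void (*toy_racer_t)(struct toy_mirror *m);

void toy_take_lock(struct toy_mirror *m)
{
    assert(!m->update_locked);
    m->update_locked = true;
}

void toy_release_lock(struct toy_mirror *m)
{
    assert(m->update_locked);
    m->update_locked = false;
}

/* Analogue of update(): invalidate [start, end) before returning. */
void toy_update(struct toy_mirror *m, size_t start, size_t end)
{
    toy_take_lock(m);
    for (size_t i = start; i < end; i++)
        m->dev_pfn[i] = 0;
    m->seq++;
    toy_release_lock(m);
}

/* Analogue of driver_populate_range(): snapshot, lock, validate, commit. */
void toy_populate(struct toy_mirror *m, size_t start, size_t end,
                  toy_racer_t racer)
{
    unsigned long pfns[TOY_NPAGES];
    unsigned long seq;

    assert(start <= end && end <= TOY_NPAGES);
again:
    seq = m->seq;                      /* snapshot starts here */
    for (size_t i = start; i < end; i++)
        pfns[i] = m->cpu_pfn[i];

    if (racer) {                       /* CPU page table changes meanwhile */
        racer(m);
        racer = NULL;
    }

    toy_take_lock(m);
    if (seq != m->seq) {               /* snapshot is stale, start over */
        toy_release_lock(m);
        m->retries++;
        goto again;
    }
    for (size_t i = start; i < end; i++)
        m->dev_pfn[i] = pfns[i];
    toy_release_lock(m);
}

/* Example racer: remap page 1, then notify the mirror. */
void toy_remap_page1(struct toy_mirror *m)
{
    m->cpu_pfn[1] = 99;
    toy_update(m, 1, 2);
}
```

Calling toy_populate() with toy_remap_page1 as the racer takes the retry path
once and ends with the device table holding the new entry. Without the
validity check under the lock, the stale entry would have been written to the
device page table.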

HMM implements all this on top of the mmu_notifier API because we wanted a
simpler API and also to be able to perform optimizations later on, like doing
concurrent device updates in multi-device scenarios.

HMM also bridges the impedance mismatch between how CPU page table updates
are done (by CPU write to the page table and TLB flushes) and how devices
update their own page table. Device updates are a multi-step process. First,
appropriate commands are written to a buffer, then this buffer is scheduled for
execution on the device. It is only once the device has executed commands in
the buffer that the update is done. Creating and scheduling the update command
buffer can happen concurrently for multiple devices. Waiting for each device to
report commands as executed is serialized (there is no point in doing this
concurrently).


-------------------------------------------------------------------------------

5) Represent and manage device memory from core kernel point of view

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

HMM provides a set of helpers to register and hotplug device memory as a new
region needing a struct page. This is offered through a very simple API:

 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                   struct device *device,
                                   unsigned long size);
 void hmm_devmem_remove(struct hmm_devmem *devmem);

The hmm_devmem_ops is where most of the important things are:

 struct hmm_devmem_ops {
     void (*free)(struct hmm_devmem *devmem, struct page *page);
     int (*fault)(struct hmm_devmem *devmem,
                  struct vm_area_struct *vma,
                  unsigned long addr,
                  struct page *page,
                  unsigned flags,
                  pmd_t *pmdp);
 };

The first callback (free()) happens when the last reference on a device page is
dropped. This means the device page is now free and no longer used by anyone.
The second callback happens whenever the CPU tries to access a device page
which it cannot do. This second callback must trigger a migration back to
system memory.
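The contract of the two callbacks can be modelled in a few lines of user space
C. This is only a sketch under loud assumptions: the toy_* types and functions
are invented, the callback signatures are heavily simplified compared to
hmm_devmem_ops, and a single object with a flag stands in for what would really
be two different pages (one in device memory, one in system memory).

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page;

/* Toy counterpart of hmm_devmem_ops, with simplified signatures. */
struct toy_devmem_ops {
    void (*free)(struct toy_page *page);
    int (*fault)(struct toy_page *page);
};

struct toy_page {
    const struct toy_devmem_ops *ops;
    int refcount;
    bool in_device_memory;
    int data;
};

/* Drop one reference; free() runs when the last reference goes away. */
void toy_put_page(struct toy_page *page)
{
    assert(page->refcount > 0);
    if (--page->refcount == 0)
        page->ops->free(page);
}

/* CPU read: device memory cannot be read directly, so fault() runs first. */
int toy_cpu_read(struct toy_page *page, int *out)
{
    if (page->in_device_memory) {
        int ret = page->ops->fault(page);

        if (ret)
            return ret;
    }
    *out = page->data;
    return 0;
}

/* Example driver side callbacks. */
int toy_free_calls;

void toy_driver_free(struct toy_page *page)
{
    (void)page;
    toy_free_calls++;   /* a real driver would recycle the device page */
}

int toy_driver_fault(struct toy_page *page)
{
    /* A real driver would copy the data back; the toy only flips a flag. */
    page->in_device_memory = false;
    return 0;
}

const struct toy_devmem_ops toy_ops = {
    .free = toy_driver_free,
    .fault = toy_driver_fault,
};
```

A CPU read of a page that sits in device memory goes through fault() exactly
once. After that the data is reachable directly, and free() only fires when
the final reference is dropped.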


-------------------------------------------------------------------------------

6) Migration to and from device memory

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copies from and to device memory. For this we need a new
migration helper:

 int migrate_vma(const struct migrate_vma_ops *ops,
                 struct vm_area_struct *vma,
                 unsigned long mentries,
                 unsigned long start,
                 unsigned long end,
                 unsigned long *src,
                 unsigned long *dst,
                 void *private);

Unlike other migration functions, it works on a range of virtual addresses.
There are two reasons for that. First, device DMA copy has a high setup
overhead cost and thus batching multiple pages is needed as otherwise the
migration overhead makes the whole exercise pointless. The second reason is
that the migration might be for a range of addresses the device is actively
accessing.

The migrate_vma_ops struct defines two callbacks. The first one
(alloc_and_copy()) controls destination memory allocation and the copy
operation. The second one is there to allow the device driver to perform
cleanup operations after migration.

 struct migrate_vma_ops {
     void (*alloc_and_copy)(struct vm_area_struct *vma,
                            const unsigned long *src,
                            unsigned long *dst,
                            unsigned long start,
                            unsigned long end,
                            void *private);
     void (*finalize_and_map)(struct vm_area_struct *vma,
                              const unsigned long *src,
                              const unsigned long *dst,
                              unsigned long start,
                              unsigned long end,
                              void *private);
 };

It is important to stress that these migration helpers allow for holes in the
virtual address range. Some pages in the range might not be migrated for all
the usual reasons (page is pinned, page is locked, ...). This helper does not
fail but just skips over those pages.

The alloc_and_copy() callback might decide to not migrate all pages in the
range (for reasons under the callback's control). For those, the callback just
has to leave the corresponding dst entry empty.

Finally, the migration of the struct page might fail (for file backed page) for
various reasons (failure to freeze reference, or update page cache, ...). If
that happens, then the finalize_and_map() callback can catch any pages that
were not migrated. Note those pages were still copied to a new page and thus we
wasted bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.
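The way holes, skipped pages and the two callbacks interact can be sketched
with a toy model. All names below (toy_page, toy_migrate_range(), the
TOY_MIGRATE flag) are assumptions for illustration only; they are not the
kernel API and the callback signatures are simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MIGRATE 1UL   /* hypothetical src flag: page can be/was migrated */

struct toy_page {
    bool present;         /* false models a hole in the range */
    bool pinned;          /* true models a page that cannot be migrated */
    bool in_device;       /* set once the page has been migrated */
};

struct toy_migrate_ops {
    void (*alloc_and_copy)(const unsigned long *src, unsigned long *dst,
                           size_t npages, void *private);
    void (*finalize_and_map)(const unsigned long *src,
                             const unsigned long *dst,
                             size_t npages, void *private);
};

/* Toy helper: never fails because of an individual page, it just skips it. */
int toy_migrate_range(const struct toy_migrate_ops *ops,
                      struct toy_page *pages, size_t npages,
                      unsigned long *src, unsigned long *dst, void *private)
{
    for (size_t i = 0; i < npages; i++) {
        bool ok = pages[i].present && !pages[i].pinned;

        src[i] = ok ? TOY_MIGRATE : 0;
        dst[i] = 0;                       /* empty until the driver fills it */
    }

    ops->alloc_and_copy(src, dst, npages, private);

    for (size_t i = 0; i < npages; i++) {
        if ((src[i] & TOY_MIGRATE) && dst[i])
            pages[i].in_device = true;
        else
            src[i] &= ~TOY_MIGRATE;       /* tell finalize it did not move */
    }

    ops->finalize_and_map(src, dst, npages, private);
    return 0;
}

/* Example driver callbacks. */
void toy_alloc_and_copy(const unsigned long *src, unsigned long *dst,
                        size_t npages, void *private)
{
    (void)private;
    for (size_t i = 0; i < npages; i++) {
        if (!(src[i] & TOY_MIGRATE))
            continue;                     /* hole or unmigratable page */
        if (i % 2)
            continue;                     /* driver policy: leave dst empty */
        dst[i] = 100 + i;                 /* pretend destination page handle */
    }
}

void toy_finalize_and_map(const unsigned long *src, const unsigned long *dst,
                          size_t npages, void *private)
{
    int *migrated = private;

    (void)dst;
    for (size_t i = 0; i < npages; i++)
        if (src[i] & TOY_MIGRATE)
            (*migrated)++;
}
```

With five pages of which one is a hole and one is pinned, and a driver policy
that skips odd indices, only two pages end up migrated, yet the helper itself
still returns success.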


-------------------------------------------------------------------------------

7) Memory cgroup (memcg) and rss accounting

For now device memory is accounted as any regular page in rss counters (either
anonymous if device page is used for anonymous, file if device page is used for
file backed page or shmem if device page is used for shared memory). This is a
deliberate choice to keep existing applications, that might start using device
memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory and thus not freeing much
system memory. We want to gather more real world experience on how applications
and system react under memory pressure in the presence of device memory before
deciding to account device memory differently.

The same decision was made for memory cgroup. Device memory pages are accounted
against the same memory cgroup a regular page would be accounted to. This does
simplify migration to and from device memory. This also means that migration
back from device memory to regular memory cannot fail because it would
go above memory cgroup limit. We might revisit this choice later on once we
get more experience in how device memory is used and its impact on memory
resource control.

Note that device memory can never be pinned by a device driver or through GUP,
and thus such memory is always freed upon process exit, or when the last
reference is dropped in the case of shared memory or file backed memory.