=============================
NO-MMU MEMORY MAPPING SUPPORT
=============================

The kernel has limited support for memory mapping under no-MMU conditions, such
as are used in uClinux environments. From the userspace point of view, memory
mapping is used through the mmap() system call, the shmat() call and the
execve() system call. From the kernel's point of view, execve() mapping is
actually performed by the binfmt drivers, which call back into the mmap()
routines to do the actual work.

Memory mapping behaviour also involves the way fork(), vfork(), clone() and
ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
the CLONE_VM flag.
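
As a rough illustration of the point about fork(), the sketch below starts a
child with vfork() followed immediately by exec, which is the usual substitute
when fork() is unavailable. The function name run_child() is hypothetical and
the error handling is deliberately minimal.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a program in a child created with vfork() and return its exit status,
 * or -1 on failure. run_child() is a hypothetical helper name.
 */
int run_child(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: it borrows the parent's memory, so only exec or _exit. */
        execvp(argv[0], argv);
        _exit(127);     /* exec failed */
    }

    /* Parent: resumes once the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child must not return from the calling function or modify the parent's
data before exec, since both run in the same address space until then.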

The behaviour is similar between the MMU and no-MMU cases, but not identical;
and it's also much more restricted in the latter case:

(*) Anonymous mapping, MAP_PRIVATE

    In the MMU case: VM regions backed by arbitrary pages; copy-on-write
    across fork.

    In the no-MMU case: VM regions backed by arbitrary contiguous runs of
    pages.

(*) Anonymous mapping, MAP_SHARED

    These behave very much like private mappings, except that they're
    shared across fork() or clone() without CLONE_VM in the MMU case. Since
    the no-MMU case supports neither of those operations, behaviour is
    identical to MAP_PRIVATE there.

(*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE

    In the MMU case: VM regions backed by pages read from file; changes to
    the underlying file are reflected in the mapping; copied across fork.

    In the no-MMU case:

    - If one exists, the kernel will re-use an existing mapping to the
      same segment of the same file if that has compatible permissions,
      even if this was created by another process.

    - If possible, the file mapping will be directly on the backing device
      if the backing device has the BDI_CAP_MAP_DIRECT capability and
      appropriate mapping protection capabilities. Ramfs, romfs, cramfs
      and mtd might all permit this.

    - If the backing device can't or won't permit direct sharing, but
      does have the BDI_CAP_MAP_COPY capability, then a copy of the
      appropriate bit of the file will be read into a contiguous bit of
      memory and any extraneous space beyond the EOF will be cleared.

    - Writes to the file do not affect the mapping; writes to the mapping
      are visible in other processes (no MMU protection), but should not
      happen.

(*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE

    In the MMU case: like the non-PROT_WRITE case, except that the pages in
    question get copied before the write actually happens. From that point
    on writes to the file underneath that page no longer get reflected into
    the mapping's backing pages. The page is then backed by swap instead.

    In the no-MMU case: works much like the non-PROT_WRITE case, except
    that a copy is always taken and never shared.

(*) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

    In the MMU case: VM regions backed by pages read from file; changes to
    pages written back to file; writes to file reflected into pages backing
    mapping; shared across fork.

    In the no-MMU case: not supported.

(*) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

    In the MMU case: As for ordinary regular files.

    In the no-MMU case: The filesystem providing the memory-backed file
    (such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
    sequence by providing a contiguous sequence of pages to map. In that
    case, a shared-writable memory mapping will be possible. It will work
    as for the MMU case. If the filesystem does not provide any such
    support, then the mapping request will be denied.

(*) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

    In the MMU case: As for ordinary regular files.

    In the no-MMU case: As for memory backed regular files, but the
    blockdev must be able to provide a contiguous run of pages without
    truncate being called. The ramdisk driver could do this if it allocated
    all its memory as a contiguous array upfront.

(*) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

    In the MMU case: As for ordinary regular files.

    In the no-MMU case: The character device driver may choose to honour
    the mmap() by providing direct access to the underlying device if it
    provides memory or quasi-memory that can be accessed directly. Examples
    of such are frame buffers and flash devices. If the driver does not
    provide any such support, then the mapping request will be denied.


============================
FURTHER NOTES ON NO-MMU MMAP
============================

(*) A request for a private mapping of a file may return a buffer that is not
    page-aligned. This is because XIP may take place, and the data may not be
    page-aligned in the backing store.

(*) A request for an anonymous mapping will always be page-aligned. If
    possible, the size of the request should be a power of two; otherwise some
    of the space may be wasted, as the kernel must allocate a power-of-2
    granule but will only discard the excess if appropriately configured, as
    this has an effect on fragmentation.

(*) The memory allocated by a request for an anonymous mapping will normally
    be cleared by the kernel before being returned in accordance with the
    Linux man pages (ver 2.22 or later).

    In the MMU case this can be achieved with reasonable performance as
    regions are backed by virtual pages, with the contents only being mapped
    to cleared physical pages when a write happens on that specific page
    (prior to which, the pages are effectively mapped to the global zero page
    from which reads can take place). This spreads out the time it takes to
    initialize the contents of a page - depending on the write-usage of the
    mapping.

    In the no-MMU case, however, anonymous mappings are backed by physical
    pages, and the entire map is cleared at allocation time. This can cause
    significant delays during a userspace malloc() as the C library does an
    anonymous mapping and the kernel then does a memset for the entire map.

    However, for memory that isn't required to be precleared - such as that
    returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
    indicate to the kernel that it shouldn't bother clearing the memory before
    returning it. Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
    to permit this, otherwise the flag will be ignored.

    uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
    to allocate the brk and stack region.

(*) A list of all the private copy and anonymous mappings on the system is
    visible through /proc/maps in no-MMU mode.

(*) A list of all the mappings in use by a process is visible through
    /proc/<pid>/maps in no-MMU mode.

(*) Supplying MAP_FIXED or requesting a particular mapping address will
    result in an error.

(*) Files mapped privately usually have to have a read method provided by the
    driver or filesystem so that the contents can be read into the memory
    allocated if mmap() chooses not to map the backing device directly. An
    error will result if they don't. This is most likely to be encountered
    with character device files, pipes, fifos and sockets.


==========================
INTERPROCESS SHARED MEMORY
==========================

Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU
mode. The former through the usual mechanism, the latter through files created
on ramfs or tmpfs mounts.
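
The file-based route can be sketched as an open, truncate, mmap sequence. The
helper name shm_map() is hypothetical, and the caller is assumed to pass a
path on a ramfs or tmpfs mount when running without an MMU.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) a file, size it, and map it MAP_SHARED. Returns the
 * mapping, or NULL on failure. shm_map() is a hypothetical helper name.
 */
void *shm_map(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(path != NULL && len > 0);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Size the file before mapping it; the mapping needs pages to refer to. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the descriptor is not needed once the mapping exists */
    return p == MAP_FAILED ? NULL : p;
}
```

Two processes calling shm_map() with the same path and length would then see
each other's writes; synchronisation between them is left out of the sketch.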


=======
FUTEXES
=======

Futexes are supported in NOMMU mode if the arch supports them. An error will
be given if an address passed to the futex system call lies outside the
mappings made by a process or if the mapping in which the address lies does not
support futexes (such as an I/O chardev mapping).
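
For reference, a minimal sketch of calling the futex system call directly is
shown below. The wrapper names wait_on_word() and wake_word() are
hypothetical; the word passed in has to live inside one of the process's
ordinary memory mappings for the call to be accepted.

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The C library offers no futex() function, so go through syscall(). */
static long futex_call(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Sleep while *word == expected. Returns 0 when woken, or -1 with errno
 * set: EAGAIN if *word already differs from expected.
 */
int wait_on_word(uint32_t *word, uint32_t expected)
{
    assert(word != NULL);
    return (int)futex_call(word, FUTEX_WAIT, expected);
}

/* Wake up to count waiters on word. Returns the number woken, or -1. */
int wake_word(uint32_t *word, int count)
{
    assert(word != NULL && count > 0);
    return (int)futex_call(word, FUTEX_WAKE, (uint32_t)count);
}
```

Callers should check errno after a -1 return, since an address outside the
process's mappings is reported as an error rather than causing a fault.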


=============
NO-MMU MREMAP
=============

The mremap() function is partially supported. It may change the size of a
mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size
of the mapping exceeds the size of the slab object currently occupied by the
memory to which the mapping refers, or if a smaller slab object could be used.

MREMAP_FIXED is not supported, though it is ignored if there's no change of
address and the object does not need to be moved.

Shared mappings may not be moved. Shareable mappings may not be moved either,
even if they are not currently shared.

The mremap() function must be given an exact match for base address and size of
a previously mapped object. It may not be used to create holes in existing
mappings, move parts of existing mappings or resize parts of mappings. It must
act on a complete mapping.

[*] Not currently supported.
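
A call that stays within these rules might look like the sketch below: it
resizes a whole mapping, passes the exact base and length that were originally
mapped, and uses MREMAP_MAYMOVE without MREMAP_FIXED. The helper name
resize_mapping() is hypothetical.

```c
#define _GNU_SOURCE     /* for mremap() and MREMAP_MAYMOVE */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Resize an entire private mapping. old_len must be the exact length that
 * was mapped at base. Returns the (possibly moved) mapping, or NULL on
 * failure, in which case the old mapping is left in place.
 */
void *resize_mapping(void *base, size_t old_len, size_t new_len)
{
    void *p;

    assert(base != NULL && old_len > 0 && new_len > 0);

    /* MREMAP_MAYMOVE only: MREMAP_FIXED is not supported without an MMU. */
    p = mremap(base, old_len, new_len, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}
```

Because moving is marked as not currently supported, portable callers should
be prepared for a grow request to fail and fall back to allocating a new
mapping and copying.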


============================================
PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT
============================================

To provide shareable character device support, a driver must provide a
file->f_op->get_unmapped_area() operation. The mmap() routines will call this
to get a proposed address for the mapping. This may return an error if it
doesn't wish to honour the mapping because it's too long, at a weird offset,
under some unsupported combination of flags or whatever.

The driver should also provide backing device information with capabilities set
to indicate the permitted types of mapping on such devices. The default is
assumed to be readable and writable, not executable, and only shareable
directly (can't be copied).

The file->f_op->mmap() operation will be called to actually inaugurate the
mapping. It can be rejected at that point. Returning the ENOSYS error will
cause the mapping to be copied instead if BDI_CAP_MAP_COPY is specified.

The vm_ops->close() routine will be invoked when the last mapping on a chardev
is removed. An existing mapping will be shared, partially or not, if possible
without notifying the driver.

It is permitted also for the file->f_op->get_unmapped_area() operation to
return -ENOSYS. This will be taken to mean that this operation just doesn't
want to handle it, despite the fact it's got an operation. For instance, it
might try directing the call to a secondary driver which turns out not to
implement it. Such is the case for the framebuffer driver which attempts to
direct the call to the device-specific driver. Under such circumstances, the
mapping request will be rejected if BDI_CAP_MAP_COPY is not specified, and a
copy mapped otherwise.
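
The outcomes described above can be summarised as a small userspace model.
This is not kernel code: the enum, the function resolve_mapping() and its
parameters are hypothetical names, and treating every error other than ENOSYS
as a plain rejection is an assumption drawn from the text rather than from the
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for this model; they are not kernel identifiers. */
enum map_outcome {
    MAP_OUTCOME_DIRECT,     /* mapped straight onto the device */
    MAP_OUTCOME_COPY,       /* contents copied into allocated memory */
    MAP_OUTCOME_REJECTED    /* mmap() fails */
};

/*
 * driver_result: 0 or a negative errno returned by the driver's
 *                get_unmapped_area() or mmap() operation.
 * cap_map_copy:  whether the backing device info sets BDI_CAP_MAP_COPY.
 */
enum map_outcome resolve_mapping(int driver_result, bool cap_map_copy)
{
    assert(driver_result <= 0);

    if (driver_result == 0)
        return MAP_OUTCOME_DIRECT;

    /* -ENOSYS means "not handled here": fall back to a copy if permitted. */
    if (driver_result == -ENOSYS && cap_map_copy)
        return MAP_OUTCOME_COPY;

    /* ASSUMPTION: any other error is treated as a refusal. */
    return MAP_OUTCOME_REJECTED;
}
```

The model leaves out sharing of existing mappings and protection checks, which
the kernel would also take into account.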

IMPORTANT NOTE:

    Some types of device may present a different appearance to anyone
    looking at them in certain modes. Flash chips can be like this; for
    instance if they're in programming or erase mode, you might see the
    status reflected in the mapping, instead of the data.

    In such a case, care must be taken lest userspace see a shared or a
    private mapping showing such information when the driver is busy
    controlling the device. Remember especially: private executable
    mappings may still be mapped directly off the device under some
    circumstances!


==============================================
PROVIDING SHAREABLE MEMORY-BACKED FILE SUPPORT
==============================================

Provision of shared mappings on memory backed files is similar to the provision
of support for shared mapped character devices. The main difference is that the
filesystem providing the service will probably allocate a contiguous collection
of pages and permit mappings to be made on that.

It is recommended that a truncate operation applied to such a file that
increases the file size, if that file is empty, be taken as a request to gather
enough pages to honour a mapping. This is required to support POSIX shared
memory.

Memory backed devices are indicated by the mapping's backing device info having
the memory_backed flag set.


========================================
PROVIDING SHAREABLE BLOCK DEVICE SUPPORT
========================================

Provision of shared mappings on block device files is exactly the same as for
character devices. If there isn't a real device underneath, then the driver
should allocate sufficient contiguous memory to honour any supported mapping.


=================================
ADJUSTING PAGE TRIMMING BEHAVIOUR
=================================

NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
when performing an allocation. This can have adverse effects on memory
fragmentation, and as such, is left configurable. The default behaviour is to
aggressively trim allocations and discard any excess pages back into the page
allocator. In order to retain finer-grained control over fragmentation, this
behaviour can either be disabled completely, or bumped up to a higher page
watermark where trimming begins.

Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'.
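
The effect of the watermark can be illustrated with a small arithmetic model.
It is a sketch under stated assumptions, not the kernel's code: the function
names are hypothetical, a value of 0 is assumed to disable trimming, and any
other value is assumed to trim the whole excess once the excess reaches that
many pages.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= n, for n >= 1. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    assert(n >= 1 && n <= ((size_t)-1 >> 1) + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Pages a mapping of `request` pages ends up holding for a given
 * vm.nr_trim_pages setting (see the ASSUMPTIONS in the text above).
 */
size_t pages_retained(size_t request, size_t nr_trim_pages)
{
    size_t granule = round_up_pow2(request);    /* what gets allocated */
    size_t excess = granule - request;

    if (nr_trim_pages != 0 && excess >= nr_trim_pages)
        return request;     /* excess goes back to the page allocator */
    return granule;         /* excess stays with the mapping */
}
```

Under these assumptions a 5-page request allocates an 8-page granule; with a
watermark of 1 the 3 excess pages are trimmed, while with a watermark of 4 or
with trimming disabled all 8 pages are kept.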