| 1 | ramfs, rootfs and initramfs |
| 2 | October 17, 2005 |
| 3 | Rob Landley <rob@landley.net> |
| 4 | ============================= |
| 5 | |
| 6 | What is ramfs? |
| 7 | -------------- |
| 8 | |
| 9 | Ramfs is a very simple filesystem that exports Linux's disk caching |
| 10 | mechanisms (the page cache and dentry cache) as a dynamically resizable |
| 11 | ram-based filesystem. |
| 12 | |
| 13 | Normally all files are cached in memory by Linux. Pages of data read from |
| 14 | backing store (usually the block device the filesystem is mounted on) are kept |
| 15 | around in case they're needed again, but marked as clean (freeable) in case the |
| 16 | Virtual Memory system needs the memory for something else. Similarly, data |
| 17 | written to files is marked clean as soon as it has been written to backing |
| 18 | store, but kept around for caching purposes until the VM reallocates the |
| 19 | memory. A similar mechanism (the dentry cache) greatly speeds up access to |
| 20 | directories. |
| 21 | |
| 22 | With ramfs, there is no backing store. Files written into ramfs allocate |
| 23 | dentries and page cache as usual, but there's nowhere to write them to. |
| 24 | This means the pages are never marked clean, so they can't be freed by the |
| 25 | VM when it's looking to recycle memory. |
| 26 | |
| 27 | The amount of code required to implement ramfs is tiny, because all the |
| 28 | work is done by the existing Linux caching infrastructure. Basically, |
| 29 | you're mounting the disk cache as a filesystem. Because of this, ramfs is not |
| 30 | an optional component removable via menuconfig, since there would be negligible |
| 31 | space savings. |
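
As a rough illustration of how little is involved, the sketch below wraps the
mount call in a shell function and adds a helper to watch the page cache.
The names mount_ramfs and page_cache_kb and the path /mnt/ram are placeholders
made up for this example, and the mount itself requires root.

```shell
#!/bin/sh
# Sketch only: function names and the default path are illustrative.

# Mount a ramfs instance on the given directory (requires root).
# ramfs has no backing device, so the "device" argument is just a label.
mount_ramfs() {
    dir=${1:-/mnt/ram}
    mkdir -p "$dir" && mount -t ramfs ramfs "$dir"
}

# Print the page cache size in kB as reported by /proc/meminfo.
# Data written into a ramfs mount lives in the page cache, so it should
# be reflected in this figure.
page_cache_kb() {
    awk '$1 == "Cached:" { print $2 }' /proc/meminfo
}
```

As root, call mount_ramfs /mnt/ram, write a large file into the mount, and
compare page_cache_kb before and after.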
| 32 | |
| 33 | ramfs and ramdisk: |
| 34 | ------------------ |
| 35 | |
| 36 | The older "ram disk" mechanism created a synthetic block device out of |
| 37 | an area of ram and used it as backing store for a filesystem. This block |
| 38 | device was of fixed size, so the filesystem mounted on it was of fixed |
| 39 | size. Using a ram disk also required unnecessarily copying memory from the |
| 40 | fake block device into the page cache (and copying changes back out), as well |
| 41 | as creating and destroying dentries. Plus it needed a filesystem driver |
| 42 | (such as ext2) to format and interpret this data. |
| 43 | |
| 44 | Compared to ramfs, this wastes memory (and memory bus bandwidth), creates |
| 45 | unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks |
| 46 | to avoid this copying by playing with the page tables, but they're unpleasantly |
| 47 | complicated and turn out to be about as expensive as the copying anyway.) |
| 48 | More to the point, all the work ramfs is doing has to happen _anyway_, |
| 49 | since all file access goes through the page and dentry caches. The ram |
| 50 | disk is simply unnecessary; ramfs is internally much simpler. |
| 51 | |
| 52 | Another reason ramdisks are semi-obsolete is that the introduction of |
| 53 | loopback devices offered a more flexible and convenient way to create |
| 54 | synthetic block devices, now from files instead of from chunks of memory. |
| 55 | See losetup(8) for details. |
| 56 | |
| 57 | ramfs and tmpfs: |
| 58 | ---------------- |
| 59 | |
| 60 | One downside of ramfs is you can keep writing data into it until you fill |
| 61 | up all memory, and the VM can't free it because the VM thinks that files |
| 62 | should get written to backing store (rather than swap space), but ramfs hasn't |
| 63 | got any backing store. Because of this, only root (or a trusted user) should |
| 64 | be allowed write access to a ramfs mount. |
| 65 | |
| 66 | A ramfs derivative called tmpfs was created to add size limits, and the ability |
| 67 | to write the data to swap space. Normal users can be allowed write access to |
| 68 | tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. |
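
For comparison, a size-limited tmpfs mount might look like the sketch below.
The function name mount_tmpfs, the path /mnt/tmp and the 16m default are
arbitrary choices for this example, and the mount requires root.

```shell
#!/bin/sh
# Sketch only: mount_tmpfs, the default path and the default size are
# illustrative choices, not anything mandated by the kernel.

# Mount a tmpfs instance capped at the given size (requires root).
# Once the cap is reached, further writes fail rather than consuming
# all available memory.
mount_tmpfs() {
    dir=${1:-/mnt/tmp}
    size=${2:-16m}
    mkdir -p "$dir" && mount -t tmpfs -o "size=$size" tmpfs "$dir"
}
```

As root, mount_tmpfs /mnt/tmp 64m gives a mount that refuses writes past
64 megabytes.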
| 69 | |
| 70 | What is rootfs? |
| 71 | --------------- |
| 72 | |
| 73 | Rootfs is a special instance of ramfs, which is always present in 2.6 systems. |
| 74 | (It's used internally as the starting and stopping point for searches of the |
| 75 | kernel's doubly-linked list of mount points.) |
| 76 | |
| 77 | Most systems just mount another filesystem over it and ignore it. The |
| 78 | amount of space an empty instance of ramfs takes up is tiny. |
| 79 | |
| 80 | What is initramfs? |
| 81 | ------------------ |
| 82 | |
| 83 | All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is |
| 84 | extracted into rootfs when the kernel boots up. After extracting, the kernel |
| 85 | checks to see if rootfs contains a file "init", and if so it executes it as PID |
| 86 | 1. If found, this init process is responsible for bringing the system the |
| 87 | rest of the way up, including locating and mounting the real root device (if |
| 88 | any). If rootfs does not contain an init program after the embedded cpio |
| 89 | archive is extracted into it, the kernel will fall through to the older code |
| 90 | to locate and mount a root partition, then exec some variant of /sbin/init |
| 91 | out of that. |
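
To make that sequence concrete, here is a minimal sketch of an /init written
as a shell script and staged into a scratch directory. It assumes the archive
provides a busybox-style /bin/sh, mount and switch_root; the staging directory
and the root device /dev/sda1 are placeholders invented for this example.

```shell
#!/bin/sh
# Sketch only: the staging directory and /dev/sda1 are placeholders.
stage=$(mktemp -d)

# The quoted 'EOF' keeps the script text literal (nothing is expanded here).
cat > "$stage/init" << 'EOF'
#!/bin/sh
# Runs as PID 1 out of rootfs and must never return to the kernel.
mount -t proc proc /proc
mount -t sysfs sysfs /sys

# Mount the real root device. /dev/sda1 is hypothetical, and its device
# node has to exist in the archive's /dev for this to work.
mount -o ro /dev/sda1 /mnt

# Hand off to the init program on the real root.
exec switch_root /mnt /sbin/init
EOF

chmod 755 "$stage/init"
echo "staged init script in $stage"
```

The staged file would then be installed as /init in the initramfs archive.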
| 92 | |
| 93 | All this differs from the old initrd in several ways: |
| 94 | |
| 95 | - The old initrd was a separate file, while the initramfs archive is linked |
| 96 | into the linux kernel image. (The directory linux-*/usr is devoted to |
| 97 | generating this archive during the build.) |
| 98 | |
| 99 | - The old initrd file was a gzipped filesystem image (in some file format, |
| 100 | such as ext2, that had to be built into the kernel), while the new |
| 101 | initramfs archive is a gzipped cpio archive (like tar only simpler, |
| 102 | see cpio(1) and Documentation/early-userspace/buffer-format.txt). |
| 103 | |
| 104 | - The program run by the old initrd (which was called /linuxrc, not /init) did |
| 105 | some setup and then returned to the kernel, while the init program from |
| 106 | initramfs is not expected to return to the kernel. (If /init needs to hand |
| 107 | off control it can overmount / with a new root device and exec another init |
| 108 | program. See the switch_root utility, below.) |
| 109 | |
| 110 | - When switching to another root device, initrd would pivot_root and then |
| 111 | umount the ramdisk. But initramfs is rootfs: you can neither pivot_root |
| 112 | rootfs, nor unmount it. Instead delete everything out of rootfs to |
| 113 | free up the space (find / -xdev -exec rm '{}' ';'), overmount rootfs |
| 114 | with the new root (cd /newmount; mount --move . /; chroot .), attach |
| 115 | stdin/stdout/stderr to the new /dev/console, and exec the new init. |
| 116 | |
| 117 | Since this is a remarkably persnickety process (and involves deleting |
| 118 | commands before you can run them), the klibc package introduced a helper |
| 119 | program (utils/run_init.c) to do all this for you. Most other packages |
| 120 | (such as busybox) have named this command "switch_root". |
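
A hedged sketch of how an /init script might wrap that helper follows. It
assumes the busybox switch_root applet is present in the initramfs; the
function name handoff and the default paths are made up for this example.

```shell
#!/bin/sh
# Sketch only: handoff, /newroot and /sbin/init are placeholder names.
# Meant to be used from /init inside an initramfs, never on a live system.

# Hand control to the real root filesystem already mounted at $1.
# switch_root empties rootfs, moves the new root over /, and execs the
# new init, so this function does not return on success. It expects to
# run as PID 1 with the new root as a mount point.
handoff() {
    newroot=${1:-/newroot}
    newinit=${2:-/sbin/init}
    # Bail out early if the new root lacks the init we are about to exec.
    # A symlink is accepted as-is, since an absolute link target cannot
    # be checked meaningfully from the old root.
    if [ ! -x "$newroot$newinit" ] && [ ! -L "$newroot$newinit" ]; then
        echo "handoff: $newroot$newinit not found" >&2
        return 1
    fi
    exec switch_root "$newroot" "$newinit"
}
```

Returning an error instead of calling switch_root blindly leaves /init the
chance to drop to a rescue shell while rootfs is still intact.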
| 121 | |
| 122 | Populating initramfs: |
| 123 | --------------------- |
| 124 | |
| 125 | The 2.6 kernel build process always creates a gzipped cpio format initramfs |
| 126 | archive and links it into the resulting kernel binary. By default, this |
| 127 | archive is empty (consuming 134 bytes on x86). The config option |
| 128 | CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices |
| 129 | in menuconfig, and living in usr/Kconfig) can be used to specify a source for |
| 130 | the initramfs archive, which will automatically be incorporated into the |
| 131 | resulting binary. This option can point to an existing gzipped cpio archive, a |
| 132 | directory containing files to be archived, or a text file specification such |
| 133 | as the following example: |
| 134 | |
| 135 |   dir /dev 755 0 0 |
| 136 |   nod /dev/console 644 0 0 c 5 1 |
| 137 |   nod /dev/loop0 644 0 0 b 7 0 |
| 138 |   dir /bin 755 1000 1000 |
| 139 |   slink /bin/sh busybox 777 0 0 |
| 140 |   file /bin/busybox initramfs/busybox 755 0 0 |
| 141 |   dir /proc 755 0 0 |
| 142 |   dir /sys 755 0 0 |
| 143 |   dir /mnt 755 0 0 |
| 144 |   file /init initramfs/init.sh 755 0 0 |
| 145 | |
| 146 | One advantage of the text file is that root access is not required to |
| 147 | set permissions or create device nodes in the new archive. (Note that those |
| 148 | two example "file" entries expect to find files named "init.sh" and "busybox" in |
| 149 | a directory called "initramfs", under the linux-2.6.* directory. See |
| 150 | Documentation/early-userspace/README for more details.) |
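
For the "existing gzipped cpio archive" route, something like the following
sketch can pack a directory by hand. It assumes a cpio that supports the
"newc" format (GNU cpio and busybox both do); the function name
make_initramfs and its arguments are invented for this example.

```shell
#!/bin/sh
# Sketch only: the function name and argument layout are illustrative.

# Pack directory $1 into a gzipped "newc" format cpio archive at $2.
# Changing into the directory first keeps member names relative (./init).
# Ownership is recorded as found on disk, so unlike the text file
# specification this route generally needs root to produce root-owned
# files and device nodes.
make_initramfs() {
    src=$1
    out=$2
    ( cd "$src" && find . | cpio -o -H newc | gzip -9 ) > "$out"
}
```

The resulting file is what CONFIG_INITRAMFS_SOURCE would point to, for
example after running make_initramfs ./rootdir ./initramfs.cpio.gz.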
| 151 | |
| 152 | If you don't already understand what shared libraries, devices, and paths |
| 153 | you need to get a minimal root filesystem up and running, here are some |
| 154 | references: |
| 155 | http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ |
| 156 | http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html |
| 157 | http://www.linuxfromscratch.org/lfs/view/stable/ |
| 158 | |
| 159 | The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is |
| 160 | designed to be a tiny C library to statically link early userspace |
| 161 | code against, along with some related utilities. It is BSD licensed. |
| 162 | |
| 163 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) |
| 164 | myself. These are LGPL and GPL, respectively. |
| 165 | |
| 166 | In theory you could use glibc, but that's not well suited for small embedded |
| 167 | uses like this. (A "hello world" program statically linked against glibc is |
| 168 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do |
| 169 | name lookups, even when otherwise statically linked.) |
| 170 | |
| 171 | Future directions: |
| 172 | ------------------ |
| 173 | |
| 174 | Today (2.6.14), initramfs is always compiled in, but not always used. The |
| 175 | kernel falls back to legacy boot code that is reached only if initramfs does |
| 176 | not contain an /init program. The fallback is legacy code, there to ensure a |
| 177 | smooth transition and to allow early boot functionality to move gradually to |
| 178 | "early userspace" (i.e. initramfs). |
| 179 | |
| 180 | The move to early userspace is necessary because finding and mounting the real |
| 181 | root device is complex. Root partitions can span multiple devices (raid or |
| 182 | separate journal). They can be out on the network (requiring dhcp, setting a |
| 183 | specific mac address, logging into a server, etc). They can live on removable |
| 184 | media, with dynamically allocated major/minor numbers and persistent naming |
| 185 | issues requiring a full udev implementation to sort out. They can be |
| 186 | compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, |
| 187 | and so on. |
| 188 | |
| 189 | This kind of complexity (which inevitably includes policy) is rightly handled |
| 190 | in userspace. Both klibc and busybox/uClibc are working on simple initramfs |
| 191 | packages to drop into a kernel build, and when standard solutions are ready |
| 192 | and widely deployed, the kernel's legacy early boot code will become obsolete |
| 193 | and a candidate for the feature removal schedule. |
| 194 | |
| 195 | But that's a while off yet. |