Joern Engel | 5db53f3 | 2009-11-20 20:13:39 +0100

The LogFS Flash Filesystem
==========================

Specification
=============

Superblocks
-----------

Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.

Superblock locations may differ for MTD and block devices. On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.

For the most part, the superblocks can be considered read-only. They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.

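The block-device placement rule can be sketched as follows. This is a
hypothetical helper, not the logfs API; the interpretation of "last
aligned 4096 Byte-block" for devices whose size is not a multiple of
4096 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define LOGFS_SB_SIZE 4096	/* 256 used Bytes + 3840 reserved */

/* Compute the byte offsets of the two superblock copies on a block
 * device of the given size.  The second copy occupies the last
 * 4096-byte-aligned block that fits entirely on the device. */
static void sb_offsets(uint64_t dev_size, uint64_t *first, uint64_t *second)
{
	*first = 0;
	*second = ((dev_size - LOGFS_SB_SIZE) / LOGFS_SB_SIZE) * LOGFS_SB_SIZE;
}
```

For a 1 MiB device this yields offsets 0 and 1044480 (1048576 - 4096).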
Segments
--------

The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS. Within each segment,
writes happen from front (low addresses) to back (high addresses). If
only a partial segment has been written, the segment number, the
current position within it and optionally a write buffer are stored in
the journal.

Segments are erased as a whole. Therefore garbage collection may be
required to move any remaining valid data out of a segment before it
can be erased.

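The per-open-segment state described above can be pictured like this.
Field and function names are illustrative assumptions, not the on-disk
logfs format.

```c
#include <assert.h>
#include <stdint.h>

/* State the journal has to remember for a partially written segment:
 * which segment is open, how far it has been written, and an optional
 * buffer of not-yet-flushed data. */
struct open_segment {
	uint32_t segno;		/* segment number */
	uint32_t write_ofs;	/* next write position within the segment */
	uint8_t *wbuf;		/* optional write buffer, or NULL */
	uint32_t wbuf_len;	/* valid bytes in wbuf */
};

/* Bytes still writable before the segment is full; writes always
 * advance from front to back. */
static uint32_t seg_remaining(const struct open_segment *s, uint32_t segsize)
{
	return segsize - s->write_ofs;
}
```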
Journal
-------

The journal contains all global information about the filesystem that
is subject to frequent change. At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.

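The mount-time scan amounts to picking the commit with the highest
sequence number. Types and names below are illustrative sketches, not
the logfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A commit entry as seen during the scan; the real entry is followed
 * by a list of pointers to all currently valid journal entries. */
struct commit_entry {
	uint64_t seqno;		/* monotonically increasing commit number */
};

/* Return the most recent commit among the entries found, or NULL if
 * the journal contained none. */
static const struct commit_entry *
most_recent_commit(const struct commit_entry *e, size_t n)
{
	const struct commit_entry *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!best || e[i].seqno > best->seqno)
			best = &e[i];
	return best;
}
```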
Object Store
------------

All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.

Levels
------

Garbage collection (GC) may fail if all data is written
indiscriminately. One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect
blocks, respectively. Inode file data is on level 6 for the inodes
and 7-11 for indirect blocks.

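The level rule can be sketched as a small function: file data is level
0, an Nx indirect block is level N, and objects belonging to the inode
file are shifted up by 6. Names and the offset constant are
illustrative, not the logfs API.

```c
#include <assert.h>

#define INODE_FILE_LEVEL_OFFSET 6	/* assumed constant */

/* Level of an object given its indirection depth (0 = data block,
 * N = Nx indirect block) and whether it belongs to the inode file. */
static int object_level(int indirection, int is_inode_file)
{
	return indirection + (is_inode_file ? INODE_FILE_LEVEL_OFFSET : 0);
}
```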
Each segment contains objects of a single level only. As a result,
each level requires its own separate segment to be open for writing.

Inode File
----------

All inodes are stored in a special file, the inode file. The single
exception is the inode file's own inode (master inode), which for
obvious reasons is stored in the journal instead. Instead of data
blocks, the leaf nodes of the inode file are inodes.

Aliases
-------

Writes in LogFS are done by means of a wandering tree. A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed. Such an implementation would not be very efficient.

In LogFS, the block pointer changes are cached in the journal by means
of alias entries. Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data. Any 8-byte word can be changed in this manner.

Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.

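An alias entry and its replay can be pictured as follows. Field names
are assumptions for illustration, not the on-disk logfs journal
format.

```c
#include <assert.h>
#include <stdint.h>

/* Logical address of one 8-byte word in the tree plus its new value. */
struct alias_entry {
	uint64_t ino;		/* inode number */
	uint64_t bix;		/* block index within the file */
	uint8_t level;		/* level of the block in the tree */
	uint16_t child;		/* word index within the block */
	uint64_t val;		/* the changed 8-byte word */
};

/* Replay an alias against an in-memory copy of the addressed block,
 * sparing a rewrite of the block and all its parents. */
static void apply_alias(uint64_t *block, const struct alias_entry *a)
{
	block[a->child] = a->val;
}
```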
Segment Aliases
---------------

Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple. GC will later empty this
segment and the alias can be removed again. This is used on MTD only.

Vim
---

By cleverly predicting the life time of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later. Each type of distinct life expectancy (vim)
can have a separate segment open for writing. Each (level, vim) tuple
can be open just once. If an open segment with unknown vim is
encountered at mount time, it is closed and ignored henceforth.

Indirect Tree
-------------

Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers. One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.

Another difference is the addressing of indirect blocks. In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode. In ext2 (maybe
others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 is using arithmetic to better utilize space, LogFS keeps
arithmetic simple and uses compression to save space.

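The height rule can be sketched as follows: 16 direct pointers cover
blocks 0-15, and because the first 16 slots of the first indirect
block stay empty, a 1x tree covers blocks 0-511, a 2x tree 512*512
blocks, and so on. 512 pointers per indirect block is an assumed
illustrative figure, not taken from the logfs on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal tree height needed to address the given last block index. */
static int needed_height(uint64_t last_bix)
{
	uint64_t covered = 16;	/* direct pointers in the inode */
	int height = 0;

	while (last_bix >= covered) {
		covered = height ? covered * 512 : 512;
		height++;
	}
	return height;
}
```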
Compression
-----------

Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing
so has no effect on existing data, but new data will be stored
accordingly. New inodes will inherit the compression flag of the
parent directory.

Metadata is always compressed. However, the space accounting ignores
this and charges for the uncompressed size. Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously. Because GC may not consume any extra
space even on a 100% full medium, the space gained by compression is
lost to the user.

However, it is not lost to the filesystem internals. By cheating the
user out of those bytes, the filesystem gains some slack space and GC
will run less often and faster.

Garbage Collection and Wear Leveling
------------------------------------

Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based
on the least amount of valid data contained in the segment. All
remaining valid data is copied elsewhere, thereby invalidating it.

The GC code also checks for aliases and writes them back if their
number gets too large.

Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection. If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked. Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.

Values for "occasionally" and "significantly lower" are compile-time
constants.

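Candidate selection can be sketched like this: normally the segment
with the least valid data wins, but when a wear-leveling pass is due,
the segment with the lowest erase count is taken instead. All names
here are illustrative, not the logfs GC code.

```c
#include <assert.h>
#include <stdint.h>

struct seg_info {
	uint32_t valid_bytes;	/* valid data remaining in the segment */
	uint32_t erase_count;	/* times the segment has been erased */
};

/* Index of the GC candidate among n known segments. */
static int pick_candidate(const struct seg_info *s, int n, int wear_level)
{
	int best = 0, i;

	for (i = 1; i < n; i++) {
		if (wear_level ? s[i].erase_count < s[best].erase_count
			       : s[i].valid_bytes < s[best].valid_bytes)
			best = i;
	}
	return best;
}
```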
Hashed directories
------------------

To satisfy efficient lookup(), directory entries are hashed and
located based on the hash. In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used. For each table, the hash value
modulo the table size gives the table index.

Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively. So the first
table contains 16 entries, the second 512-16, etc.

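The lookup rule reduces to hash modulo table size, offset by where the
table starts in the directory file. With the sizes from the text,
table 0 spans entries 0-15 and table 1 the next 512-16 = 496 entries.
The helper below is a sketch for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Entry index of a name within one hash table of the chain. */
static uint64_t dir_table_index(uint64_t hash, uint64_t table_start,
				uint64_t table_size)
{
	return table_start + hash % table_size;
}
```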
The last table is special in several ways. First, its size depends on
the effective 32bit limit on telldir/seekdir cookies. Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31. Second, the table contains hash buckets with 16
entries each.

Using single-entry buckets would result in birthday "attacks". At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries. So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.

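The 2^16 estimate for single-entry buckets follows from the birthday
bound and is easy to check numerically. The sketch below iterates the
exact no-collision product with plain double arithmetic; the returned
count is approximate.

```c
#include <assert.h>

/* Smallest number of uniformly random insertions into the given
 * number of single-entry buckets for which a collision is more
 * likely than not (P >= 0.5). */
static long collision_threshold(double buckets)
{
	double p_empty = 1.0;	/* P(no collision so far) */
	long n = 0;

	while (1.0 - p_empty < 0.5) {
		p_empty *= 1.0 - (double)n / buckets;
		n++;
	}
	return n;
}
```

For 2^31 buckets this comes out somewhat below 2^16 entries,
consistent with the claim above.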
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access
to a directory.

Open For Discussion
===================

Device Address Space
--------------------

A device address space is used for caching. Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.

Meta Inodes
-----------

Inodes are stored in the inode file, which is just a regular file for
most purposes. At umount time, however, the inode file needs to
remain open until all dirty inodes are written. So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either. The
same goes for the mapping inode of the device address space.

Currently logfs uses a hack that essentially copies part of fs/inode.c
code over. A general solution would be preferred.

Indirect block mapping
----------------------

With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks. Some other place is required. Currently
logfs uses the top half of each inode's address space. The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.

One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB. But files >16TB would cause problems anyway, so
only the limit has changed.