.. SPDX-License-Identifier: GPL-2.0
.. Blame: Darrick J. Wong, c09f3ba, 2018-07-29 15:38:00 -0400

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. On an SSD there are of course no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.
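
As a rough illustration of why deferring placement can help, the
following Python sketch models a toy allocator; it is not ext4's actual
allocator code. Two files are appended to alternately. With immediate
allocation each write takes the next free block, so the files
interleave. With delayed allocation the dirty blocks are only counted
until flush time, and each file then receives one contiguous run. The
names ``allocate_immediately``, ``allocate_delayed`` and
``count_extents`` are hypothetical.

.. code-block:: python

   from itertools import groupby

   def allocate_immediately(writes):
       """Assign the next free block to each write as it arrives."""
       layout = {}
       next_free = 0
       for name, nblocks in writes:
           blocks = layout.setdefault(name, [])
           blocks.extend(range(next_free, next_free + nblocks))
           next_free += nblocks
       return layout

   def allocate_delayed(writes):
       """Only count dirty blocks per file; place them all at flush time."""
       pending = {}
       for name, nblocks in writes:
           pending[name] = pending.get(name, 0) + nblocks
       layout = {}
       next_free = 0
       for name, nblocks in pending.items():  # the simulated flush
           layout[name] = list(range(next_free, next_free + nblocks))
           next_free += nblocks
       return layout

   def count_extents(blocks):
       """Count runs of consecutive block numbers."""
       # Blocks in the same run share the same (block - index) value.
       runs = groupby(enumerate(blocks), key=lambda pair: pair[1] - pair[0])
       return sum(1 for _ in runs)

   # Two files, "a" and "b", written one block at a time in alternation.
   writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1)]
   eager = allocate_immediately(writes)
   lazy = allocate_delayed(writes)
   print({name: count_extents(blocks) for name, blocks in eager.items()})
   print({name: count_extents(blocks) for name, blocks in lazy.items()})

In this model the immediate allocator leaves each file in three
single-block extents, while the delayed allocator yields one extent per
file.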

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related; therefore it
is useful to try to keep them all together.
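
One way to observe this placement from userspace is to map inode numbers
to block groups. The sketch below assumes the usual ext4 relationship
``group = (inode_number - 1) / inodes_per_group``. The value 8192 for
``INODES_PER_GROUP`` is only an assumption; the real figure comes from
the superblock (for example the "Inodes per group" line printed by
``dumpe2fs -h``). The function names are hypothetical, and the result is
meaningless on a non-ext4 filesystem.

.. code-block:: python

   import os
   from collections import Counter

   INODES_PER_GROUP = 8192  # assumption; read the real value from the superblock

   def inode_group(ino, inodes_per_group=INODES_PER_GROUP):
       """Return the block group that holds inode number `ino`."""
       # Inode numbers start at 1, hence the "- 1".
       return (ino - 1) // inodes_per_group

   def groups_in_directory(path, inodes_per_group=INODES_PER_GROUP):
       """Tally which block groups hold the inodes of a directory's entries."""
       tally = Counter()
       with os.scandir(path) as entries:
           for entry in entries:
               tally[inode_group(entry.inode(), inodes_per_group)] += 1
       return tally

   print(inode_group(1), inode_group(8192), inode_group(8193))  # 0 0 1
   print(groups_in_directory("."))

If the policy described above is working, most entries of a directory
should fall into one group, or a small number of neighbouring groups.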

The fifth trick is that the disk volume is cut up into 128MiB block
groups; these mini-containers are used as outlined above to try to
maintain data locality. However, there is a deliberate quirk: when a
directory is created in the root directory, the inode allocator scans
the block groups and puts that directory into the least heavily loaded
block group that it can find. This encourages directories to spread out
over a disk; as the top-level directory/file blobs fill up one block
group, the allocators simply move on to the next block group. Allegedly
this scheme evens out the loading on the block groups, though the author
suspects that the directories which are so unlucky as to land towards
the end of a spinning drive get a raw deal performance-wise.
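
The group size quoted above follows from a block bitmap that occupies a
single block: one bit per block gives ``8 * block_size`` blocks per
group, which for 4KiB blocks works out to 128MiB. The sketch below shows
that arithmetic, plus a deliberately naive stand-in for the "least
heavily loaded" choice. The kernel's real heuristic weighs more than
free block counts, and every function name here is hypothetical.

.. code-block:: python

   def blocks_per_group(block_size):
       """Blocks tracked by one bitmap block: 8 bits per byte, 1 bit per block."""
       return 8 * block_size

   def group_size_bytes(block_size):
       """Size in bytes of one block group for the given block size."""
       return blocks_per_group(block_size) * block_size

   def block_to_group(block_nr, block_size=4096):
       """Map a block number to its block group."""
       # Simplification: assumes the first data block is block 0.
       return block_nr // blocks_per_group(block_size)

   def least_loaded_group(free_blocks):
       """Toy placement policy: pick the group with the most free blocks."""
       # Ties go to the lowest-numbered group.
       return max(range(len(free_blocks)), key=lambda g: free_blocks[g])

   print(group_size_bytes(4096) // 2**20)               # 128 (MiB)
   print(block_to_group(40000))                         # 1
   print(least_loaded_group([100, 30000, 512, 30000]))  # 1

In this model a new top-level directory always lands in the emptiest
group, so successive top-level directories spread across the volume as
earlier groups fill up.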

Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.
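
As a small how-to, the sketch below wraps ``e4defrag`` (shipped with
e2fsprogs) from Python. It relies on the ``-c`` option, which reports
fragmentation without moving any data. The wrapper function names and
the example path are hypothetical, and depending on the target the tool
may need elevated privileges.

.. code-block:: python

   import shutil
   import subprocess

   def e4defrag_command(target, check_only=True):
       """Build the e4defrag argument list for a file, directory or device."""
       cmd = ["e4defrag"]
       if check_only:
           cmd.append("-c")  # report fragmentation only; do not move data
       cmd.append(target)
       return cmd

   def run_e4defrag(target, check_only=True):
       """Run e4defrag if it is installed; return its output, else None."""
       if shutil.which("e4defrag") is None:
           return None
       result = subprocess.run(e4defrag_command(target, check_only),
                               capture_output=True, text=True, check=False)
       return result.stdout + result.stderr

   # Hypothetical path; this only prints the command that would be run.
   print(e4defrag_command("/home/user/big.log"))

Running the check-only form first is a cheap way to decide whether a
full defragmentation pass is worth the extra I/O.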