Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 1 | |
The Second Extended Filesystem
==============================

ext2 was originally released in January 1993. Written by Rémy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. As of April 2001 it is still the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.

Options
=======

Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).

bsddf                   (*) Makes `df' act like BSD.
minixdf                     Makes `df' act like Minix.

check                       Check block and inode bitmaps at mount time
                            (requires CONFIG_EXT2_CHECK).
check=none, nocheck     (*) Don't do extra checking of bitmaps on mount
                            (check=normal and check=strict options removed)

debug                       Extra debugging information is sent to the
                            kernel syslog. Useful for developers.

errors=continue             Keep going on a filesystem error.
errors=remount-ro           Remount the filesystem read-only on an error.
errors=panic                Panic and halt the machine if an error occurs.

grpid, bsdgroups            Give objects the same group ID as their parent.
nogrpid, sysvgroups         New objects have the group ID of their creator.

nouid32                     Use 16-bit UIDs and GIDs.

oldalloc                    Enable the old block allocator. Orlov should
                            have better performance; we'd like to get some
                            feedback if it's the contrary for you.
orlov                   (*) Use the Orlov block allocator.
                            (See http://lwn.net/Articles/14633/ and
                            http://lwn.net/Articles/14446/.)

resuid=n                    The user ID which may use the reserved blocks.
resgid=n                    The group ID which may use the reserved blocks.

sb=n                        Use alternate superblock at this location.

user_xattr                  Enable "user." POSIX Extended Attributes
                            (requires CONFIG_EXT2_FS_XATTR).
                            See also http://acl.bestbits.at
nouser_xattr                Don't support "user." extended attributes.

acl                         Enable POSIX Access Control Lists support
                            (requires CONFIG_EXT2_FS_POSIX_ACL).
                            See also http://acl.bestbits.at
noacl                       Don't support POSIX ACLs.

nobh                        Do not attach buffer_heads to file pagecache.

xip                         Use execute in place (no caching) if possible.

grpquota,noquota,quota,usrquota  Quota options are silently ignored by ext2.


Specification
=============

ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression, though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journalling) to be added in a maximally compatible
manner.

Blocks
------

The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.

Block Groups
------------

Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block and uses one
bit per block, a block group can hold at most 8 times as many blocks as
there are bytes in a block.
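
The bitmap limit can be worked through numerically. The following is a
minimal sketch, assuming one bitmap bit per block; the function name is
made up for illustration and is not part of any ext2 tool.

```python
def block_group_limits(block_size):
    """Return (max blocks per group, max group size in bytes)."""
    # The block bitmap fills exactly one block and tracks one block per bit,
    # so it can describe 8 * block_size blocks.
    max_blocks = 8 * block_size
    return max_blocks, max_blocks * block_size


for size in (1024, 2048, 4096):
    blocks, nbytes = block_group_limits(size)
    print(f"{size}-byte blocks: {blocks} blocks per group, {nbytes // 2**20} MiB")
```

With 1024-byte blocks this gives 8192 blocks (8 MiB) per group, and with
4096-byte blocks 32768 blocks (128 MiB).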

The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.

The Superblock
--------------

The superblock contains all the information about the configuration of
the filesystem. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
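
The group selection rule can be sketched as follows. This is an
illustration of the rule as stated above, not the kernel's code; both
function names are hypothetical.

```python
def is_power_of(n, base):
    """True if n == base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def has_superblock_copy(group, sparse_super=True):
    """True if block group `group` carries a copy of the superblock."""
    if not sparse_super:
        return True                 # revision 0 layout: a copy in every group
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


print([g for g in range(100) if has_superblock_copy(g)])
```

For the first 100 groups the rule selects groups 0, 1, 3, 5, 7, 9, 25,
27, 49 and 81.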

The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.

If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.

All fields in the superblock (as in all other ext2 structures) are stored
on the disk in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.
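
As a rough illustration of the fixed offset and the little-endian
encoding, the sketch below decodes a few superblock fields from a raw
image held in memory. The 1024-byte offset comes from the text; the field
offsets (24, 56, 76) and the magic value 0xEF53 are assumptions taken
from the kernel's struct ext2_super_block and should be checked against
the headers before being relied on. The function name is hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # from the text: primary copy starts here
EXT2_MAGIC = 0xEF53        # assumption: magic number from the kernel headers


def parse_superblock(image):
    """Decode a few superblock fields from a filesystem image (bytes-like)."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    # "<" selects little-endian decoding regardless of the host byte order.
    inodes, blocks, _reserved, free_blocks, free_inodes = struct.unpack_from("<5I", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)   # assumed offset
    (magic,) = struct.unpack_from("<H", sb, 56)            # assumed offset
    (revision,) = struct.unpack_from("<I", sb, 76)         # assumed offset
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "block_size": 1024 << log_block_size,
        "revision": revision,
    }
```

A caller could read the first 2048 bytes of a device or image file and
pass them in; because the decoding is explicit about byte order, the
result is the same on big-endian and little-endian hosts.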

Inodes
------

The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).

There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields.
The HURD also has a larger mode field so it uses another of the remaining
fields to store the extra mode bits.

There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).
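
The pointer scheme determines how many blocks one inode can address. The
sketch below adds up the four levels, assuming 4-byte (32-bit) block
pointers; that pointer size and the function names are assumptions for
illustration.

```python
POINTER_SIZE = 4        # assumption: block pointers are 32 bits wide
DIRECT_POINTERS = 12    # from the text


def addressable_blocks(block_size):
    """Blocks reachable through direct, indirect, double and triple pointers."""
    per_block = block_size // POINTER_SIZE   # pointers held by one indirect block
    return DIRECT_POINTERS + per_block + per_block**2 + per_block**3


def addressable_bytes(block_size):
    return addressable_blocks(block_size) * block_size


for size in (1024, 2048, 4096):
    print(f"{size}-byte blocks: about {addressable_bytes(size) // 2**30} GiB addressable")
```

For 1 kB and 2 kB blocks this yields roughly 16 GiB and 256 GiB, which
agrees with the file size limits listed in the Limitations section. For
4 kB blocks the pointers alone would address more than 2048GB, which
suggests that the 2048GB entries there come from a different constraint.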

The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
dumpable, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.

Directories
-----------

A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).

The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.

The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.
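
A linear scan over such a list might look like the sketch below. The
record layout used here (32-bit inode number, 16-bit record length, 8-bit
name length, 8-bit file type, then the name) is an assumption based on
the kernel's struct ext2_dir_entry_2, as is the convention that inode 0
marks an unused record. The function names are hypothetical.

```python
import struct


def iter_dir_entries(block):
    """Yield (inode, file_type, name) for each record in use in one directory block."""
    offset = 0
    while offset + 8 <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, offset)
        if rec_len < 8:
            break                   # malformed record: stop rather than loop forever
        if inode != 0:              # assumption: inode 0 marks an unused record
            name = bytes(block[offset + 8:offset + 8 + name_len])
            yield inode, file_type, name
        offset += rec_len           # rec_len plays the role of the "next" link


def lookup(block, wanted):
    """Return the inode number for `wanted` (bytes), or None if it is absent."""
    for inode, _file_type, name in iter_dir_entries(block):
        if name == wanted:
            return inode
    return None
```

Every lookup walks the records in order, so the cost grows with the
number of entries; this is the behaviour the hashed directory
enhancement is meant to avoid.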

The current implementation never removes empty directory blocks once they
have been allocated to hold more files.

Special files
-------------

Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. It uses the fields
which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as it avoids allocating a full
block for the symlink, and most symlinks are less than 60 characters long.

Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.

Reserved Space
--------------

In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow for the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.

Filesystem check
----------------

At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.
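
The three conditions can be written as a small decision function. This
is a sketch of the rule as described above, not e2fsck's code; the
parameter names are invented, and treating a limit of zero or less as
"disabled" is an assumption.

```python
def needs_check(clean, mount_count, max_mount_count, last_check, check_interval, now):
    """Decide whether a full check should run; times are in seconds."""
    if not clean:
        return True                                   # not cleanly unmounted
    if max_mount_count > 0 and mount_count > max_mount_count:
        return True                                   # mount count exceeded
    if check_interval > 0 and now - last_check > check_interval:
        return True                                   # too long since last check
    return False
```

For example, a cleanly unmounted filesystem mounted 21 times with a
maximum of 20 would be checked, while the same filesystem after 5 mounts
would not, provided the time limit has not been exceeded either.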

Feature Compatibility
---------------------

The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.

These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would lead to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.

e2fsck needs to be stricter in its handling of these flags than the
kernel. If it doesn't understand ANY of the COMPAT, RO_COMPAT, or
INCOMPAT flags it will refuse to check the filesystem, because it has
no way of verifying whether a given feature is valid or not. Allowing
e2fsck to succeed on a filesystem with an unknown feature would give
the user a false sense of security. Refusing to check a filesystem
with unknown features is a good incentive for the user to update to
the latest e2fsck. This also means that anyone adding feature flags
to ext2 also needs to update e2fsck to verify these features.
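
The rules for the kernel and for e2fsck can be summarised in a few lines
of bit arithmetic. This is a sketch of the policy described above, not
the kernel's implementation; the function names and the idea of passing
"known" masks as arguments are assumptions for illustration.

```python
def kernel_mount_mode(incompat, ro_compat, known_incompat, known_ro_compat):
    """Return 'refuse', 'read-only' or 'read-write' for a given kernel."""
    if incompat & ~known_incompat:
        return "refuse"             # unknown INCOMPAT bit: cannot even read safely
    if ro_compat & ~known_ro_compat:
        return "read-only"          # unknown RO_COMPAT bit: reading is safe, writing is not
    return "read-write"             # unknown COMPAT bits never restrict the kernel


def e2fsck_may_check(compat, ro_compat, incompat,
                     known_compat, known_ro_compat, known_incompat):
    """e2fsck refuses if ANY feature bit in any of the three fields is unknown."""
    unknown = (compat & ~known_compat) | (ro_compat & ~known_ro_compat) \
        | (incompat & ~known_incompat)
    return unknown == 0
```

Note the asymmetry: an unknown COMPAT bit leaves the kernel free to mount
read-write, but is enough to make e2fsck refuse.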

Metadata
--------

It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.

If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:

per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)

The first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.
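
The first of these methods might look like the following sketch, which
uses the O_SYNC flag through Python's os module (available on Unix-like
systems). The helper name is hypothetical.

```python
import os


def write_synchronously(path, data):
    """Write `data` (bytes) to `path` with O_SYNC set on the descriptor."""
    # With O_SYNC each write() returns only after the kernel reports that
    # the data and the metadata needed to retrieve it have been written out.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
    try:
        written = 0
        while written < len(data):
            written += os.write(fd, data[written:])
    finally:
        os.close(fd)
```

Synchronous writes are considerably slower than the default behaviour,
so this is best reserved for files where the extra safety matters.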

Limitations
-----------

There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.

Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).

Filesystem block size:     1kB        2kB        4kB        8kB

File size limit:          16GB      256GB     2048GB     2048GB
Filesystem size limit:  2047GB     8192GB    16384GB    32768GB

There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).

There is an upper limit of 32768 subdirectories in a single directory.

There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).

The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size; the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8-character filenames; even then we are fairly close to
running out of unique filenames.

Journaling
----------

A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.

When changes are made to the filesystem (e.g. a file is renamed), they
are stored in a transaction in the journal and can either be complete or
incomplete at the time of a crash. If a transaction is complete at the
time of a crash (or in the normal case where the system does not crash),
then any blocks in that transaction are guaranteed to represent a valid
filesystem state, and are copied into the filesystem. If a transaction
is incomplete at the time of the crash, then there is no guarantee of
consistency for the blocks in that transaction so they are discarded
(which means any filesystem changes they represent are also lost).
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.
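
The complete/incomplete distinction can be modelled with a toy replay
routine. This is an in-memory illustration only and does not reflect the
real ext3 journal format; the dictionary layout and the "committed" key
are invented for the example.

```python
def replay(filesystem, journal):
    """Copy blocks from complete transactions into `filesystem`.

    `filesystem` maps block numbers to block contents; `journal` is a list
    of transactions, each a dict with a "committed" flag and a "blocks" map.
    Returns the number of transactions applied.
    """
    applied = 0
    for txn in journal:
        if not txn["committed"]:
            continue                    # incomplete: its blocks are discarded
        filesystem.update(txn["blocks"])
        applied += 1
    return applied


fs = {1: b"old"}
journal = [
    {"committed": True, "blocks": {1: b"new", 2: b"extra"}},
    {"committed": False, "blocks": {1: b"torn"}},
]
replay(fs, journal)
print(fs)
```

After the replay, block 1 holds the contents from the complete
transaction and the blocks of the incomplete one never reach the
filesystem.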

References
==========

The kernel source         file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck)        http://e2fsprogs.sourceforge.net/
Design & Implementation   http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3)         ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Hashed Directories        http://kernelnewbies.org/~phillips/htree/
Filesystem Resizing       http://ext2resize.sourceforge.net/
Compression (*)           http://www.netspace.net.au/~reiter/e2compr/

Implementations for:
Windows 95/98/NT/2000     http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
Windows 95 (*)            http://www.yipton.demon.co.uk/content.html#FSDEXT2
DOS client (*)            ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2                      http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
RISC OS client            ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/

(*) no longer actively developed/supported (as of Apr 2001)