SQUASHFS 4.0 FILESYSTEM
=======================

Squashfs is a compressed read-only filesystem for Linux.
It uses zlib/lzo/xz compression to compress files, inodes and directories.
Inodes in the system are very small and all blocks are packed to minimise
data overhead. Block sizes greater than 4K are supported up to a maximum
of 1Mbytes (default block size 128K).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org

1. FILESYSTEM FEATURES
----------------------

Squashfs filesystem features versus Cramfs:

                                Squashfs        Cramfs

Max filesystem size:            2^64            256 MiB
Max file size:                  ~ 2 TiB         16 MiB
Max files:                      unlimited       unlimited
Max directories:                unlimited       unlimited
Max entries per directory:      unlimited       unlimited
Max block size:                 1 MiB           4 KiB
Metadata compression:           yes             no
Directory indexes:              yes             no
Sparse file support:            yes             no
Tail-end packing (fragments):   yes             no
Exportable (NFS etc.):          yes             no
Hard link support:              yes             no
"." and ".." in readdir:        yes             no
Real inode numbers:             yes             no
32-bit uids/gids:               yes             no
File creation time:             yes             no
Xattr support:                  yes             no
ACL support:                    no              no

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies
with file type, i.e. regular file, directory, symbolic link, and block/char
device inodes have different sizes).

2. USING SQUASHFS
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities,
along with usage instructions, can be obtained from http://www.squashfs.org.

The squashfs-tools development tree is now located on kernel.org:
        git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. SQUASHFS FILESYSTEM DESIGN
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment:

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |     table     |
        |---------------|
        |    export     |
        |     table     |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression specific options (e.g.
dictionary size). If non-default compression options have been used, then
they are stored in this part of the filesystem.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each
compressed block is prefixed by a two byte length; the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.
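
The two byte prefix described above can be decoded as in the following
minimal Python sketch. It assumes the prefix is stored little-endian and
that the remaining 15 bits hold the length of the data that follows; the
function and constant names are made up for illustration and are not taken
from the Squashfs sources.

```python
import struct

UNCOMPRESSED_FLAG = 0x8000  # top bit of the two byte length


def parse_metadata_header(raw):
    """Return (data_length, is_compressed) for a metadata block prefix."""
    # Little-endian byte order is an assumption of this sketch.
    (word,) = struct.unpack("<H", raw[:2])
    length = word & (UNCOMPRESSED_FLAG - 1)      # low 15 bits
    is_compressed = not (word & UNCOMPRESSED_FLAG)
    return length, is_compressed


# A 100 byte block that was stored uncompressed (top bit set).
header = struct.pack("<H", UNCOMPRESSED_FLAG | 100)
print(parse_metadata_header(header))  # (100, False)
```

A reader would use the returned length to fetch the block body, and only
run the decompressor when the second value is true.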

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode may span two compressed blocks. Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).
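
As an illustration of the <block, offset> encoding, the sketch below packs
and unpacks a 48-bit inode number in Python. The split used here (upper 32
bits for the block location, lower 16 bits for the offset) is an assumption
of the sketch, and the function names are hypothetical.

```python
OFFSET_BITS = 16  # assumed: low 16 bits hold the offset within the block
BLOCK_BITS = 32   # assumed: remaining 32 bits hold the block location


def make_inode_ref(block, offset):
    """Pack <block, offset> into a single 48-bit number."""
    if not 0 <= block < 1 << BLOCK_BITS:
        raise ValueError("block location out of range")
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset out of range")
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref):
    """Unpack a 48-bit inode number into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)


ref = make_inode_ref(0x1234, 0x56)
print(hex(ref))              # 0x12340056
print(split_inode_ref(ref))  # (4660, 86)
```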

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised in a two level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares the start block held in the header. A new
directory header is written whenever the inode start block changes. The
directory header/directory entry list is repeated as many times as necessary.
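
The two level list can be pictured with the short Python sketch below. It
only models the grouping rule (start a new header when the inode start
block changes); the tuple layout and the function name are assumptions made
for illustration, and any other limits of the real on-disk format are not
modelled.

```python
from itertools import groupby


def build_directory(entries):
    """Group sorted directory entries into header/entry runs.

    entries: list of (name, inode_block, inode_offset), sorted by name.
    Returns a list of (start_block, [(inode_offset, name), ...]) runs,
    one run per directory header.
    """
    runs = []
    # groupby() only merges *consecutive* items with the same key, which
    # matches "a new header is written when the start block changes".
    for start_block, group in groupby(entries, key=lambda e: e[1]):
        runs.append((start_block, [(off, name) for name, _, off in group]))
    return runs


entries = [("a", 0, 0), ("b", 0, 32), ("c", 96, 0), ("d", 0, 64)]
for start_block, run in build_directory(entries):
    print(start_block, run)
```

In this example "a" and "b" share one header, while "c" and "d" each need a
new one because the start block changes before them.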

Directories are sorted in alphabetical order, and can contain a directory
index to speed up file lookup. Directory indexes store one entry per
metablock, each entry storing the index/filename mapping to the first
directory header in that metadata block. At lookup the index is scanned
linearly looking for the first filename alphabetically larger than the
filename being looked up; the entry before it gives the location of the
metadata block the filename is in.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.
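
The linear index scan can be sketched in Python as follows. The index is
modelled as a list of (first_filename, block_location) pairs, and
dir_start stands for the location where the directory itself begins; both
are simplifications assumed for this example, as is the function name.

```python
def locate_metadata_block(dir_start, index, name):
    """Return the location of the metadata block that may hold `name`.

    index: list of (first_filename, block_location), sorted by filename.
    """
    location = dir_start
    for first_filename, block_location in index:
        if first_filename > name:
            break                  # first entry larger than the target
        location = block_location  # still a candidate, keep scanning
    return location


index = [("m", 8192), ("t", 16384)]
for name in ("apple", "m", "pear", "zebra"):
    print(name, locate_metadata_block(0, index, name))
```

Only the block returned here then needs to be decompressed and searched,
whatever the total length of the directory.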

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.
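
Because the blocks are contiguous and only their compressed sizes are
stored, the on-disk position of block n has to be derived by adding up the
sizes of the blocks before it. The Python sketch below shows that
calculation. It assumes the block list is a plain list of sizes and that
the position of the first block is known from the inode; the parameter
names are hypothetical.

```python
def block_location(first_block_start, compressed_sizes, n):
    """Return the on-disk start of datablock n of a file.

    first_block_start: position of the file's first datablock (assumed
                       to be available from the inode).
    compressed_sizes:  the block list, one compressed size per datablock.
    """
    if not 0 <= n < len(compressed_sizes):
        raise IndexError("block index out of range")
    # Blocks are contiguous, so skip over every earlier block.
    return first_block_start + sum(compressed_sizes[:n])


sizes = [1000, 2000, 500]
print(block_location(4096, sizes, 0))  # 4096
print(block_location(4096, sizes, 2))  # 7096
```

The cost of this walk grows with the block index, which is why a cache of
previously computed positions helps for large files.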

To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
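
The figures quoted above are consistent with each other, as the following
Python arithmetic shows. The per-slot memory figure assumes the 16 KiB is
divided evenly between the 8 slots, which the text does not state.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

block_size = 128 * KIB
slots = 8
bytes_per_slot = 224 * GIB

# Number of 128 KiB datablocks whose positions one slot can cover.
blocks_per_slot = bytes_per_slot // block_size
# Largest file covered when all 8 slots are used for a single file.
max_file_tib = slots * bytes_per_slot / TIB
# Memory per slot, assuming an even split of the 16 KiB default.
memory_per_slot = 16 * KIB // slots

print(blocks_per_slot)   # 1835008
print(max_file_tib)      # 1.75
print(memory_per_slot)   # 2048
```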

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.
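
The two level lookup can be sketched as below. The sketch assumes
fixed-size table entries packed back to back into 8 Kbyte (uncompressed)
metadata blocks, with the in-memory index table holding one location per
metadata block. The entry size of 16 bytes used in the example, and all
names, are assumptions made for illustration.

```python
METADATA_BLOCK_SIZE = 8192  # uncompressed size of a metadata block


def locate_entry(index_table, entry_index, entry_size):
    """Map a table index to (metadata_block_location, byte_offset).

    index_table: in-memory list of metadata block locations, read once
                 at mount time.
    """
    entries_per_block = METADATA_BLOCK_SIZE // entry_size
    block_no, slot = divmod(entry_index, entries_per_block)
    return index_table[block_no], slot * entry_size


index_table = [1000, 5200, 9100]           # hypothetical disk locations
print(locate_entry(index_table, 0, 16))    # (1000, 0)
print(locate_entry(index_table, 513, 16))  # (5200, 16)
```

Only one metadata block has to be read and decompressed per lookup, since
the index table itself is already in memory.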

3.6 Uid/gid lookup table
------------------------

For space efficiency inodes store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored). This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.
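
A type field of this kind can be decoded as in the Python sketch below.
The bit layout (prefix in the low byte, a single flag bit for out of line
values) and the prefix numbering are assumptions of the sketch rather than
something stated in this document, and the names are hypothetical.

```python
PREFIX_MASK = 0xFF         # assumed: low byte selects the prefix
VALUE_OUT_OF_LINE = 0x100  # assumed: flag bit for out of line values
PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}  # assumed numbering


def decode_xattr_type(type_field):
    """Return (prefix, value_is_out_of_line) for an xattr type field."""
    prefix = PREFIXES[type_field & PREFIX_MASK]
    out_of_line = bool(type_field & VALUE_OUT_OF_LINE)
    return prefix, out_of_line


print(decode_xattr_type(0x000))  # ('user.', False)
print(decode_xattr_type(0x101))  # ('trusted.', True)
```

When the second value is true, the value field would be treated as a
reference to the stored value rather than as the value itself.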

The xattr lists are packed into compressed 8K metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOS AND OUTSTANDING ISSUES
-------------------------------

4.1 Todo list
-------------

Implement ACL support.

4.2 Squashfs internal cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data Squashfs uses two small caches, one for metadata and
one for fragments.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way. The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.
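
The benefit of such a cache can be shown with a small Python sketch. It
uses a plain least-recently-used policy and zlib purely as stand-ins; this
is not a description of the replacement policy or data structures used by
the kernel code, and the class name and default capacity are invented for
the example.

```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Small cache of decompressed blocks, keyed by disk location."""

    def __init__(self, read_block, capacity=8):
        self._read_block = read_block  # callable: location -> compressed bytes
        self._capacity = capacity      # hypothetical default
        self._blocks = OrderedDict()
        self.decompressions = 0

    def get(self, location):
        if location in self._blocks:
            self._blocks.move_to_end(location)  # mark as recently used
            return self._blocks[location]
        data = zlib.decompress(self._read_block(location))
        self.decompressions += 1
        self._blocks[location] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict least recently used
        return data


# Two pieces of metadata packed into the same block: one decompression.
disk = {0: zlib.compress(b"inode data " * 100)}
cache = BlockCache(disk.__getitem__)
first = cache.get(0)[:10]
second = cache.get(0)[11:21]
print(cache.decompressions)  # 1
```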

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.
259associated race conditions.