Release files - Squashfs4.1

4.1     19 SEPT 2010    Major filesystem and tools improvements

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
diff --git a/RELEASE-README b/RELEASE-README
index 6113537..b554b52 100644
--- a/RELEASE-README
+++ b/RELEASE-README
@@ -1,10 +1,10 @@
-	SQUASHFS 4.0 - A squashed read-only filesystem for Linux
+	SQUASHFS 4.1 - A squashed read-only filesystem for Linux
 
-	Copyright 2002-2009 Phillip Lougher <phillip@lougher.demon.co.uk>
+	Copyright 2002-2010 Phillip Lougher <phillip@lougher.demon.co.uk>
 
 	Released under the GPL licence (version 2 or later).
 
-Welcome to Squashfs version 4.0.  Please read the README-4.0 and CHANGES files
+Welcome to Squashfs version 4.1.  Please read the README-4.1 and CHANGES files
 for details of changes.
 
 Squashfs is a highly compressed read-only filesystem for Linux.
@@ -72,45 +72,38 @@
 As squashfs is a read-only filesystem, the mksquashfs program must be used to
 create populated squashfs filesystems.
 
-SYNTAX:./mksquashfs source1 source2 ...  dest [options] [-e list of exclude
+SYNTAX:mksquashfs source1 source2 ...  dest [options] [-e list of exclude
 dirs/files]
 
-Options are
--version		print version, licence and copyright message
--recover <name>		recover filesystem data using recovery file <name>
--no-recovery		don't generate a recovery file
--info			print files written to filesystem
--no-exports		don't make the filesystem exportable via NFS
--no-progress		don't display the progress bar
--no-sparse		don't detect sparse files
+Filesystem build options:
+-comp <comp>		select <comp> compression
+			Compressors available:
+				gzip (default)
+				lzma
+				lzo
 -b <block_size>		set data block to <block_size>.  Default 131072 bytes
--processors <number>	Use <number> processors.  By default will use number of
-			processors available
--read-queue <size>	Set input queue to <size> Mbytes.  Default 64 Mbytes
--write-queue <size>	Set output queue to <size> Mbytes.  Default 512 Mbytes
--fragment-queue <size>	Set fagment queue to <size> Mbytes.  Default 64 Mbytes
+-no-exports		don't make the filesystem exportable via NFS
+-no-sparse		don't detect sparse files
+-no-xattrs		don't store extended attributes
+-xattrs			store extended attributes (default)
 -noI			do not compress inode table
 -noD			do not compress data blocks
 -noF			do not compress fragment blocks
+-noX			do not compress extended attributes
 -no-fragments		do not use fragments
 -always-use-fragments	use fragment blocks for files larger than block size
 -no-duplicates		do not perform duplicate checking
--noappend		do not append to existing filesystem
--keep-as-directory	if one source directory is specified, create a root
-			directory containing that directory, rather than the
-			contents of the directory
--root-becomes <name>	when appending source files/directories, make the
-			original root become a subdirectory in the new root
-			called <name>, rather than adding the new source items
-			to the original root
 -all-root		make all files owned by root
 -force-uid uid		set all file uids to uid
 -force-gid gid		set all file gids to gid
 -nopad			do not pad filesystem to a multiple of 4K
--root-owned		alternative name for -all-root
--noInodeCompression	alternative name for -noI
--noDataCompression	alternative name for -noD
--noFragmentCompression	alternative name for -noF
+-keep-as-directory	if one source directory is specified, create a root
+			directory containing that directory, rather than the
+			contents of the directory
+
+Filesystem filter options:
+-p <pseudo-definition>	Add pseudo file definition
+-pf <pseudo-file>	Add list of pseudo file definitions
 -sort <sort_file>	sort files according to priorities in <sort_file>.  One
 			file or dir with priority per line.  Priority -32768 to
 			32767, default priority 0
@@ -119,8 +112,38 @@
 			exclude dirs/files
 -regex			Allow POSIX regular expressions to be used in exclude
 			dirs/files
--p <pseudo-definition>	Add pseudo file definition
--pf <pseudo-file>	Add list of pseudo file definitions
+
+Filesystem append options:
+-noappend		do not append to existing filesystem
+-root-becomes <name>	when appending source files/directories, make the
+			original root become a subdirectory in the new root
+			called <name>, rather than adding the new source items
+			to the original root
+
+Mksquashfs runtime options:
+-version		print version, licence and copyright message
+-recover <name>		recover filesystem data using recovery file <name>
+-no-recovery		don't generate a recovery file
+-info			print files written to filesystem
+-no-progress		don't display the progress bar
+-processors <number>	Use <number> processors.  By default will use number of
+			processors available
+-read-queue <size>	Set input queue to <size> Mbytes.  Default 64 Mbytes
+-write-queue <size>	Set output queue to <size> Mbytes.  Default 512 Mbytes
+-fragment-queue <size>	Set fragment queue to <size> Mbytes.  Default 64 Mbytes
+
+Miscellaneous options:
+-root-owned		alternative name for -all-root
+-noInodeCompression	alternative name for -noI
+-noDataCompression	alternative name for -noD
+-noFragmentCompression	alternative name for -noF
+-noXattrCompression	alternative name for -noX
+
+Compressors available:
+	gzip (default)
+	lzma
+	lzo
+
 
 Source1 source2 ... are the source directories/files containing the
 files/directories that will form the squashfs filesystem.  If a single
@@ -435,6 +458,8 @@
 	-v[ersion]		print version, licence and copyright information
 	-d[est] <pathname>	unsquash to <pathname>, default "squashfs-root"
 	-n[o-progress]		don't display the progress bar
+	-no[-xattrs]		don't extract xattrs in file system
+	-x[attrs]		extract xattrs in file system (default)
 	-p[rocessors] <number>	use <number> processors.  By default will use
 				number of processors available
 	-i[nfo]			print files as they are unsquashed
@@ -450,11 +475,17 @@
 	-da[ta-queue] <size>	Set data queue to <size> Mbytes.  Default 256
 				Mbytes
 	-fr[ag-queue] <size>	Set fagment queue to <size> Mbytes.  Default 256
-				Mbytes
+				 Mbytes
 	-r[egex]		treat extract names as POSIX regular expressions
 				rather than use the default shell wildcard
 				expansion (globbing)
 
+Decompressors available:
+	gzip
+	lzma
+	lzo
+
+
 To extract a subset of the filesystem, the filenames or directory
 trees that are to be extracted can be specified on the command line.  The
 files/directories should be specified using the full path to the
@@ -518,33 +549,40 @@
 5. FILESYSTEM LAYOUT
 --------------------
 
-Brief filesystem design notes follow for the original 1.x filesystem
-layout.  A description of the 2.x and 3.x filesystem layouts will be written
-sometime!
-
-A squashfs filesystem consists of five parts, packed together on a byte
-alignment:
+A squashfs filesystem consists of a maximum of eight parts, packed
+together on a byte alignment:
 
 	 ---------------
 	|  superblock 	|
 	|---------------|
-	|     data	|
-	|    blocks	|
+	|  datablocks   |
+	|  & fragments  |
 	|---------------|
-	|    inodes	|
+	|  inode table	|
 	|---------------|
-	|   directories	|
+	|   directory	|
+	|     table     |
+	|---------------|
+	|   fragment	|
+	|    table      |
+	|---------------|
+	|    export     |
+	|    table      |
 	|---------------|
 	|    uid/gid	|
 	|  lookup table	|
+	|---------------|
+	|     xattr     |
+	|     table	|
 	 ---------------
 
 Compressed data blocks are written to the filesystem as files are read from
 the source directory, and checked for duplicates.  Once all file data has been
-written the completed inode, directory and uid/gid lookup tables are written.
+written, the completed inode, directory, fragment, export, uid/gid lookup and
+xattr tables are written.
 
-5.1 Metadata
-------------
+5.1 Inodes
+----------
 
 Metadata (inodes and directories) are compressed in 8Kbyte blocks.  Each
 compressed block is prefixed by a two byte length, the top bit is set if the
@@ -552,98 +590,122 @@
 or if the compressed block was larger than the uncompressed block.
 
 Inodes are packed into the metadata blocks, and are not aligned to block
-boundaries, therefore inodes overlap compressed blocks.  An inode is
-identified by a two field tuple <start address of compressed block : offset
-into de-compressed block>.
+boundaries, therefore inodes overlap compressed blocks.  Inodes are identified
+by a 48-bit number which encodes the location of the compressed metadata block
+containing the inode, and the byte offset into that block where the inode is
+placed (<block, offset>).
 
-Inode contents vary depending on the file type.  The base inode consists of:
+To maximise compression there are different inodes for each file type
+(regular file, directory, device, etc.), the inode contents and length
+varying with the type.
 
-	base inode:
-		Inode type
-		Mode
-		uid index
-		gid index
-
-The inode type is 4 bits in size, and the mode is 12 bits.
-
-The uid and gid indexes are 4 bits in length.  Ordinarily, this will allow 16
-unique indexes into the uid table.  To minimise overhead, the uid index is
-used in conjunction with the spare bit in the file type to form a 48 entry
-index as follows:
-
-	inode type 1 - 5: uid index = uid
-	inode type 5 -10: uid index = 16 + uid
-	inode type 11 - 15: uid index = 32 + uid
-
-In this way 48 unique uids are supported using 4 bits, minimising data inode
-overhead.  The 4 bit gid index is used to index into a 15 entry gid table.
-Gid index 15 is used to indicate that the gid is the same as the uid.
-This prevents the 15 entry gid table filling up with the common case where
-the uid/gid is the same.
-
-The data contents of symbolic links are stored immediately after the symbolic
-link inode, inside the inode table.  This allows the normally small symbolic
-link to be compressed as part of the inode table, achieving much greater
-compression than if the symbolic link was compressed individually.
-
-Similarly, the block index for regular files is stored immediately after the
-regular file inode.  The block index is a list of block lengths (two bytes
-each), rather than block addresses, saving two bytes per block.  The block
-address for a given block is computed by the summation of the previous
-block lengths.  This takes advantage of the fact that the blocks making up a
-file are stored contiguously in the filesystem.  The top bit of each block
-length is set if the block is uncompressed, either because the -noD option is
-set, or if the compressed block was larger than the uncompressed block.
+To further maximise compression, two types of regular file inode and
+directory inode are defined: inodes optimised for frequently occurring
+regular files and directories, and extended types where extra
+information has to be stored.
 
 5.2 Directories
 ---------------
 
-Like inodes, directories are packed into the metadata blocks, and are not
-aligned on block boundaries, therefore directories can overlap compressed
-blocks.  A directory is, again, identified by a two field tuple
-<start address of compressed block containing directory start : offset
-into de-compressed block>.
+Like inodes, directories are packed into compressed metadata blocks, stored
+in a directory table.  Directories are accessed using the start address of
+the metablock containing the directory and the offset into the
+decompressed block (<block, offset>).
 
 Directories are organised in a slightly complex way, and are not simply
-a list of file names and inode tuples.  The organisation takes advantage of the
-observation that in most cases, the inodes of the files in the directory
-will be in the same compressed metadata block, and therefore, the
-inode tuples will have the same start block.
-
+a list of file names.  The organisation takes advantage of the
+fact that (in most cases) the inodes of the files will be in the same
+compressed metadata block, and therefore, can share the start block.
 Directories are therefore organised in a two level list, a directory
-header containing the shared start block value, and a sequence of
-directory entries, each of which share the shared start block.  A
-new directory header is written once/if the inode start block
-changes.  The directory header/directory entry list is repeated as many times
-as necessary.  The organisation is as follows:
+header containing the shared start block value, and a sequence of directory
+entries, each of which share the shared start block.  A new directory header
+is written once/if the inode start block changes.  The directory
+header/directory entry list is repeated as many times as necessary.
 
-	directory_header:
-		count (8 bits)
-		inode start block (24 bits)
-		
-		directory entry: * count
-			inode offset (13 bits)
-			inode type (3 bits)
-			filename size (8 bits)
-			filename
-			
-This organisation saves on average 3 bytes per filename.
+Directories are sorted, and can contain a directory index to speed up
+file lookup.  Directory indexes store one entry per metablock, each entry
+storing the index/filename mapping to the first directory header
+in each metadata block.  Directories are sorted in alphabetical order,
+and at lookup the index is scanned linearly looking for the first filename
+alphabetically larger than the filename being looked up.  At this point the
+location of the metadata block the filename is in has been found.
+The general idea of the index is to ensure only one metadata block needs to be
+decompressed to do a lookup irrespective of the length of the directory.
+This scheme has the advantage that it doesn't require extra memory overhead
+and doesn't require much extra storage on disk.
 
 5.3 File data
 -------------
 
-File data is compressed on a block by block basis and written to the
-filesystem.  The filesystem supports up to 32K blocks, which achieves
-greater compression ratios than the Linux 4K page size.
+Regular files consist of a sequence of contiguous compressed blocks, and/or a
+compressed fragment block (tail-end packed block).   The compressed size
+of each datablock is stored in a block list contained within the
+file inode.
 
-The disadvantage with using greater than 4K blocks (and the reason why
-most filesystems do not), is that the VFS reads data in 4K pages.
-The filesystem reads and decompresses a larger block containing that page
-(e.g. 32K).  However, only 4K can be returned to the VFS, resulting in a
-very inefficient filesystem, as 28K must be thrown away.   Squashfs,
-solves this problem by explicitly pushing the extra pages into the page
-cache.
+To speed up access to datablocks when reading 'large' files (256 Mbytes or
+larger), the code implements an index cache that caches the mapping from
+block index to datablock location on disk.
 
+The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
+retaining a simple and space-efficient block list on disk.  The cache
+is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
+Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
+The index cache is designed to be memory efficient, and by default uses
+16 KiB.
+
+5.4 Fragment lookup table
+-------------------------
+
+Regular files can contain a fragment index which is mapped to a fragment
+location on disk and compressed size using a fragment lookup table.  This
+fragment lookup table is itself stored compressed into metadata blocks.
+A second index table is used to locate these.  For speed of access (and
+because it is small), this second index table is read at mount time and
+cached in memory.
+
+5.5 Uid/gid lookup table
+------------------------
+
+For space efficiency regular files store uid and gid indexes, which are
+converted to 32-bit uids/gids using an id lookup table.  This table is
+stored compressed into metadata blocks.  A second index table is used to
+locate these.  For speed of access (and because it is small), this second
+index table is read at mount time and cached in memory.
+
+5.6 Export table
+----------------
+
+To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
+can optionally (disabled with the -no-exports Mksquashfs option) contain
+an inode number to inode disk location lookup table.  This is required to
+enable Squashfs to map inode numbers passed in filehandles to the inode
+location on disk, which is necessary when the export code reinstantiates
+expired/flushed inodes.
+
+This table is stored compressed into metadata blocks.  A second index table is
+used to locate these.  For speed of access (and because it is small), this
+second index table is read at mount time and cached in memory.
+
+5.7 Xattr table
+---------------
+
+The xattr table contains extended attributes for each inode.  The xattrs
+for each inode are stored in a list, each list entry containing a type,
+name and value field.  The type field encodes the xattr prefix
+("user.", "trusted." etc) and it also encodes how the name/value fields
+should be interpreted.  Currently the type indicates whether the value
+is stored inline (in which case the value field contains the xattr value),
+or if it is stored out of line (in which case the value field stores a
+reference to where the actual value is stored).  This allows large values
+to be stored out of line improving scanning and lookup performance and it
+also allows values to be de-duplicated, the value being stored once, and
+all other occurrences holding an out of line reference to that value.
+
+The xattr lists are packed into compressed 8K metadata blocks.
+To reduce overhead in inodes, rather than storing the on-disk
+location of the xattr list inside each inode, a 32-bit xattr id
+is stored.  This xattr id is mapped into the location of the xattr
+list using a second xattr id lookup table.
 
 6. AUTHOR INFO
 --------------