Overview of the Linux Virtual File System

Original author: Richard Gooch <rgooch@atnf.csiro.au>

Last updated on June 24, 2007.

Copyright (C) 1999 Richard Gooch
Copyright (C) 2005 Pekka Enberg

This file is released under the GPLv2.


Introduction
============

The Virtual File System (also known as the Virtual Filesystem Switch)
is the software layer in the kernel that provides the filesystem
interface to userspace programs. It also provides an abstraction
within the kernel which allows different filesystem implementations to
coexist.

VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
on are called from a process context. Filesystem locking is described
in the document Documentation/filesystems/Locking.


Directory Entry Cache (dcache)
------------------------------

The VFS implements the open(2), stat(2), chmod(2), and similar system
calls. The pathname argument that is passed to them is used by the VFS
to search through the directory entry cache (also known as the dentry
cache or dcache). This provides a very fast look-up mechanism to
translate a pathname (filename) into a specific dentry. Dentries live
in RAM and are never saved to disc: they exist only for performance.

The dentry cache is meant to be a view into your entire filespace. As
most computers cannot fit all dentries in the RAM at the same time,
some bits of the cache are missing. In order to resolve your pathname
into a dentry, the VFS may have to resort to creating dentries along
the way, and then loading the inode. This is done by looking up the
inode.

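The cache-then-lookup behaviour described above can be sketched in user
space. This is only a model under stated assumptions: struct model_dentry,
model_walk_component() and model_fs_lookup() are hypothetical names, a
linked list stands in for the kernel's hash table, and nothing is ever
evicted or freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* User-space model only: these are NOT the kernel's dentry or inode. */
struct model_inode { unsigned long ino; };

struct model_dentry {
        const struct model_dentry *parent;
        char name[32];
        struct model_inode *inode;
        struct model_dentry *next;   /* simple list standing in for the hash */
};

static struct model_dentry *cache_head;
static unsigned long fs_lookups;     /* how often the "filesystem" was asked */

/* Hypothetical filesystem lookup: invents an inode number from the name. */
static struct model_inode *model_fs_lookup(const char *name)
{
        struct model_inode *inode = malloc(sizeof(*inode));

        if (!inode)
                return NULL;
        inode->ino = 100 + strlen(name);
        fs_lookups++;
        return inode;
}

/* Resolve one path component: try the cache first, then the filesystem. */
static struct model_dentry *model_walk_component(const struct model_dentry *parent,
                                                 const char *name)
{
        struct model_dentry *d;

        for (d = cache_head; d; d = d->next)
                if (d->parent == parent && strcmp(d->name, name) == 0)
                        return d;    /* cache hit */

        d = calloc(1, sizeof(*d));   /* cache miss: create a dentry */
        if (!d)
                return NULL;
        d->parent = parent;
        strncpy(d->name, name, sizeof(d->name) - 1);
        d->inode = model_fs_lookup(name);
        d->next = cache_head;
        cache_head = d;
        return d;
}

int model_dcache_demo(void)
{
        unsigned long before = fs_lookups;
        struct model_dentry *etc = model_walk_component(NULL, "etc");
        struct model_dentry *passwd = model_walk_component(etc, "passwd");
        struct model_dentry *again = model_walk_component(etc, "passwd");

        assert(etc != NULL && passwd != NULL);
        assert(again == passwd);            /* served from the cache */
        assert(fs_lookups == before + 2);   /* filesystem asked only twice */
        return 0;
}
```

Calling model_dcache_demo() walks "etc" and then "passwd" twice; only the
first walk of each component reaches the model filesystem.
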

The Inode Object
----------------

An individual dentry usually has a pointer to an inode. Inodes are
filesystem objects such as regular files, directories, FIFOs and other
beasts. They live either on the disc (for block device filesystems)
or in the memory (for pseudo filesystems). Inodes that live on the
disc are copied into the memory when required and changes to the inode
are written back to disc. A single inode can be pointed to by multiple
dentries (hard links, for example, do this).

Looking up an inode requires that the VFS call the lookup() method of
the parent directory inode. This method is installed by the specific
filesystem implementation that the inode lives in. Once the VFS has
the required dentry (and hence the inode), we can do all those boring
things like open(2) the file, or stat(2) it to peek at the inode
data. The stat(2) operation is fairly simple: once the VFS has the
dentry, it peeks at the inode data and passes some of it back to
userspace.


The File Object
---------------

Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do its work. You
can see that this is another switch performed by the VFS. The file
structure is placed into the file descriptor table for the process.

Reading, writing and closing files (and other assorted VFS operations)
are done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method to
do whatever is required. For as long as the file is open, it keeps the
dentry in use, which in turn means that the VFS inode is still in use.

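The "switch" performed at open time can be illustrated with a small
user-space model. All names here (struct model_file, model_vfs_open(),
the ramfs_like_* functions) are hypothetical; the model also points the
file directly at the inode, whereas the kernel goes through the dentry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space model only; not the kernel's struct file or struct inode. */
struct model_file;

struct model_file_ops {
        int (*open)(struct model_file *f);
        long (*read)(struct model_file *f, char *buf, size_t len);
};

struct model_inode {
        const struct model_file_ops *fops;  /* installed by the filesystem */
        const char *data;                   /* stand-in for file contents */
};

struct model_file {
        struct model_inode *inode;
        const struct model_file_ops *f_op;  /* copied from the inode at open */
        size_t pos;
};

/* Filesystem side: the methods the "VFS" will dispatch to. */
static int ramfs_like_open(struct model_file *f)
{
        f->pos = 0;
        return 0;
}

static long ramfs_like_read(struct model_file *f, char *buf, size_t len)
{
        size_t left = strlen(f->inode->data) - f->pos;

        if (len > left)
                len = left;
        memcpy(buf, f->inode->data + f->pos, len);
        f->pos += len;
        return (long)len;
}

static const struct model_file_ops ramfs_like_fops = {
        .open = ramfs_like_open,
        .read = ramfs_like_read,
};

/* "VFS" side: initialise the file from the inode, then call ->open(). */
static int model_vfs_open(struct model_file *f, struct model_inode *inode)
{
        f->inode = inode;
        f->f_op = inode->fops;
        return f->f_op->open ? f->f_op->open(f) : 0;
}

/* "VFS" side: dispatch a read through the method table. */
static long model_vfs_read(struct model_file *f, char *buf, size_t len)
{
        return f->f_op->read ? f->f_op->read(f, buf, len) : -1;
}

int model_file_demo(void)
{
        struct model_inode inode = { &ramfs_like_fops, "hello" };
        struct model_file file;
        char buf[8] = { 0 };
        long n;
        int rc;

        rc = model_vfs_open(&file, &inode);
        assert(rc == 0);
        n = model_vfs_read(&file, buf, 3);      /* reads "hel" */
        assert(n == 3);
        n = model_vfs_read(&file, buf + 3, 4);  /* only "lo" is left */
        assert(n == 2);
        assert(strcmp(buf, "hello") == 0);
        return rc;
}
```

The generic code never names the filesystem; it only follows the method
table that was copied into the file structure when the file was opened.
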

Registering and Mounting a Filesystem
=====================================

To register and unregister a filesystem, use the following API
functions:

   #include <linux/fs.h>

   extern int register_filesystem(struct file_system_type *);
   extern int unregister_filesystem(struct file_system_type *);

The passed struct file_system_type describes your filesystem. When a
request is made to mount a device onto a directory in your filespace,
the VFS will call the appropriate get_sb() method for the specific
filesystem. The dentry for the mount point will then be updated to
point to the root inode for the new filesystem.

You can see all filesystems that are registered to the kernel in the
file /proc/filesystems.

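As a rough illustration of what registration involves, the following
user-space model keeps filesystem types on a singly linked list keyed by
name. It is a sketch, not the kernel implementation: the model_* names
are hypothetical, there is no locking, and the error codes returned are
choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model of a registration list; not kernel code. */
struct model_fs_type {
        const char *name;
        struct model_fs_type *next;  /* owned by the registry, start as NULL */
};

static struct model_fs_type *model_file_systems;

/* Return the link that points at `name`, or the terminating NULL link. */
static struct model_fs_type **model_find(const char *name)
{
        struct model_fs_type **p;

        for (p = &model_file_systems; *p; p = &(*p)->next)
                if (strcmp((*p)->name, name) == 0)
                        break;
        return p;
}

int model_register_filesystem(struct model_fs_type *fs)
{
        struct model_fs_type **p;

        if (fs->next)
                return -EBUSY;       /* already linked into a list */
        p = model_find(fs->name);
        if (*p)
                return -EBUSY;       /* name already registered */
        *p = fs;                     /* append at the tail */
        return 0;
}

int model_unregister_filesystem(struct model_fs_type *fs)
{
        struct model_fs_type **p;

        for (p = &model_file_systems; *p; p = &(*p)->next) {
                if (*p == fs) {
                        *p = fs->next;
                        fs->next = NULL;
                        return 0;
                }
        }
        return -EINVAL;              /* was never registered */
}

int model_register_demo(void)
{
        struct model_fs_type a = { "examplefs", NULL };
        struct model_fs_type dup = { "examplefs", NULL };
        int rc;

        rc = model_register_filesystem(&a);
        assert(rc == 0);
        rc = model_register_filesystem(&dup);
        assert(rc == -EBUSY);        /* same name, rejected */
        rc = model_unregister_filesystem(&dup);
        assert(rc == -EINVAL);       /* dup was never on the list */
        rc = model_unregister_filesystem(&a);
        assert(rc == 0);
        assert(model_file_systems == NULL);
        return rc;
}
```
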

struct file_system_type
-----------------------

This describes the filesystem. As of kernel 2.6.22, the following
members are defined:

struct file_system_type {
        const char *name;
        int fs_flags;
        int (*get_sb) (struct file_system_type *, int,
                       const char *, void *, struct vfsmount *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct list_head fs_supers;
        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
};

  name: the name of the filesystem type, such as "ext2", "iso9660",
        "msdos" and so on

  fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)

  get_sb: the method to call when a new instance of this
        filesystem should be mounted

  kill_sb: the method to call when an instance of this filesystem
        should be unmounted

  owner: for internal VFS use: you should initialize this to THIS_MODULE in
        most cases.

  next: for internal VFS use: you should initialize this to NULL

  s_lock_key, s_umount_key: lockdep-specific

The get_sb() method has the following arguments:

  struct file_system_type *fs_type: describes the filesystem, partly initialized
        by the specific filesystem code

  int flags: mount flags

  const char *dev_name: the device name we are mounting.

  void *data: arbitrary mount options, usually comes as an ASCII
        string

  struct vfsmount *mnt: a vfs-internal representation of a mount point

The get_sb() method must determine if the block device specified
in the dev_name and fs_type contains a filesystem of the type the method
supports. If it succeeds in opening the named block device, it initializes a
struct super_block descriptor for the filesystem contained by the block device.
On failure it returns an error.

The most interesting member of the superblock structure that the
get_sb() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.

Usually, a filesystem uses one of the generic get_sb() implementations
and provides a fill_super() method instead. The generic methods are:

  get_sb_bdev: mount a filesystem residing on a block device

  get_sb_nodev: mount a filesystem that is not backed by a device

  get_sb_single: mount a filesystem which shares the instance between
        all mounts

A fill_super() method implementation has the following arguments:

  struct super_block *sb: the superblock structure. The method fill_super()
        must initialize this properly.

  void *data: arbitrary mount options, usually comes as an ASCII
        string

  int silent: whether or not to be silent on error

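The division of labour between a generic get_sb() helper and a
filesystem's fill_super() can be sketched as a callback pattern. This is
a user-space model under assumptions: model_get_sb_nodev() takes far
fewer arguments than the real helpers, and "examplefs", its magic number
and its "bad" option are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* User-space model; the real helpers also take fs_type, flags and a vfsmount. */
struct model_super_ops { int unused; };

struct model_super_block {
        const struct model_super_ops *s_op;
        unsigned long s_magic;
};

typedef int (*model_fill_super_t)(struct model_super_block *sb,
                                  void *data, int silent);

/* Generic side: allocate a superblock and let the filesystem fill it in. */
static int model_get_sb_nodev(void *data, int silent, model_fill_super_t fill,
                              struct model_super_block **out)
{
        struct model_super_block *sb = calloc(1, sizeof(*sb));
        int err;

        if (!sb)
                return -ENOMEM;
        err = fill(sb, data, silent);
        if (err) {
                free(sb);            /* fill_super failed: nothing is mounted */
                return err;
        }
        *out = sb;
        return 0;
}

static const struct model_super_ops examplefs_sops = { 0 };
static int examplefs_complaints;     /* counts non-silent error reports */

/* Filesystem side: parse the option string and set up the superblock. */
static int examplefs_fill_super(struct model_super_block *sb,
                                void *data, int silent)
{
        const char *opts = data;

        if (opts && strcmp(opts, "bad") == 0) {
                if (!silent)
                        examplefs_complaints++;  /* a real fs would log here */
                return -EINVAL;
        }
        sb->s_magic = 0x45584d50;    /* arbitrary value for the example */
        sb->s_op = &examplefs_sops;  /* the member the text singles out */
        return 0;
}

int model_fill_super_demo(void)
{
        struct model_super_block *sb = NULL;
        char bad[] = "bad";
        int rc;

        rc = model_get_sb_nodev(NULL, 0, examplefs_fill_super, &sb);
        assert(rc == 0 && sb != NULL && sb->s_op == &examplefs_sops);
        free(sb);

        sb = NULL;
        rc = model_get_sb_nodev(bad, 1, examplefs_fill_super, &sb);
        assert(rc == -EINVAL && sb == NULL);
        assert(examplefs_complaints == 0);   /* silent was set */
        return 0;
}
```
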

The Superblock Object
=====================

A superblock object represents a mounted filesystem.


struct super_operations
-----------------------

This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.6.22, the following members are defined:

struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);

        void (*read_inode) (struct inode *);

        void (*dirty_inode) (struct inode *);
        int (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*drop_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        void (*write_super_lockfs) (struct super_block *);
        void (*unlockfs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);

        int (*show_options)(struct seq_file *, struct vfsmount *);

        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
};

All methods are called without any locks being held, unless otherwise
noted. This means that most methods can block safely. All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).

  alloc_inode: this method is called by inode_alloc() to allocate memory
        for struct inode and initialize it. If this function is not
        defined, a simple 'struct inode' is allocated. Normally
        alloc_inode will be used to allocate a larger structure which
        contains a 'struct inode' embedded within it.

  destroy_inode: this method is called by destroy_inode() to release
        resources allocated for struct inode. It is only required if
        ->alloc_inode was defined and simply undoes anything done by
        ->alloc_inode.

  read_inode: this method is called to read a specific inode from the
        mounted filesystem. The i_ino member in the struct inode is
        initialized by the VFS to indicate which inode to read. Other
        members are filled in by this method.

        You can set this to NULL and use iget5_locked() instead of iget()
        to read inodes. This is necessary for filesystems for which the
        inode number is not sufficient to identify an inode.

  dirty_inode: this method is called by the VFS to mark an inode dirty.

  write_inode: this method is called when the VFS needs to write an
        inode to disc. The second parameter indicates whether the write
        should be synchronous or not; not all filesystems check this flag.

  put_inode: called when the VFS inode is removed from the inode
        cache.

  drop_inode: called when the last access to the inode is dropped,
        with the inode_lock spinlock held.

        This method should be either NULL (normal UNIX filesystem
        semantics) or "generic_delete_inode" (for filesystems that do not
        want to cache inodes - causing "delete_inode" to always be
        called regardless of the value of i_nlink)

        The "generic_delete_inode()" behavior is equivalent to the
        old practice of using "force_delete" in the put_inode() case,
        but does not have the races that the "force_delete()" approach
        had.

  delete_inode: called when the VFS wants to delete an inode

  put_super: called when the VFS wishes to free the superblock
        (i.e. unmount). This is called with the superblock lock held

  write_super: called when the VFS superblock needs to be written to
        disc. This method is optional

  sync_fs: called when VFS is writing out all dirty data associated with
        a superblock. The second parameter indicates whether the method
        should wait until the write out has been completed. Optional.

  write_super_lockfs: called when VFS is locking a filesystem and
        forcing it into a consistent state. This method is currently
        used by the Logical Volume Manager (LVM).

  unlockfs: called when VFS is unlocking a filesystem and making it writable
        again.

  statfs: called when the VFS needs to get filesystem statistics. This
        is called with the kernel lock held

  remount_fs: called when the filesystem is remounted. This is called
        with the kernel lock held

  clear_inode: called when the VFS clears the inode. Optional

  umount_begin: called when the VFS is unmounting a filesystem.

  show_options: called by the VFS to show mount options for /proc/<pid>/mounts.

  quota_read: called by the VFS to read from filesystem quota file.

  quota_write: called by the VFS to write to filesystem quota file.

The read_inode() method is responsible for filling in the "i_op"
field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.

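The alloc_inode and destroy_inode descriptions above mention a larger
structure with a 'struct inode' embedded within it. The pattern can be
sketched in user space as follows. The names struct examplefs_inode_info,
EXAMPLEFS_I() and model_container_of() are hypothetical, the model
methods take no superblock argument, and calloc()/free() stand in for
whatever allocator a real filesystem would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space model; struct model_inode stands in for struct inode. */
struct model_inode { unsigned long i_ino; };

/* Filesystem-private inode with the generic inode embedded inside it. */
struct examplefs_inode_info {
        unsigned int flags;          /* hypothetical per-filesystem state */
        struct model_inode vfs_inode;
};

/* Step back from a member pointer to the structure that contains it. */
#define model_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static struct examplefs_inode_info *EXAMPLEFS_I(struct model_inode *inode)
{
        return model_container_of(inode, struct examplefs_inode_info,
                                  vfs_inode);
}

/* ->alloc_inode model: allocate the larger structure, return the inner one. */
static struct model_inode *examplefs_alloc_inode(void)
{
        struct examplefs_inode_info *ei = calloc(1, sizeof(*ei));

        if (!ei)
                return NULL;
        ei->flags = 0x1;
        return &ei->vfs_inode;
}

/* ->destroy_inode model: undo exactly what ->alloc_inode did. */
static void examplefs_destroy_inode(struct model_inode *inode)
{
        free(EXAMPLEFS_I(inode));
}

int model_alloc_inode_demo(void)
{
        struct model_inode *inode = examplefs_alloc_inode();

        assert(inode != NULL);
        assert(EXAMPLEFS_I(inode)->flags == 0x1);
        /* The inode is not the first member, so the pointers differ. */
        assert((void *)EXAMPLEFS_I(inode) != (void *)inode);
        examplefs_destroy_inode(inode);
        return 0;
}
```

Generic code only ever sees the embedded inode pointer; the filesystem
recovers its private data by subtracting the member offset.
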

The Inode Object
================

An inode object represents an object within the filesystem.


struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.6.22, the following members are defined:

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
};

Again, all methods are called without any locks being held, unless
otherwise noted.

  create: called by the open(2) and creat(2) system calls. Only
        required if you want to support regular files. The dentry you
        get should not have an inode (i.e. it should be a negative
        dentry). Here you will probably call d_instantiate() with the
        dentry and the newly created inode

  lookup: called when the VFS needs to look up an inode in a parent
        directory. The name to look for is found in the dentry. This
        method must call d_add() to insert the found inode into the
        dentry. The "i_count" field in the inode structure should be
        incremented. If the named inode does not exist a NULL inode
        should be inserted into the dentry (this is called a negative
        dentry). Returning an error code from this routine must only
        be done on a real error, otherwise creating inodes with system
        calls like creat(2), mknod(2), mkdir(2) and so on will fail.
        If you wish to overload the dentry methods then you should
        initialise the "d_op" field in the dentry; this is a pointer
        to a struct "dentry_operations".
        This method is called with the directory inode semaphore held

  link: called by the link(2) system call. Only required if you want
        to support hard links. You will probably need to call
        d_instantiate() just as you would in the create() method

  unlink: called by the unlink(2) system call. Only required if you
        want to support deleting inodes

  symlink: called by the symlink(2) system call. Only required if you
        want to support symlinks. You will probably need to call
        d_instantiate() just as you would in the create() method

  mkdir: called by the mkdir(2) system call. Only required if you want
        to support creating subdirectories. You will probably need to
        call d_instantiate() just as you would in the create() method

  rmdir: called by the rmdir(2) system call. Only required if you want
        to support deleting subdirectories

  mknod: called by the mknod(2) system call to create a device (char,
        block) inode or a named pipe (FIFO) or socket. Only required
        if you want to support creating these types of inodes. You
        will probably need to call d_instantiate() just as you would
        in the create() method

  rename: called by the rename(2) system call to rename the object to
        have the parent and name given by the second inode and dentry.

  readlink: called by the readlink(2) system call. Only required if
        you want to support reading symbolic links

  follow_link: called by the VFS to follow a symbolic link to the
        inode it points to. Only required if you want to support
        symbolic links. This method returns a void pointer cookie
        that is passed to put_link().

  put_link: called by the VFS to release resources allocated by
        follow_link(). The cookie returned by follow_link() is passed
        to this method as the last parameter. It is used by
        filesystems such as NFS where page cache is not stable
        (i.e. page that was installed when the symbolic link walk
        started might not be in the page cache at the end of the
        walk).

  truncate: called by the VFS to change the size of a file. The
        i_size field of the inode is set to the desired size by the
        VFS before this method is called. This method is called by
        the truncate(2) system call and related functionality.

  permission: called by the VFS to check for access rights on a POSIX-like
        filesystem.

  setattr: called by the VFS to set attributes for a file. This method
        is called by chmod(2) and related system calls.

  getattr: called by the VFS to get attributes of a file. This method
        is called by stat(2) and related system calls.

  setxattr: called by the VFS to set an extended attribute for a file.
        An extended attribute is a name:value pair associated with an
        inode. This method is called by the setxattr(2) system call.

  getxattr: called by the VFS to retrieve the value of an extended
        attribute name. This method is called by the getxattr(2) system
        call.

  listxattr: called by the VFS to list all extended attributes for a
        given file. This method is called by the listxattr(2) system call.

  removexattr: called by the VFS to remove an extended attribute from
        a file. This method is called by the removexattr(2) system call.

  truncate_range: a method provided by the underlying filesystem to truncate a
        range of blocks, i.e. punch a hole somewhere in a file.

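The rule for lookup (a missing name yields a negative dentry, not an
error) is easy to get wrong, so here is a user-space sketch of it. The
model simplifies heavily: examplefs_lookup() returns an int instead of a
struct dentry pointer, model_d_add() merely stores the inode pointer, and
the single "README" entry is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model only; not the kernel's inode or dentry. */
struct model_inode { unsigned long i_ino; unsigned int i_count; };
struct model_dentry { const char *d_name; struct model_inode *d_inode; };

/* Stand-in for d_add(): attach an inode (or NULL) to the dentry. */
static void model_d_add(struct model_dentry *dentry, struct model_inode *inode)
{
        dentry->d_inode = inode;
}

/* The only object that exists in this model directory. */
static struct model_inode examplefs_readme = { 2, 0 };

/* ->lookup model: 0 for "found" and "not found", <0 only on real errors. */
static int examplefs_lookup(struct model_dentry *dentry)
{
        if (strlen(dentry->d_name) > 255)
                return -ENAMETOOLONG;        /* a genuine error */

        if (strcmp(dentry->d_name, "README") == 0) {
                examplefs_readme.i_count++;  /* take a reference */
                model_d_add(dentry, &examplefs_readme);
        } else {
                model_d_add(dentry, NULL);   /* negative dentry, not an error */
        }
        return 0;
}

int model_lookup_demo(void)
{
        struct model_dentry hit = { "README", NULL };
        struct model_dentry miss = { "newfile", NULL };
        unsigned int before = examplefs_readme.i_count;
        int rc;

        rc = examplefs_lookup(&hit);
        assert(rc == 0 && hit.d_inode == &examplefs_readme);
        assert(examplefs_readme.i_count == before + 1);

        rc = examplefs_lookup(&miss);
        assert(rc == 0 && miss.d_inode == NULL);
        return rc;
}
```

Because the miss returns success with a NULL inode, a later create-style
call in this model would still have a dentry to instantiate.
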

The Address Space Object
========================

The address space object is used to group and manage pages in the page
cache. It can be used to keep track of the pages in a file (or
anything else) and also track the mapping of sections of the file into
process address spaces.

There are a number of distinct yet related services that an
address-space can provide. These include communicating memory
pressure, page lookup by address, and keeping track of pages tagged as
Dirty or Writeback.

The first can be used independently of the others. The VM can try to
either write dirty pages in order to clean them, or release clean
pages in order to reuse them. To do this it can call the ->writepage
method on dirty pages, and ->releasepage on clean pages with
PagePrivate set. Clean pages without PagePrivate and with no external
references will be released without notice being given to the
address_space.

To achieve this functionality, pages need to be placed on an LRU with
lru_cache_add and mark_page_active needs to be called whenever the
page is used.

Pages are normally kept in a radix tree indexed by ->index. This tree
maintains information about the PG_Dirty and PG_Writeback status of
each page, so that pages with either of these flags can be found
quickly.

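The idea of finding pages by tag can be shown with a deliberately small
user-space model. It is an assumption-laden sketch: a flat array indexed
by page index stands in for the radix tree (so the search here is linear,
unlike the real tree), and the MODEL_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NPAGES        8
#define MODEL_TAG_DIRTY     0x1u
#define MODEL_TAG_WRITEBACK 0x2u

/* A flat array indexed by page index stands in for the radix tree. */
struct model_mapping {
        unsigned int tags[MODEL_NPAGES];
};

static void model_set_tag(struct model_mapping *m, size_t index,
                          unsigned int tag)
{
        m->tags[index] |= tag;
}

/* Collect up to `max` page indices at or after `start` that carry `tag`. */
static size_t model_lookup_tag(const struct model_mapping *m, size_t start,
                               unsigned int tag, size_t *out, size_t max)
{
        size_t n = 0;
        size_t i;

        for (i = start; i < MODEL_NPAGES && n < max; i++)
                if (m->tags[i] & tag)
                        out[n++] = i;
        return n;
}

int model_tag_demo(void)
{
        struct model_mapping m = { { 0 } };
        size_t found[MODEL_NPAGES];
        size_t n;

        model_set_tag(&m, 1, MODEL_TAG_DIRTY);
        model_set_tag(&m, 3, MODEL_TAG_WRITEBACK);
        model_set_tag(&m, 5, MODEL_TAG_DIRTY);

        n = model_lookup_tag(&m, 0, MODEL_TAG_DIRTY, found, MODEL_NPAGES);
        assert(n == 2 && found[0] == 1 && found[1] == 5);

        n = model_lookup_tag(&m, 0, MODEL_TAG_WRITEBACK, found, MODEL_NPAGES);
        assert(n == 1 && found[0] == 3);

        /* Resuming the scan after index 1 finds only the later dirty page. */
        n = model_lookup_tag(&m, 2, MODEL_TAG_DIRTY, found, MODEL_NPAGES);
        assert(n == 1 && found[0] == 5);
        return 0;
}
```
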
				481	The Dirty tag is primarily used by mpage_writepages - the default
				482	->writepages method. It uses the tag to find dirty pages to call
				483	->writepage on. If mpage_writepages is not used (i.e. the address
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	484	provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	485	almost unused. write_inode_now and sync_inode do use it (through
				486	__sync_single_inode) to check if ->writepages has been successful in
				487	writing out the whole address_space.
				488
The Writeback tag is used by filemap*wait* and sync_page* functions,
via wait_on_page_writeback_range, to wait for all writeback to
complete. While waiting, ->sync_page (if defined) will be called on
each page that is found to require writeback.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	493
				494	An address_space handler may attach extra information to a page,
				495	typically using the 'private' field in the 'struct page'. If such
				496	information is attached, the PG_Private flag should be set. This will
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	497	cause various VM routines to make extra calls into the address_space
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	498	handler to deal with that data.
				499
An address space acts as an intermediary between storage and
application. Data is read into the address space a whole page at a
time, and provided to the application either by copying of the page,
or by memory-mapping the page.
Data is written into the address space by the application, and then
written back to storage, typically in whole pages; however, the
address_space has finer control of write sizes.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	507
				508	The read process essentially only requires 'readpage'. The write
				509	process is more complicated and uses prepare_write/commit_write or
				510	set_page_dirty to write data into the address_space, and writepage,
				511	sync_page, and writepages to writeback data to storage.
				512
				513	Adding and removing pages to/from an address_space is protected by the
				514	inode's i_mutex.
				515
				516	When data is written to a page, the PG_Dirty flag should be set. It
				517	typically remains set until writepage asks for it to be written. This
				518	should clear PG_Dirty and set PG_Writeback. It can be actually
				519	written at any point after PG_Dirty is clear. Once it is known to be
				520	safe, PG_Writeback is cleared.
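
The flag transitions described above can be sketched as a small
userspace state model. This is only an illustration: struct model_page
and the model_* functions are hypothetical names, not kernel
interfaces, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the two page flags discussed above. */
struct model_page {
	bool dirty;      /* models PG_Dirty */
	bool writeback;  /* models PG_Writeback */
};

/* Data was written into the page: mark it dirty. */
void model_write_to_page(struct model_page *p)
{
	p->dirty = true;
}

/*
 * Writeback starts: PG_Dirty is cleared and PG_Writeback is set.
 * Returns false if the page was clean and nothing was started.
 */
bool model_start_writeback(struct model_page *p)
{
	if (!p->dirty)
		return false;
	p->dirty = false;
	p->writeback = true;
	return true;
}

/* The data is known to be safe on storage: clear PG_Writeback. */
void model_end_writeback(struct model_page *p)
{
	assert(p->writeback);
	p->writeback = false;
}
```

Because the dirty flag is cleared before the data is actually written,
a page that is modified again during writeback ends up dirty once more
and will be picked up by a later writeback pass.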
				521
				522	Writeback makes use of a writeback_control structure...
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	523
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	524	struct address_space_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	525	-------------------------------
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	526
				527	This describes how the VFS can manipulate mapping of a file to page cache in
Borislav Petkov	422b14c	2007-07-15 23:41:43 -0700	[diff] [blame]	528	your filesystem. As of kernel 2.6.22, the following members are defined:
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	529
struct address_space_operations {
	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*readpage)(struct file *, struct page *);
	int (*sync_page)(struct page *);
	int (*writepages)(struct address_space *, struct writeback_control *);
	int (*set_page_dirty)(struct page *page);
	int (*readpages)(struct file *filp, struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages);
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	int (*write_begin)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned flags,
				struct page **pagep, void **fsdata);
	int (*write_end)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned copied,
				struct page *page, void *fsdata);
	sector_t (*bmap)(struct address_space *, sector_t);
	int (*invalidatepage) (struct page *, unsigned long);
	int (*releasepage) (struct page *, int);
	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
			loff_t offset, unsigned long nr_segs);
	struct page* (*get_xip_page)(struct address_space *, sector_t,
			int);
	/* migrate the contents of a page to the specified target */
	int (*migratepage) (struct page *, struct page *);
	int (*launder_page) (struct page *);
};
				557
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	558	writepage: called by the VM to write a dirty page to backing store.
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	559	This may happen for data integrity reasons (i.e. 'sync'), or
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	560	to free up memory (flush). The difference can be seen in
				561	wbc->sync_mode.
				562	The PG_Dirty flag has been cleared and PageLocked is true.
				563	writepage should start writeout, should set PG_Writeback,
				564	and should make sure the page is unlocked, either synchronously
				565	or asynchronously when the write operation completes.
				566
				567	If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	568	try too hard if there are problems, and may choose to write out
				569	other pages from the mapping if that is easier (e.g. due to
				570	internal dependencies). If it chooses not to start writeout, it
				571	should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	572	calling ->writepage on that page.
				573
				574	See the file "Locking" for more details.
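
As a rough illustration of the writepage contract just described, the
following userspace sketch models the decision between starting
writeout and declining under WB_SYNC_NONE. All model_* names, the
"busy" field and the constant value are assumptions made for this
example. Leaving the page locked and redirtied when declining is also
an assumption of the model; the "Locking" file is the place to check
the real rules.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel definitions; the values are illustrative. */
enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };
#define MODEL_AOP_WRITEPAGE_ACTIVATE 0x80000

struct model_wbc {
	enum model_sync_mode sync_mode;
};

struct model_page {
	bool locked;     /* PageLocked */
	bool dirty;      /* PG_Dirty, already cleared by the caller */
	bool writeback;  /* PG_Writeback */
	bool busy;       /* hypothetical: writing now would be awkward */
};

int model_writepage(struct model_page *page, const struct model_wbc *wbc)
{
	/* Entry conditions from the text: locked, dirty flag cleared. */
	assert(page->locked && !page->dirty);

	if (page->busy && wbc->sync_mode == MODEL_WB_SYNC_NONE) {
		/* Flush for memory: allowed to give up on this page. */
		page->dirty = true;	/* model assumption: redirty it */
		return MODEL_AOP_WRITEPAGE_ACTIVATE;
	}

	/* Start writeout and make sure the page ends up unlocked. */
	page->writeback = true;
	page->locked = false;	/* synchronous unlock in this model */
	return 0;
}
```

In this model a data-integrity call (MODEL_WB_SYNC_ALL) always starts
writeout, which mirrors the statement that only the WB_SYNC_NONE case
is allowed not to try too hard.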
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	575
				576	readpage: called by the VM to read a page from backing store.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	577	The page will be Locked when readpage is called, and should be
				578	unlocked and marked uptodate once the read completes.
				579	If ->readpage discovers that it needs to unlock the page for
				580	some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	581	In this case, the page will be relocated, relocked and if
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	582	that all succeeds, ->readpage will be called again.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	583
				584	sync_page: called by the VM to notify the backing store to perform all
				585	queued I/O operations for a page. I/O operations for other pages
				586	associated with this address_space object may also be performed.
				587
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	588	This function is optional and is called only for pages with
				589	PG_Writeback set while waiting for the writeback to complete.
				590
writepages: called by the VM to write out pages associated with the
address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
the writeback_control will specify a range of pages that must be
written out. If it is WB_SYNC_NONE, then a nr_to_write is given
and that many pages should be written if possible.
If no ->writepages is given, then mpage_writepages is used
instead. This will choose pages from the address space that are
tagged as DIRTY and will pass them to ->writepage.
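
A minimal userspace sketch of the two writepages modes follows. It is
not the kernel implementation: the model_* types are hypothetical, the
dirty tag is a plain array instead of a radix tree tag, and the range
is expressed in page indices to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

struct model_wbc {
	bool sync_all;       /* true models WB_SYNC_ALL, false WB_SYNC_NONE */
	size_t range_start;  /* first page index, used when sync_all is set */
	size_t range_end;    /* last page index (inclusive) */
	long nr_to_write;    /* budget, used when sync_all is clear */
};

struct model_space {
	bool dirty_tag[MODEL_NR_PAGES];  /* models the DIRTY tag */
	int written[MODEL_NR_PAGES];     /* how often each page was written */
};

/* Stand-in for ->writepage: the page is written and loses its tag. */
static void model_writepage(struct model_space *s, size_t index)
{
	s->dirty_tag[index] = false;
	s->written[index]++;
}

int model_writepages(struct model_space *s, struct model_wbc *wbc)
{
	size_t first = wbc->sync_all ? wbc->range_start : 0;
	size_t last = wbc->sync_all ? wbc->range_end : MODEL_NR_PAGES - 1;
	size_t i;

	for (i = first; i <= last && i < MODEL_NR_PAGES; i++) {
		if (!s->dirty_tag[i])
			continue;	/* only tagged pages are of interest */
		if (!wbc->sync_all && wbc->nr_to_write <= 0)
			break;		/* budget used up */
		model_writepage(s, i);
		wbc->nr_to_write--;
	}
	return 0;
}
```

With sync_all set, every dirty page in the range is written regardless
of the budget; without it, the loop stops as soon as nr_to_write pages
have been handed to the writepage stand-in.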
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	599
				600	set_page_dirty: called by the VM to set a page dirty.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	601	This is particularly needed if an address space attaches
				602	private data to a page, and that data needs to be updated when
				603	a page is dirtied. This is called, for example, when a memory
				604	mapped page gets modified.
				605	If defined, it should set the PageDirty flag, and the
				606	PAGECACHE_TAG_DIRTY tag in the radix tree.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	607
				608	readpages: called by the VM to read pages associated with the address_space
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	609	object. This is essentially just a vector version of
				610	readpage. Instead of just one page, several pages are
				611	requested.
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	612	readpages is only used for read-ahead, so read errors are
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	613	ignored. If anything goes wrong, feel free to give up.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	614
				615	prepare_write: called by the generic write path in VM to set up a write
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	616	request for a page. This indicates to the address space that
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	617	the given range of bytes is about to be written. The
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	618	address_space should check that the write will be able to
				619	complete, by allocating space if necessary and doing any other
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	620	internal housekeeping. If the write will update parts of
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	621	any basic-blocks on storage, then those blocks should be
				622	pre-read (if they haven't been read already) so that the
				623	updated blocks can be written out properly.
The page will be locked. If prepare_write wants to unlock the
page it, like readpage, may do so and return
AOP_TRUNCATED_PAGE.
In this case the prepare_write will be retried once the lock is
regained.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	629
Nick Piggin	955eff5	2007-02-20 13:58:08 -0800	[diff] [blame]	630	Note: the page _must not_ be marked uptodate in this function
				631	(or anywhere else) unless it actually is uptodate right now. As
				632	soon as a page is marked uptodate, it is possible for a concurrent
				633	read(2) to copy it to userspace.
				634
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	635	commit_write: If prepare_write succeeds, new data will be copied
				636	into the page and then commit_write will be called. It will
				637	typically update the size of the file (if appropriate) and
				638	mark the inode as dirty, and do any other related housekeeping
				639	operations. It should avoid returning an error if possible -
				640	errors should have been handled by prepare_write.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	641
write_begin: This is intended as a replacement for prepare_write. The
key differences are that:
- it returns a locked page (in *pagep) rather than being
  given a pre-locked page;
- it must be able to cope with short writes (where the
  length passed to write_begin is greater than the number
  of bytes copied into the page).
				649
				650	Called by the generic buffered write code to ask the filesystem to
				651	prepare to write len bytes at the given offset in the file. The
				652	address_space should check that the write will be able to complete,
				653	by allocating space if necessary and doing any other internal
				654	housekeeping. If the write will update parts of any basic-blocks on
				655	storage, then those blocks should be pre-read (if they haven't been
				656	read already) so that the updated blocks can be written out properly.
				657
				658	The filesystem must return the locked pagecache page for the specified
				659	offset, in *pagep, for the caller to write into.
				660
				661	flags is a field for AOP_FLAG_xxx flags, described in
				662	include/linux/fs.h.
				663
				664	A void * may be returned in fsdata, which then gets passed into
				665	write_end.
				666
				667	Returns 0 on success; < 0 on failure (which is the error code), in
				668	which case write_end is not called.
				669
				670	write_end: After a successful write_begin, and data copy, write_end must
				671	be called. len is the original len passed to write_begin, and copied
				672	is the amount that was able to be copied (copied == len is always true
				673	if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
				674
The filesystem must take care of unlocking the page and releasing its
refcount, and updating i_size.
				677
				678	Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
				679	that were able to be copied into pagecache.
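
The write_begin/write_end pairing and the handling of short copies can
be sketched in userspace as follows. This is a simplified model under
stated assumptions: the model_* names are hypothetical, pages are tiny
in-memory arrays, the flags and fsdata arguments are omitted, and
max_copy only exists to simulate a copy that comes up short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16u	/* tiny pages keep the example readable */
#define MODEL_NPAGES    4u

struct model_mapping {
	unsigned char pages[MODEL_NPAGES][MODEL_PAGE_SIZE];
	int locked[MODEL_NPAGES];
	size_t i_size;
};

/* Models ->write_begin: hands back the locked page for pos in *pagep. */
int model_write_begin(struct model_mapping *m, size_t pos, unsigned len,
		      unsigned char **pagep)
{
	size_t index = pos / MODEL_PAGE_SIZE;

	if (index >= MODEL_NPAGES ||
	    pos % MODEL_PAGE_SIZE + len > MODEL_PAGE_SIZE)
		return -1;	/* failure: write_end is not called */
	m->locked[index] = 1;
	*pagep = m->pages[index];
	return 0;
}

/* Models ->write_end: unlocks the page, updates i_size, returns bytes. */
int model_write_end(struct model_mapping *m, size_t pos, unsigned len,
		    unsigned copied)
{
	size_t index = pos / MODEL_PAGE_SIZE;

	assert(copied <= len && m->locked[index]);
	m->locked[index] = 0;
	if (pos + copied > m->i_size)
		m->i_size = pos + copied;
	return (int)copied;
}

/* Caller-side loop: one write_begin/write_end pair per chunk. */
long model_buffered_write(struct model_mapping *m, size_t pos,
			  const unsigned char *buf, size_t count,
			  unsigned max_copy)
{
	size_t done = 0;

	while (done < count) {
		unsigned offset = (unsigned)(pos % MODEL_PAGE_SIZE);
		unsigned len = MODEL_PAGE_SIZE - offset;
		unsigned copied;
		unsigned char *page;
		int ret;

		if (len > count - done)
			len = (unsigned)(count - done);
		if (model_write_begin(m, pos, len, &page) < 0)
			return done ? (long)done : -1;
		copied = len < max_copy ? len : max_copy; /* maybe short */
		memcpy(page + offset, buf + done, copied);
		ret = model_write_end(m, pos, len, copied);
		if (ret <= 0)
			break;	/* no progress: stop instead of spinning */
		pos += (size_t)ret;
		done += (size_t)ret;
	}
	return (long)done;
}
```

The loop advances by the value returned from the write_end stand-in
rather than by len, which is what lets it cope with short copies.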
				680
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	681	bmap: called by the VFS to map a logical block offset within object to
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	682	physical block number. This method is used by the FIBMAP
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	683	ioctl and for working with swap-files. To be able to swap to
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	684	a file, the file must have a stable mapping to a block
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	685	device. The swap system does not go through the filesystem
				686	but instead uses bmap to find out where the blocks in the file
				687	are and uses those addresses directly.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	688
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	689
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	690	invalidatepage: If a page has PagePrivate set, then invalidatepage
				691	will be called when part or all of the page is to be removed
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	692	from the address space. This generally corresponds to either a
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	693	truncation or a complete invalidation of the address space
				694	(in the latter case 'offset' will always be 0).
				695	Any private data associated with the page should be updated
				696	to reflect this truncation. If offset is 0, then
				697	the private data should be released, because the page
				698	must be able to be completely discarded. This may be done by
				699	calling the ->releasepage function, but in this case the
				700	release MUST succeed.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	701
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	702	releasepage: releasepage is called on PagePrivate pages to indicate
				703	that the page should be freed if possible. ->releasepage
				704	should remove any private data from the page and clear the
				705	PagePrivate flag. It may also remove the page from the
				706	address_space. If this fails for some reason, it may indicate
				707	failure with a 0 return value.
				708	This is used in two distinct though related cases. The first
				709	is when the VM finds a clean page with no active users and
				710	wants to make it a free page. If ->releasepage succeeds, the
				711	page will be removed from the address_space and become free.
				712
The second case is when a request has been made to invalidate
some or all pages in an address_space. This can happen
through the fadvise(POSIX_FADV_DONTNEED) system call or by the
filesystem explicitly requesting it as nfs and 9fs do (when
they believe the cache may be out of date with storage) by
calling invalidate_inode_pages2().
If the filesystem makes such a call, and needs to be certain
that all pages are invalidated, then its releasepage will
need to ensure this. Possibly it can clear the PageUptodate
bit if it cannot free private data yet.
				723
				724	direct_IO: called by the generic read/write routines to perform
				725	direct_IO - that is IO requests which bypass the page cache
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	726	and transfer data directly between the storage and the
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	727	application's address space.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	728
				729	get_xip_page: called by the VM to translate a block number to a page.
				730	The page is valid until the corresponding filesystem is unmounted.
				731	Filesystems that want to use execute-in-place (XIP) need to implement
				732	it. An example implementation can be found in fs/ext2/xip.c.
				733
migratepage: This is used to compact the physical memory usage.
If the VM wants to relocate a page (maybe off a memory card
that is signalling imminent failure) it will pass a new page
and an old page to this function. migratepage should
transfer any private data across and update any references
that it has to the page.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	740
Borislav Petkov	422b14c	2007-07-15 23:41:43 -0700	[diff] [blame]	741	launder_page: Called before freeing a page - it writes back the dirty page. To
				742	prevent redirtying the page, it is kept locked during the whole
				743	operation.
				744
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	745	The File Object
				746	===============
				747
				748	A file object represents a file opened by a process.
				749
				750
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	751	struct file_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	752	----------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	753
				754	This describes how the VFS can manipulate an open file. As of kernel
Borislav Petkov	422b14c	2007-07-15 23:41:43 -0700	[diff] [blame]	755	2.6.22, the following members are defined:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	756
struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*dir_notify)(struct file *filp, unsigned long arg);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
};
				788
				789	Again, all methods are called without any locks being held, unless
				790	otherwise noted.
				791
				792	llseek: called when the VFS needs to move the file position index
				793
				794	read: called by read(2) and related system calls
				795
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	796	aio_read: called by io_submit(2) and other asynchronous I/O operations
				797
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	798	write: called by write(2) and related system calls
				799
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	800	aio_write: called by io_submit(2) and other asynchronous I/O operations
				801
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	802	readdir: called when the VFS needs to read the directory contents
				803
				804	poll: called by the VFS when a process wants to check if there is
				805	activity on this file and (optionally) go to sleep until there
				806	is activity. Called by the select(2) and poll(2) system calls
				807
				808	ioctl: called by the ioctl(2) system call
				809
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	810	unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
				811	require the BKL should use this method instead of the ioctl() above.
				812
				813	compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
				814	are used on 64 bit kernels.
				815
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	816	mmap: called by the mmap(2) system call
				817
				818	open: called by the VFS when an inode should be opened. When the VFS
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	819	opens a file, it creates a new "struct file". It then calls the
				820	open method for the newly allocated file structure. You might
				821	think that the open method really belongs in
				822	"struct inode_operations", and you may be right. I think it's
				823	done the way it is because it makes filesystems simpler to
				824	implement. The open() method is a good place to initialize the
				825	"private_data" member in the file structure if you want to point
				826	to a device structure
				827
				828	flush: called by the close(2) system call to flush a file
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	829
				830	release: called when the last reference to an open file is closed
				831
				832	fsync: called by the fsync(2) system call
				833
				834	fasync: called by the fcntl(2) system call when asynchronous
				835	(non-blocking) mode is enabled for a file
				836
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	837	lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
				838	commands
				839
				840	readv: called by the readv(2) system call
				841
				842	writev: called by the writev(2) system call
				843
				844	sendfile: called by the sendfile(2) system call
				845
				846	get_unmapped_area: called by the mmap(2) system call
				847
				848	check_flags: called by the fcntl(2) system call for F_SETFL command
				849
				850	dir_notify: called by the fcntl(2) system call for F_NOTIFY command
				851
				852	flock: called by the flock(2) system call
				853
Pekka J Enberg	d1195c5	2006-04-11 14:21:59 +0200	[diff] [blame]	854	splice_write: called by the VFS to splice data from a pipe to a file. This
				855	method is used by the splice(2) system call
				856
				857	splice_read: called by the VFS to splice data from file to a pipe. This
				858	method is used by the splice(2) system call
				859
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	860	Note that the file operations are implemented by the specific
				861	filesystem in which the inode resides. When opening a device node
				862	(character or block special) most filesystems will call special
				863	support routines in the VFS which will locate the required device
				864	driver information. These support routines replace the filesystem file
				865	operations with those for the device driver, and then proceed to call
				866	the new open() method for the file. This is how opening a device file
				867	in the filesystem eventually ends up calling the device driver open()
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	868	method.
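
The hand-over from filesystem file operations to driver file
operations can be illustrated with a small userspace model. Everything
here is a sketch under assumptions: the model_* types, the drv_* and
fs_* functions and the is_device_node field are invented for the
example and do not correspond to real kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

struct model_file;

struct model_file_operations {
	int (*open)(struct model_file *);
	long (*read)(struct model_file *, char *, size_t);
};

struct model_file {
	const struct model_file_operations *f_op;
	void *private_data;
	int is_device_node;	/* hypothetical stand-in for the inode type */
};

/* Hypothetical device driver state and methods. */
struct model_device { char fill; };
static struct model_device model_dev = { 'x' };

static int drv_open(struct model_file *f)
{
	f->private_data = &model_dev;	/* point at the device structure */
	return 0;
}

static long drv_read(struct model_file *f, char *buf, size_t n)
{
	struct model_device *dev = f->private_data;
	size_t i;

	for (i = 0; i < n; i++)
		buf[i] = dev->fill;
	return (long)n;
}

const struct model_file_operations model_driver_fops = {
	.open = drv_open,
	.read = drv_read,
};

/* Filesystem open: for device nodes, swap in the driver's operations. */
static int fs_open(struct model_file *f)
{
	if (f->is_device_node) {
		f->f_op = &model_driver_fops;
		return f->f_op->open ? f->f_op->open(f) : 0;
	}
	return 0;
}

static long fs_read(struct model_file *f, char *buf, size_t n)
{
	(void)f; (void)buf; (void)n;
	return 0;	/* regular file in this model: always at EOF */
}

const struct model_file_operations model_fs_fops = {
	.open = fs_open,
	.read = fs_read,
};

/* Models the VFS side: new file, filesystem ops, then call open. */
int model_vfs_open(struct model_file *f, int is_device_node)
{
	f->f_op = &model_fs_fops;
	f->private_data = NULL;
	f->is_device_node = is_device_node;
	return f->f_op->open(f);
}
```

After model_vfs_open() returns for a device node, later calls such as
read go through the driver's table, and the driver finds its own state
through private_data.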
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	869
				870
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	871	Directory Entry Cache (dcache)
				872	==============================
				873
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	874
				875	struct dentry_operations
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	876	------------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	877
				878	This describes how a filesystem can overload the standard dentry
				879	operations. Dentries and the dcache are the domain of the VFS and the
				880	individual filesystem implementations. Device drivers have no business
				881	here. These methods may be set to NULL, as they are either optional or
Eric Dumazet	c23fbb6	2007-05-08 00:26:18 -0700	[diff] [blame]	882	the VFS uses a default. As of kernel 2.6.22, the following members are
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	883	defined:
				884
struct dentry_operations {
	int (*d_revalidate)(struct dentry *, struct nameidata *);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	int (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *, char *, int);
};
				894
				895	d_revalidate: called when the VFS needs to revalidate a dentry. This
				896	is called whenever a name look-up finds a dentry in the
				897	dcache. Most filesystems leave this as NULL, because all their
				898	dentries in the dcache are valid
				899
				900	d_hash: called when the VFS adds a dentry to the hash table
				901
				902	d_compare: called when a dentry should be compared with another
				903
				904	d_delete: called when the last reference to a dentry is
				905	deleted. This means no-one is using the dentry, however it is
				906	still valid and in the dcache
				907
				908	d_release: called when a dentry is really deallocated
				909
				910	d_iput: called when a dentry loses its inode (just prior to its
				911	being deallocated). The default when this is NULL is that the
				912	VFS calls iput(). If you define this method, you must call
				913	iput() yourself
				914
d_dname: called when the pathname of a dentry should be generated.
Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
pathname generation. (Instead of doing it when the dentry is created,
it's done only when the path is needed.) Real filesystems probably
don't want to use it, because their dentries are present in the global
dcache hash, so their hash should be an invariant. As no lock is
held, d_dname() should not try to modify the dentry itself, unless
appropriate SMP safety is used. CAUTION: d_path() logic is quite
tricky. The correct way to return for example "Hello" is to put it
at the end of the buffer, and return a pointer to the first char.
The dynamic_dname() helper function is provided to take care of this.
				926
				927	Example :
				928
static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
	return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
				dentry->d_inode->i_ino);
}
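
To make the "put it at the end of the buffer" rule concrete, here is a
userspace sketch of a helper in the spirit of dynamic_dname(). It is
not the kernel function: model_dynamic_dname is a hypothetical name,
it takes no dentry argument, and it returns NULL when the name does
not fit, which is a choice made for this example only.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats a name and places it, including its terminating NUL, at the
 * very end of buffer. Returns a pointer to the first character, or
 * NULL if the formatted name does not fit.
 */
char *model_dynamic_dname(char *buffer, int buflen, const char *fmt, ...)
{
	char temp[64];
	va_list args;
	int len;
	size_t sz;

	va_start(args, fmt);
	len = vsnprintf(temp, sizeof(temp), fmt, args);
	va_end(args);

	if (len < 0 || buflen <= 0)
		return NULL;
	sz = (size_t)len + 1;		/* name plus terminating NUL */
	if (sz > sizeof(temp) || sz > (size_t)buflen)
		return NULL;

	buffer += (size_t)buflen - sz;	/* start so the name ends the buffer */
	return memcpy(buffer, temp, sz);
}
```

For a 32-byte buffer and the name "pipe:[12345]", the returned pointer
is buffer + 19, so the caller can prepend further path components in
front of it without moving the name.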
				934
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	935	Each dentry has a pointer to its parent dentry, as well as a hash list
				936	of child dentries. Child dentries are basically like files in a
				937	directory.
				938
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	939
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	940	Directory Entry Cache API
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	941	--------------------------
				942
				943	There are a number of functions defined which permit a filesystem to
				944	manipulate dentries:
				945
				946	dget: open a new handle for an existing dentry (this just increments
				947	the usage count)
				948
dput: close a handle for a dentry (decrements the usage count). If
the usage count drops to 0, the "d_delete" method is called
and the dentry is placed on the unused list if the dentry is
still in its parent's hash list. Putting the dentry on the
unused list just means that if the system needs some RAM, it
goes through the unused list of dentries and deallocates them.
If the dentry has already been unhashed and the usage count
drops to 0, the dentry is deallocated after the
"d_delete" method is called

d_drop: this unhashes a dentry from its parent's hash list. A
subsequent call to dput() will deallocate the dentry if its
usage count drops to 0
				962
				963	d_delete: delete a dentry. If there are no other open references to
				964	the dentry then the dentry is turned into a negative dentry
				965	(the d_iput() method is called). If there are other
				966	references, then d_drop() is called instead
				967
d_add: adds a dentry to its parent's hash list and then calls
d_instantiate()
				970
				971	d_instantiate: add a dentry to the alias hash list for the inode and
				972	updates the "d_inode" member. The "i_count" member in the
				973	inode structure should be set/incremented. If the inode
				974	pointer is NULL, the dentry is called a "negative
				975	dentry". This function is commonly called when an inode is
				976	created for an existing negative dentry
				977
d_lookup: look up a dentry given its parent and path name component.
It looks up the child of that given name from the dcache
hash table. If it is found, the reference count is incremented
and the dentry is returned. The caller must use dput()
to free the dentry when it finishes using it.
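
The interplay of dget, dput and d_drop described in this list can be
sketched as a userspace reference-count model. The model_* names are
hypothetical, no memory is really freed, and the unused list is
reduced to a state value; the sketch only tracks what the text says
happens when the usage count drops to 0.

```c
#include <assert.h>
#include <stdbool.h>

enum model_dentry_state { MODEL_IN_USE, MODEL_UNUSED, MODEL_FREED };

struct model_dentry {
	int d_count;			/* usage count */
	bool hashed;			/* still in the parent's hash list? */
	enum model_dentry_state state;
	int d_delete_calls;		/* calls to the "d_delete" method */
};

/* dget: open a new handle, which just increments the usage count. */
struct model_dentry *model_dget(struct model_dentry *d)
{
	d->d_count++;
	d->state = MODEL_IN_USE;
	return d;
}

/* d_drop: unhash the dentry from its parent's hash list. */
void model_d_drop(struct model_dentry *d)
{
	d->hashed = false;
}

/* dput: close a handle and act when the last one goes away. */
void model_dput(struct model_dentry *d)
{
	assert(d->d_count > 0);
	if (--d->d_count > 0)
		return;			/* other handles remain */
	d->d_delete_calls++;		/* the "d_delete" method is called */
	/* Hashed dentries go to the unused list, unhashed ones are freed. */
	d->state = d->hashed ? MODEL_UNUSED : MODEL_FREED;
}
```

A dentry on the unused list can be picked up again with dget, while a
dentry that was dropped first is gone as soon as its last handle is
closed.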
				983
Pekka Enberg	cbf8f0f	2005-11-07 01:01:09 -0800	[diff] [blame]	984	For further information on dentry locking, please refer to the document
				985	Documentation/filesystems/dentry-locking.txt.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	986
				987
				988	Resources
				989	=========
				990
				991	(Note some of these resources are not up-to-date with the latest kernel
				992	version.)
				993
				994	Creating Linux virtual filesystems. 2002
				995	<http://lwn.net/Articles/13325/>
				996
				997	The Linux Virtual File-system Layer by Neil Brown. 1999
				998	<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
				999
				1000	A tour of the Linux VFS by Michael K. Johnson. 1996
				1001	<http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
				1002
				1003	A small trail through the Linux kernel by Andries Brouwer. 2001
				1004	<http://www.win.tue.nl/~aeb/linux/vfs/trail.html>