/* -*- auto-fill -*- */
				2
				3	Overview of the Virtual File System
				4
				5	Richard Gooch <rgooch@atnf.csiro.au>
				6
				7	5-JUL-1999
				8
				9
				10	Conventions used in this document <section>
				11	=================================
				12
				13	Each section in this document will have the string "<section>" at the
				14	right-hand side of the section title. Each subsection will have
				15	"<subsection>" at the right-hand side. These strings are meant to make
				16	it easier to search through the document.
				17
				18	NOTE that the master copy of this document is available online at:
				19	http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
				20
				21
				22	What is it? <section>
				23	===========
				24
				25	The Virtual File System (otherwise known as the Virtual Filesystem
				26	Switch) is the software layer in the kernel that provides the
				27	filesystem interface to userspace programs. It also provides an
				28	abstraction within the kernel which allows different filesystem
				29	implementations to co-exist.
				30
				31
				32	A Quick Look At How It Works <section>
				33	============================
				34
In this section I'll briefly describe how things work, before
launching into the details. I'll start by describing what happens
when user programs open and manipulate files, and then look at it from
the other side: how a filesystem is supported and subsequently
mounted.
				40
				41	Opening a File <subsection>
				42	--------------
				43
				44	The VFS implements the open(2), stat(2), chmod(2) and similar system
				45	calls. The pathname argument is used by the VFS to search through the
				46	directory entry cache (dentry cache or "dcache"). This provides a very
				47	fast look-up mechanism to translate a pathname (filename) into a
				48	specific dentry.
				49
				50	An individual dentry usually has a pointer to an inode. Inodes are the
				51	things that live on disc drives, and can be regular files (you know:
				52	those things that you write data into), directories, FIFOs and other
				53	beasts. Dentries live in RAM and are never saved to disc: they exist
				54	only for performance. Inodes live on disc and are copied into memory
				55	when required. Later any changes are written back to disc. The inode
				56	that lives in RAM is a VFS inode, and it is this which the dentry
				57	points to. A single inode can be pointed to by multiple dentries
				58	(think about hardlinks).
				59
				60	The dcache is meant to be a view into your entire filespace. Unlike
				61	Linus, most of us losers can't fit enough dentries into RAM to cover
				62	all of our filespace, so the dcache has bits missing. In order to
				63	resolve your pathname into a dentry, the VFS may have to resort to
				64	creating dentries along the way, and then loading the inode. This is
				65	done by looking up the inode.
				66
				67	To look up an inode (usually read from disc) requires that the VFS
				68	calls the lookup() method of the parent directory inode. This method
				69	is installed by the specific filesystem implementation that the inode
				70	lives in. There will be more on this later.
				71
				72	Once the VFS has the required dentry (and hence the inode), we can do
				73	all those boring things like open(2) the file, or stat(2) it to peek
				74	at the inode data. The stat(2) operation is fairly simple: once the
				75	VFS has the dentry, it peeks at the inode data and passes some of it
				76	back to userspace.
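
To make the dentry/inode split a little more concrete, here is a small
userspace sketch. It uses only stat(2) to check whether two pathnames
resolve to the same inode, which is what you would expect for two hard
links to one file. The helper name same_inode() is my own invention for
this example and is not part of any kernel or libc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they resolve to different inodes, and -1 if either stat(2) fails.
 * same_inode() is a hypothetical helper written for this document.
 */
int same_inode(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);

	/* Each call makes the VFS walk the path to a dentry, then
	 * copy some of the inode data back to userspace. */
	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;

	/* An inode number is only unique within one filesystem, so the
	 * device number has to match as well. */
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling same_inode() on a file and on a hard link to it should return 1,
even though the two names are backed by two different dentries.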
				77
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialised with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do its work. You
can see that this is another switch performed by the VFS.
				85
				86	The file structure is placed into the file descriptor table for the
				87	process.
				88
Reading, writing and closing files (and other assorted VFS operations)
are done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method
function to do whatever is required.
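
The path from file descriptor to filesystem method can be sketched in
plain userspace C. This is only a toy model of the switch, not kernel
code: every toy_* name below is hypothetical, the descriptor table is a
fixed array, and the only "filesystem" is one that reads from a string
in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-filesystem method table, loosely modelled on struct file_operations. */
struct toy_file_operations {
	long (*read)(struct toy_file *f, char *buf, size_t len);
};

/* The object behind a file descriptor. */
struct toy_file {
	const struct toy_file_operations *f_op;
	const char *data;	/* stands in for the inode contents */
	size_t pos;		/* current file position */
};

#define TOY_MAX_FDS 8
struct toy_file *toy_fd_table[TOY_MAX_FDS];	/* per-process table */

/* A "filesystem" read method: copy bytes out of the in-memory data. */
long toy_ram_read(struct toy_file *f, char *buf, size_t len)
{
	size_t left = strlen(f->data) - f->pos;

	if (len > left)
		len = left;
	memcpy(buf, f->data + f->pos, len);
	f->pos += len;
	return (long)len;
}

const struct toy_file_operations toy_ram_fops = { .read = toy_ram_read };

/* Place a file structure into the first free descriptor slot. */
int toy_fd_install(struct toy_file *f)
{
	for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
		if (toy_fd_table[fd] == NULL) {
			toy_fd_table[fd] = f;
			return fd;
		}
	}
	return -1;
}

/* The switch itself: descriptor -> file structure -> filesystem method. */
long toy_sys_read(int fd, char *buf, size_t len)
{
	struct toy_file *f;

	if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd] == NULL)
		return -1;
	f = toy_fd_table[fd];
	if (f->f_op == NULL || f->f_op->read == NULL)
		return -1;
	return f->f_op->read(f, buf, len);
}
```

The caller of toy_sys_read() never needs to know which filesystem is
behind the descriptor; it only follows the method pointer.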
				93
				94	For as long as the file is open, it keeps the dentry "open" (in use),
				95	which in turn means that the VFS inode is still in use.
				96
All VFS system calls (e.g. open(2), stat(2), read(2), write(2),
chmod(2) and so on) are called from a process context. You should
assume that these calls are made without any kernel locks being
held. This means that several processes may be executing the same piece
of filesystem or driver code at the same time, on different
processors. You should ensure that access to shared resources is
protected by appropriate locks.
				104
				105	Registering and Mounting a Filesystem <subsection>
				106	-------------------------------------
				107
				108	If you want to support a new kind of filesystem in the kernel, all you
				109	need to do is call register_filesystem(). You pass a structure
				110	describing the filesystem implementation (struct file_system_type)
				111	which is then added to an internal table of supported filesystems. You
				112	can do:
				113
				114	% cat /proc/filesystems
				115
				116	to see what filesystems are currently available on your system.
				117
				118	When a request is made to mount a block device onto a directory in
				119	your filespace the VFS will call the appropriate method for the
				120	specific filesystem. The dentry for the mount point will then be
				121	updated to point to the root inode for the new filesystem.
				122
				123	It's now time to look at things in more detail.
				124
				125
				126	struct file_system_type <section>
				127	=======================
				128
				129	This describes the filesystem. As of kernel 2.1.99, the following
				130	members are defined:
				131
struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*read_super) (struct super_block *, void *, int);
	struct file_system_type * next;
};
				138
				139	name: the name of the filesystem type, such as "ext2", "iso9660",
				140	"msdos" and so on
				141
fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
				143
				144	read_super: the method to call when a new instance of this
				145	filesystem should be mounted
				146
				147	next: for internal VFS use: you should initialise this to NULL
				148
				149	The read_super() method has the following arguments:
				150
				151	struct super_block *sb: the superblock structure. This is partially
				152	initialised by the VFS and the rest must be initialised by the
				153	read_super() method
				154
				155	void *data: arbitrary mount options, usually comes as an ASCII
				156	string
				157
				158	int silent: whether or not to be silent on error
				159
				160	The read_super() method must determine if the block device specified
				161	in the superblock contains a filesystem of the type the method
				162	supports. On success the method returns the superblock pointer, on
				163	failure it returns NULL.
				164
				165	The most interesting member of the superblock structure that the
				166	read_super() method fills in is the "s_op" field. This is a pointer to
				167	a "struct super_operations" which describes the next level of the
				168	filesystem implementation.
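
To tie registration and read_super() together, here is a userspace
sketch of a filesystem type table. It is a simplified model under my
own assumptions: the toy_* names are hypothetical, the "device" is just
a magic string inside the superblock, and the error handling is far
cruder than in the real register_filesystem().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_block {
	const char *dev_magic;	/* stands in for the on-disc contents */
	const void *s_op;	/* would point at a super_operations table */
};

struct toy_fs_type {
	const char *name;
	struct toy_super_block *(*read_super)(struct toy_super_block *,
					      void *data, int silent);
	struct toy_fs_type *next;	/* internal link, initialise to NULL */
};

struct toy_fs_type *toy_fs_list;	/* table of supported filesystems */

/* Append a type to the table; refuse entries that are already linked
 * in or whose name is already taken. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	if (fs->next != NULL)
		return -1;
	for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
		if (strcmp((*p)->name, fs->name) == 0)
			return -1;
	*p = fs;
	return 0;
}

/* Find the type by name and let its read_super() probe the device. */
struct toy_super_block *toy_mount(const char *type, struct toy_super_block *sb)
{
	struct toy_fs_type *fs;

	for (fs = toy_fs_list; fs != NULL; fs = fs->next)
		if (strcmp(fs->name, type) == 0)
			return fs->read_super(sb, NULL, 1);
	return NULL;	/* unknown filesystem type */
}

/* Example read_super(): accept the device only if the magic matches,
 * then fill in s_op. Returns the superblock on success, NULL on failure. */
struct toy_super_block *toyfs_read_super(struct toy_super_block *sb,
					 void *data, int silent)
{
	static const int toyfs_sops = 0;	/* placeholder method table */

	(void)data;
	(void)silent;
	if (strcmp(sb->dev_magic, "TOYFS") != 0)
		return NULL;
	sb->s_op = &toyfs_sops;
	return sb;
}
```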
				169
				170
				171	struct super_operations <section>
				172	=======================
				173
				174	This describes how the VFS can manipulate the superblock of your
				175	filesystem. As of kernel 2.1.99, the following members are defined:
				176
struct super_operations {
	void (*read_inode) (struct inode *);
	int (*write_inode) (struct inode *, int);
	void (*put_inode) (struct inode *);
	void (*drop_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	int (*notify_change) (struct dentry *, struct iattr *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*statfs) (struct super_block *, struct statfs *, int);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
};
				190
				191	All methods are called without any locks being held, unless otherwise
				192	noted. This means that most methods can block safely. All methods are
				193	only called from a process context (i.e. not from an interrupt handler
				194	or bottom half).
				195
				196	read_inode: this method is called to read a specific inode from the
				197	mounted filesystem. The "i_ino" member in the "struct inode"
				198	will be initialised by the VFS to indicate which inode to
				199	read. Other members are filled in by this method
				200
write_inode: this method is called when the VFS needs to write an
inode to disc. The second parameter indicates whether the write
should be synchronous or not. Not all filesystems check this flag.
				204
				205	put_inode: called when the VFS inode is removed from the inode
				206	cache. This method is optional
				207
				208	drop_inode: called when the last access to the inode is dropped,
				209	with the inode_lock spinlock held.
				210
				211	This method should be either NULL (normal unix filesystem
				212	semantics) or "generic_delete_inode" (for filesystems that do not
				213	want to cache inodes - causing "delete_inode" to always be
				214	called regardless of the value of i_nlink)
				215
				216	The "generic_delete_inode()" behaviour is equivalent to the
				217	old practice of using "force_delete" in the put_inode() case,
				218	but does not have the races that the "force_delete()" approach
				219	had.
				220
				221	delete_inode: called when the VFS wants to delete an inode
				222
				223	notify_change: called when VFS inode attributes are changed. If this
				224	is NULL the VFS falls back to the write_inode() method. This
				225	is called with the kernel lock held
				226
				227	put_super: called when the VFS wishes to free the superblock
				228	(i.e. unmount). This is called with the superblock lock held
				229
				230	write_super: called when the VFS superblock needs to be written to
				231	disc. This method is optional
				232
				233	statfs: called when the VFS needs to get filesystem statistics. This
				234	is called with the kernel lock held
				235
				236	remount_fs: called when the filesystem is remounted. This is called
				237	with the kernel lock held
				238
clear_inode: called when the VFS clears the inode. Optional
				240
				241	The read_inode() method is responsible for filling in the "i_op"
				242	field. This is a pointer to a "struct inode_operations" which
				243	describes the methods that can be performed on individual inodes.
				244
				245
				246	struct inode_operations <section>
				247	=======================
				248
				249	This describes how the VFS can manipulate an inode in your
				250	filesystem. As of kernel 2.1.99, the following members are defined:
				251
struct inode_operations {
	struct file_operations * default_file_ops;
	int (*create) (struct inode *,struct dentry *,int);
	int (*lookup) (struct inode *,struct dentry *);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,int);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	int (*readlink) (struct dentry *, char *,int);
	struct dentry * (*follow_link) (struct dentry *, struct dentry *);
	int (*readpage) (struct file *, struct page *);
	int (*writepage) (struct page *page, struct writeback_control *wbc);
	int (*bmap) (struct inode *,int);
	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int);
	int (*smap) (struct inode *,int);
	int (*updatepage) (struct file *, struct page *, const char *,
				unsigned long, unsigned int, int);
	int (*revalidate) (struct dentry *);
};
				276
				277	Again, all methods are called without any locks being held, unless
				278	otherwise noted.
				279
				280	default_file_ops: this is a pointer to a "struct file_operations"
				281	which describes how to open and then manipulate open files
				282
				283	create: called by the open(2) and creat(2) system calls. Only
				284	required if you want to support regular files. The dentry you
				285	get should not have an inode (i.e. it should be a negative
				286	dentry). Here you will probably call d_instantiate() with the
				287	dentry and the newly created inode
				288
lookup: called when the VFS needs to look up an inode in a parent
directory. The name to look for is found in the dentry. This
method must call d_add() to insert the found inode into the
dentry. The "i_count" field in the inode structure should be
incremented. If the named inode does not exist a NULL inode
should be inserted into the dentry (this is called a negative
dentry). Returning an error code from this routine must only
be done on a real error, otherwise creating inodes with system
calls like creat(2), mknod(2), mkdir(2) and so on will fail.
If you wish to overload the dentry methods then you should
initialise the "d_op" field in the dentry; this is a pointer
to a "struct dentry_operations".
This method is called with the directory inode semaphore held
				302
				303	link: called by the link(2) system call. Only required if you want
				304	to support hard links. You will probably need to call
				305	d_instantiate() just as you would in the create() method
				306
				307	unlink: called by the unlink(2) system call. Only required if you
				308	want to support deleting inodes
				309
				310	symlink: called by the symlink(2) system call. Only required if you
				311	want to support symlinks. You will probably need to call
				312	d_instantiate() just as you would in the create() method
				313
				314	mkdir: called by the mkdir(2) system call. Only required if you want
				315	to support creating subdirectories. You will probably need to
				316	call d_instantiate() just as you would in the create() method
				317
				318	rmdir: called by the rmdir(2) system call. Only required if you want
				319	to support deleting subdirectories
				320
				321	mknod: called by the mknod(2) system call to create a device (char,
				322	block) inode or a named pipe (FIFO) or socket. Only required
				323	if you want to support creating these types of inodes. You
				324	will probably need to call d_instantiate() just as you would
				325	in the create() method
				326
				327	readlink: called by the readlink(2) system call. Only required if
				328	you want to support reading symbolic links
				329
				330	follow_link: called by the VFS to follow a symbolic link to the
				331	inode it points to. Only required if you want to support
				332	symbolic links
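
The rule for lookup() described above (a missing name is not an error,
it produces a negative dentry) is easy to get wrong, so here is a small
userspace model of it. All toy_* names are hypothetical, the directory
contains exactly one entry called "README", and toy_d_add() only mimics
the effect of the real d_add().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_inode {
	unsigned long i_ino;
	int i_count;			/* usage count */
};

struct toy_dentry {
	const char *d_name;		/* the name to look for */
	struct toy_inode *d_inode;	/* NULL means a negative dentry */
	int d_hashed;			/* set once toy_d_add() has run */
};

/* Model of d_add(): hash the dentry and attach the (possibly NULL) inode. */
void toy_d_add(struct toy_dentry *d, struct toy_inode *inode)
{
	d->d_inode = inode;
	d->d_hashed = 1;
}

/* The only inode in this toy directory. */
struct toy_inode toy_readme = { .i_ino = 42, .i_count = 0 };

/* Model of a lookup() method. Returns 0 both when the name is found and
 * when it is not; a negative value is reserved for real errors. */
int toy_lookup(struct toy_dentry *d)
{
	if (d == NULL || d->d_name == NULL)
		return -1;			/* a real error */

	if (strcmp(d->d_name, "README") == 0) {
		toy_readme.i_count++;		/* the dentry holds a reference */
		toy_d_add(d, &toy_readme);
	} else {
		toy_d_add(d, NULL);		/* negative dentry, not an error */
	}
	return 0;
}
```

Because the miss still returns 0 and leaves a hashed negative dentry
behind, a later create() on the same name has a dentry to instantiate.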
				333
				334
				335	struct file_operations <section>
				336	======================
				337
				338	This describes how the VFS can manipulate an open file. As of kernel
				339	2.1.99, the following members are defined:
				340
struct file_operations {
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *);
	int (*fasync) (struct file *, int);
	int (*check_media_change) (kdev_t dev);
	int (*revalidate) (kdev_t dev);
	int (*lock) (struct file *, int, struct file_lock *);
};
				357
				358	Again, all methods are called without any locks being held, unless
				359	otherwise noted.
				360
				361	llseek: called when the VFS needs to move the file position index
				362
				363	read: called by read(2) and related system calls
				364
				365	write: called by write(2) and related system calls
				366
				367	readdir: called when the VFS needs to read the directory contents
				368
				369	poll: called by the VFS when a process wants to check if there is
				370	activity on this file and (optionally) go to sleep until there
				371	is activity. Called by the select(2) and poll(2) system calls
				372
				373	ioctl: called by the ioctl(2) system call
				374
				375	mmap: called by the mmap(2) system call
				376
				377	open: called by the VFS when an inode should be opened. When the VFS
				378	opens a file, it creates a new "struct file" and initialises
				379	the "f_op" file operations member with the "default_file_ops"
				380	field in the inode structure. It then calls the open method
				381	for the newly allocated file structure. You might think that
				382	the open method really belongs in "struct inode_operations",
				383	and you may be right. I think it's done the way it is because
				384	it makes filesystems simpler to implement. The open() method
				385	is a good place to initialise the "private_data" member in the
				386	file structure if you want to point to a device structure
				387
				388	release: called when the last reference to an open file is closed
				389
				390	fsync: called by the fsync(2) system call
				391
				392	fasync: called by the fcntl(2) system call when asynchronous
				393	(non-blocking) mode is enabled for a file
				394
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
support routines in the VFS which will locate the required device
driver information. These support routines replace the filesystem file
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method. Note that devfs (the Device FileSystem) has a more direct path
from device node to device driver (this is an unofficial kernel
patch).
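
The trick of replacing the file operations while opening a device node
can be modelled in a few lines of userspace C. This is a sketch under
my own assumptions, not the kernel's implementation: the toy_* names
are hypothetical, and the driver table is a small array indexed by a
made-up device number.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

struct toy_file_operations {
	int (*open)(struct toy_file *f);
};

struct toy_file {
	const struct toy_file_operations *f_op;
	unsigned int rdev;	/* device number taken from the inode */
	void *private_data;	/* for the driver's own use */
};

/* Driver side: open() points private_data at the device structure. */
int toy_drv_state = 7;

int toy_drv_open(struct toy_file *f)
{
	f->private_data = &toy_drv_state;
	return 0;
}

const struct toy_file_operations toy_drv_fops = { .open = toy_drv_open };

/* Table of registered drivers, indexed by device number. */
#define TOY_MAX_DEV 4
const struct toy_file_operations *toy_drivers[TOY_MAX_DEV] = {
	[1] = &toy_drv_fops,
};

/* Support routine: swap in the driver's methods, then call the new open(). */
int toy_chrdev_open(struct toy_file *f)
{
	if (f->rdev >= TOY_MAX_DEV || toy_drivers[f->rdev] == NULL)
		return -1;		/* no driver for this device */
	f->f_op = toy_drivers[f->rdev];
	return f->f_op->open ? f->f_op->open(f) : 0;
}

/* What the filesystem installs for every device node it contains. */
const struct toy_file_operations toy_def_chr_fops = { .open = toy_chrdev_open };
```

After the first open() call returns, every later operation on the file
goes straight to the driver, because f_op no longer points at the
filesystem's table.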
				406
				407
				408	Directory Entry Cache (dcache) <section>
				409	------------------------------
				410
				411	struct dentry_operations
				412	========================
				413
				414	This describes how a filesystem can overload the standard dentry
				415	operations. Dentries and the dcache are the domain of the VFS and the
				416	individual filesystem implementations. Device drivers have no business
				417	here. These methods may be set to NULL, as they are either optional or
				418	the VFS uses a default. As of kernel 2.1.99, the following members are
				419	defined:
				420
struct dentry_operations {
	int (*d_revalidate)(struct dentry *);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	void (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
};
				429
				430	d_revalidate: called when the VFS needs to revalidate a dentry. This
				431	is called whenever a name look-up finds a dentry in the
				432	dcache. Most filesystems leave this as NULL, because all their
				433	dentries in the dcache are valid
				434
				435	d_hash: called when the VFS adds a dentry to the hash table
				436
				437	d_compare: called when a dentry should be compared with another
				438
				439	d_delete: called when the last reference to a dentry is
				440	deleted. This means no-one is using the dentry, however it is
				441	still valid and in the dcache
				442
				443	d_release: called when a dentry is really deallocated
				444
				445	d_iput: called when a dentry loses its inode (just prior to its
				446	being deallocated). The default when this is NULL is that the
				447	VFS calls iput(). If you define this method, you must call
				448	iput() yourself
				449
				450	Each dentry has a pointer to its parent dentry, as well as a hash list
				451	of child dentries. Child dentries are basically like files in a
				452	directory.
				453
				454	Directory Entry Cache APIs
				455	--------------------------
				456
				457	There are a number of functions defined which permit a filesystem to
				458	manipulate dentries:
				459
				460	dget: open a new handle for an existing dentry (this just increments
				461	the usage count)
				462
dput: close a handle for a dentry (decrements the usage count). If
the usage count drops to 0, the "d_delete" method is called
and the dentry is placed on the unused list if the dentry is
still in its parent's hash list. Putting the dentry on the
unused list just means that if the system needs some RAM, it
goes through the unused list of dentries and deallocates them.
If the dentry has already been unhashed and the usage count
drops to 0, the dentry is deallocated after the
"d_delete" method is called
				472
d_drop: this unhashes a dentry from its parent's hash list. A
subsequent call to dput() will deallocate the dentry if its
usage count drops to 0
				476
				477	d_delete: delete a dentry. If there are no other open references to
				478	the dentry then the dentry is turned into a negative dentry
				479	(the d_iput() method is called). If there are other
				480	references, then d_drop() is called instead
				481
d_add: add a dentry to its parent's hash list and then call
d_instantiate()
				484
d_instantiate: add a dentry to the alias hash list for the inode and
update the "d_inode" member. The "i_count" member in the
inode structure should be set/incremented. If the inode
pointer is NULL, the dentry is called a "negative
dentry". This function is commonly called when an inode is
created for an existing negative dentry
				491
d_lookup: look up a dentry given its parent and path name component.
It looks up the child of that given name from the dcache
hash table. If it is found, the reference count is incremented
and the dentry is returned. The caller must use dput()
to free the dentry when it finishes using it.
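
The usage counting rules for dget(), dput() and d_drop() described
above can be summarised in a toy userspace model. The toy_* names and
the flag fields are hypothetical: the real dcache uses lists and a
slab cache where this sketch just sets a flag.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
	int d_count;		/* usage count */
	int d_hashed;		/* still in the parent's hash list? */
	int on_unused_list;	/* parked until the system needs RAM */
	int freed;		/* stands in for actual deallocation */
	void (*d_delete)(struct toy_dentry *);	/* optional method */
};

/* Open a new handle: this just increments the usage count. */
struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	d->d_count++;
	return d;
}

/* Unhash the dentry from its parent's hash list. */
void toy_d_drop(struct toy_dentry *d)
{
	d->d_hashed = 0;
}

/* Close a handle. On the last reference, call d_delete and then either
 * park the dentry (still hashed) or deallocate it (already unhashed). */
void toy_dput(struct toy_dentry *d)
{
	assert(d->d_count > 0);
	if (--d->d_count > 0)
		return;
	if (d->d_delete)
		d->d_delete(d);
	if (d->d_hashed)
		d->on_unused_list = 1;
	else
		d->freed = 1;
}
```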
				497
				498
				499	RCU-based dcache locking model
				500	------------------------------
				501
				502	On many workloads, the most common operation on dcache is
				503	to look up a dentry, given a parent dentry and the name
				504	of the child. Typically, for every open(), stat() etc.,
				505	the dentry corresponding to the pathname will be looked
				506	up by walking the tree starting with the first component
				507	of the pathname and using that dentry along with the next
				508	component to look up the next level and so on. Since it
				509	is a frequent operation for workloads like multiuser
				510	environments and webservers, it is important to optimize
				511	this path.
				512
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
for every component during path look-up. From 2.5.10 onwards, the
fastwalk algorithm changed this by holding the dcache_lock
at the beginning and walking as many cached path component
dentries as possible. This significantly decreases the number
of acquisitions of dcache_lock. However, it also increases the
lock hold time significantly and affects performance on large
SMP machines. Since the 2.5.62 kernel, dcache has been using
a new locking model that uses RCU to make dcache look-up
lock-free.
				523
The current dcache locking model is not very different from the
previous one. Prior to the 2.5.62 kernel, dcache_lock
protected the hash chain, the d_child, d_alias and d_lru lists as well
as d_inode and several other things like mount look-up. The RCU-based
changes affect only the way the hash chain is protected. For everything
else the dcache_lock must be taken for both traversing as well as
updating. Hash chain updates take the dcache_lock too.
The significant change is the way d_lookup traverses the hash chain:
it doesn't acquire the dcache_lock for this and relies on RCU to
ensure that the dentry has not been freed.
				534
				535
				536	Dcache locking details
				537	----------------------
For many multi-user workloads, open() and stat() on files are
very frequently occurring operations. Both involve walking
path names to find the dentry corresponding to the
file concerned. In the 2.4 kernel, dcache_lock was held
during look-up of each path component. Contention and
cacheline bouncing of this global lock caused significant
scalability problems. With the introduction of RCU
in the Linux kernel, this was worked around by making
the look-up of path components during path walking lock-free.
				547
				548
				549	Safe lock-free look-up of dcache hash table
				550	===========================================
				551
Dcache is a complex data structure with the hash table entries
also linked together in other lists. In the 2.4 kernel, dcache_lock
protected all the lists. We applied RCU only to hash chain
walking. The rest of the lists are still protected by dcache_lock.
Some of the important changes are:
				557
				558	1. The deletion from hash chain is done using hlist_del_rcu() macro which
				559	doesn't initialize next pointer of the deleted dentry and this
				560	allows us to walk safely lock-free while a deletion is happening.
				561
2. Insertion of a dentry into the hash table is done using
hlist_add_head_rcu(), which takes care of ordering the writes:
the writes to the dentry must be visible before the dentry
is inserted. This works in conjunction with hlist_for_each_rcu()
while walking the hash chain. The only requirement is that
all initialization of the dentry must be done before hlist_add_head_rcu()
since we don't have dcache_lock protection while traversing
the hash chain. This isn't different from the existing code.
				570
3. A dentry looked up without holding dcache_lock cannot be
returned for walking if it is unhashed. It may then have a NULL
d_inode or other bogosity since RCU doesn't protect the other
fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
indicate unhashed dentries and use this in conjunction with a
per-dentry lock (d_lock). Once looked up without the dcache_lock,
we acquire the per-dentry lock (d_lock) and check if the
dentry is unhashed. If so, the look-up fails. If not, the
reference count of the dentry is increased and the dentry is returned.
				580
				581	4. Once a dentry is looked up, it must be ensured during the path
				582	walk for that component it doesn't go away. In pre-2.5.10 code,
				583	this was done holding a reference to the dentry. dcache_rcu does
				584	the same. In some sense, dcache_rcu path walking looks like
				585	the pre-2.5.10 version.
				586
5. All dentry hash chain updates must take the dcache_lock as well as
the per-dentry lock, in that order. dput() does this to ensure
that a dentry that has just been looked up on another CPU
doesn't get deleted before dget() can be done on it.
				591
6. There are several ways to do reference counting of RCU protected
objects. One such example is in the ipv4 route cache, where
deferred freeing (using call_rcu()) is done as soon as
the reference count goes to zero. This cannot be done in
the case of dentries because tearing down a dentry
requires blocking (dentry_iput()), which isn't supported from
RCU callbacks. Instead, tearing down of dentries happens
synchronously in dput(), but actual freeing happens later
when the RCU grace period is over. This allows safe lock-free
walking of the hash chains, but a matched dentry may have
been partially torn down. Checking the DCACHE_UNHASHED
flag with d_lock held detects such dentries and prevents
them from being returned from look-up.
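
The look-up rule from points 3 and 6 (walk the chain without the
global lock, then confirm under d_lock that the match is still hashed)
can be sketched in userspace C. This is only a model: the toy_* names
are hypothetical, the spinlock is a bare C11 atomic_flag, and the RCU
read-side primitives that make the unlocked walk safe in the kernel
are left out completely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TOY_DCACHE_UNHASHED 0x1

struct toy_dentry {
	struct toy_dentry *next;	/* hash chain link */
	const char *name;
	unsigned int d_flags;		/* protected by d_lock */
	int d_count;			/* protected by d_lock */
	atomic_flag d_lock;		/* toy per-dentry spinlock */
};

void toy_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		;	/* spin until the holder clears the flag */
}

void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* Walk the chain without any global lock. A name match is only
 * returned if, under d_lock, the dentry turns out to be still hashed. */
struct toy_dentry *toy_d_lookup(struct toy_dentry *chain, const char *name)
{
	struct toy_dentry *d;

	for (d = chain; d != NULL; d = d->next) {
		if (strcmp(d->name, name) != 0)
			continue;

		toy_lock(&d->d_lock);
		if (d->d_flags & TOY_DCACHE_UNHASHED) {
			/* Possibly half torn down: the look-up fails. */
			toy_unlock(&d->d_lock);
			return NULL;
		}
		d->d_count++;		/* take a reference for the caller */
		toy_unlock(&d->d_lock);
		return d;
	}
	return NULL;
}
```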
				605
				606
				607	Maintaining POSIX rename semantics
				608	==================================
				609
				610	Since look-up of dentries is lock-free, it can race against
				611	a concurrent rename operation. For example, during rename
				612	of file A to B, look-up of either A or B must succeed.
				613	So, if look-up of B happens after A has been removed from the
				614	hash chain but not added to the new hash chain, it may fail.
				615	Also, a comparison while the name is being written concurrently
				616	by a rename may result in false positive matches violating
				617	rename semantics. Issues related to race with rename are
				618	handled as described below :
				619
				620	1. Look-up can be done in two ways - d_lookup() which is safe
				621	from simultaneous renames and __d_lookup() which is not.
				622	If __d_lookup() fails, it must be followed up by a d_lookup()
				623	to correctly determine whether a dentry is in the hash table
				624	or not. d_lookup() protects look-ups using a sequence
				625	lock (rename_lock).
				626
2. The name associated with a dentry (d_name) may be changed if
a rename is allowed to happen simultaneously. To prevent memcmp()
in __d_lookup() from going out of bounds due to a rename, and to
avoid false positive comparisons, the name comparison is done while
holding the per-dentry lock. This prevents concurrent renames during
this operation.
				633
				634	3. Hash table walking during look-up may move to a different bucket as
				635	the current dentry is moved to a different bucket due to rename.
				636	But we use hlists in dcache hash table and they are null-terminated.
				637	So, even if a dentry moves to a different bucket, hash chain
				638	walk will terminate. [with a list_head list, it may not since
				639	termination is when the list_head in the original bucket is reached].
				640	Since we redo the d_parent check and compare name while holding
				641	d_lock, lock-free look-up will not race against d_move().
				642
4. There can be a theoretical race when a dentry keeps coming back
to its original bucket due to double moves. Because of this, look-up may
consider that it has never moved and can end up in an infinite loop.
But this is not any worse than the theoretical livelocks we already
have in the kernel.
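
The retry described in point 1 (a failed fast look-up is only trusted
if no rename ran at the same time) follows the usual sequence-lock
pattern. Here is a userspace sketch of that pattern. It is a model
under my own assumptions: the toy_* names are hypothetical, the counter
is a plain C11 atomic rather than the kernel's seqlock, and
toy_racy_lookup() exists only to simulate a rename that lands in the
middle of the first walk.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy sequence counter: odd while a writer (a rename) is in progress. */
atomic_uint toy_rename_seq;

unsigned int toy_read_seqbegin(void)
{
	unsigned int s;

	while ((s = atomic_load(&toy_rename_seq)) & 1)
		;	/* a rename is running, wait for it to finish */
	return s;
}

/* Nonzero if a rename started or finished since toy_read_seqbegin(). */
int toy_read_seqretry(unsigned int start)
{
	return atomic_load(&toy_rename_seq) != start;
}

void toy_write_seqlock(void)
{
	atomic_fetch_add(&toy_rename_seq, 1);	/* counter becomes odd */
}

void toy_write_sequnlock(void)
{
	atomic_fetch_add(&toy_rename_seq, 1);	/* counter becomes even */
}

/* Rename-safe look-up: a hit is returned at once, a miss is repeated
 * if a rename overlapped with the walk. */
int toy_d_lookup(int (*fast_lookup)(void *), void *arg)
{
	unsigned int seq;
	int found;

	do {
		seq = toy_read_seqbegin();
		found = fast_lookup(arg);
		if (found)
			break;
	} while (toy_read_seqretry(seq));
	return found;
}

/* Demo look-up: the first call races with a rename and misses,
 * later calls find the dentry. arg counts the calls made so far. */
int toy_racy_lookup(void *arg)
{
	int *calls = arg;

	if ((*calls)++ == 0) {
		toy_write_seqlock();	/* a rename starts ... */
		toy_write_sequnlock();	/* ... and completes mid-walk */
		return 0;
	}
	return 1;
}
```

A miss is therefore only reported to the caller when the sequence
counter did not move during the walk.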
				648
				649
				650	Important guidelines for filesystem developers related to dcache_rcu
				651	====================================================================
				652
1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
don't change. Only the dcache internal implementation changes. However,
filesystems must not delete from the dentry hash chains directly
using the list macros, as was allowed earlier. They must use dcache
APIs like d_drop() or __d_drop() depending on the situation.
				658
				659	2. d_flags is now protected by a per-dentry lock (d_lock). All
				660	access to d_flags must be protected by it.
				661
				662	3. For a hashed dentry, checking of d_count needs to be protected
				663	by d_lock.
				664
				665
				666	Papers and other documentation on dcache locking
				667	================================================
				668
				669	1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
				670
				671	2. http://lse.sourceforge.net/locking/dcache/dcache.html