Changes since 2.5.0:

---
[recommended]

New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
sb_set_blocksize() and sb_min_blocksize().

Use them.

(sb_find_get_block() replaces 2.4's get_hash_table())

---
[recommended]

New methods: ->alloc_inode() and ->destroy_inode().

Remove inode->u.foo_inode_i.
Declare
	struct foo_inode_info {
		/* fs-private stuff */
		struct inode vfs_inode;
	};
	static inline struct foo_inode_info *FOO_I(struct inode *inode)
	{
		return list_entry(inode, struct foo_inode_info, vfs_inode);
	}

Use FOO_I(inode) instead of &inode->u.foo_inode_i.

Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
foo_inode_info and return the address of ->vfs_inode, the latter should free
FOO_I(inode) (see in-tree filesystems for examples).

Make them ->alloc_inode and ->destroy_inode in your super_operations.

Keep in mind that you now need explicit initialization of private data,
typically between calling iget_locked() and unlocking the inode.

At some point that will become mandatory.

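A userspace sketch of the FOO_I() embedding follows. It assumes a trimmed stand-in for struct inode and a local container_of(), which computes the same thing as list_entry(). The real ->alloc_inode() takes a struct super_block * and usually allocates from a slab cache. Here malloc()/free() stand in for that, and foo_flags is a hypothetical private field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Trimmed stand-in for struct inode (userspace model only). */
struct inode {
	unsigned long i_ino;
};

/* Same pointer arithmetic as list_entry()/container_of() in the kernel. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo_inode_info {
	unsigned int foo_flags;		/* fs-private stuff (hypothetical) */
	struct inode vfs_inode;		/* embedded VFS inode */
};

static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return container_of(inode, struct foo_inode_info, vfs_inode);
}

/* Model of ->alloc_inode(): allocate the wrapper, hand back &->vfs_inode. */
static struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = malloc(sizeof(*fi));

	if (!fi)
		return NULL;
	fi->foo_flags = 0;		/* private data needs explicit init */
	return &fi->vfs_inode;
}

/* Model of ->destroy_inode(): free the whole wrapper, not the inode. */
static void foo_destroy_inode(struct inode *inode)
{
	free(FOO_I(inode));
}
```

The design hinges on what ->destroy_inode() frees. It must free the containing foo_inode_info rather than the embedded inode, because the wrapper is the pointer the allocator actually handed out.
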
---
[mandatory]

Change of file_system_type method (->read_super to ->get_sb)

->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.

Turn your foo_read_super() into a function that returns 0 on success and a
negative number on error (-EINVAL unless you have a more informative error
value to report). Call it foo_fill_super(). Now declare

	int foo_get_sb(struct file_system_type *fs_type,
		int flags, const char *dev_name, void *data, struct vfsmount *mnt)
	{
		return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
				   mnt);
	}

(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
filesystem).

Replace DECLARE_FSTYPE... with an explicit initializer and set ->get_sb to
foo_get_sb.

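Below is a userspace model of the fill_super return convention. It assumes a trimmed struct super_block and a hypothetical on-disk magic number passed in as mount data. model_get_sb() is not a kernel function. It only mimics what get_sb_bdev()/get_sb_nodev() do with the callback: set up a superblock, call fill_super, and discard the superblock if it returns a negative error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Trimmed stand-in for struct super_block (userspace model only). */
struct super_block {
	unsigned long s_magic;
};

#define FOO_MAGIC 0x466f6f21UL	/* hypothetical on-disk magic */

/* Returns 0 on success, a negative errno on failure. */
static int foo_fill_super(struct super_block *sb, void *data, int silent)
{
	const unsigned long *disk_magic = data;	/* hypothetical "disk" */

	(void)silent;			/* real code uses it to quieten printk()s */
	if (!disk_magic || *disk_magic != FOO_MAGIC)
		return -EINVAL;		/* nothing more informative to report */
	sb->s_magic = FOO_MAGIC;
	return 0;
}

/* Rough model of a get_sb_* helper driving the fill_super callback. */
static int model_get_sb(int (*fill_super)(struct super_block *, void *, int),
			void *data, struct super_block **out)
{
	struct super_block *sb = calloc(1, sizeof(*sb));
	int err;

	if (!sb)
		return -ENOMEM;
	err = fill_super(sb, data, 0);
	if (err) {
		free(sb);		/* failed superblock is thrown away */
		return err;
	}
	*out = sb;
	return 0;
}
```
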
---
[mandatory]

Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
Most likely there is no need to change anything, but if you relied on
global exclusion between renames for some internal purpose - you need to
change your internal locking. Otherwise exclusion guarantees remain the
same (i.e. parents and victim are locked, etc.).

---
[informational]

Now we have exclusion between ->lookup() and directory removal (by
->rmdir() and ->rename()). If you used to need that exclusion and did
it by internal locking (most filesystems couldn't care less) - you
can relax your locking.

---
[mandatory]

->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
- that will guarantee the same locking you used to have. If your method or its
parts do not need BKL - better yet, now you can shift lock_kernel() and
unlock_kernel() so that they would protect exactly what needs to be
protected.

---
[mandatory]

BKL is also moved from around sb operations. ->write_super() is now called
without BKL held. BKL should have been shifted into individual fs sb_op
functions. If you don't need it, remove it.

---
[informational]

The check for ->link() target not being a directory is done by callers. Feel
free to drop it...

---
[informational]

->link() callers hold ->i_mutex on the object we are linking to. Some of your
problems might be over...

---
[mandatory]

New file_system_type method - kill_sb(superblock). If you are converting
an existing filesystem, set it according to ->fs_flags:
	FS_REQUIRES_DEV		-	kill_block_super
	FS_LITTER		-	kill_litter_super
	neither			-	kill_anon_super
FS_LITTER is gone - just remove it from fs_flags.

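The kill_sb table above can be written out as a small chooser. The MODEL_FS_* bit values are hypothetical stand-ins for the real fs_flags constants in include/linux/fs.h. The sketch also assumes a filesystem never set both FS_REQUIRES_DEV and FS_LITTER; if one did, the chooser simply prefers kill_block_super.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit values, not the real ones from include/linux/fs.h. */
#define MODEL_FS_REQUIRES_DEV	0x1
#define MODEL_FS_LITTER		0x2

/* Name of the ->kill_sb helper for a filesystem being converted. */
static const char *pick_kill_sb(unsigned int old_fs_flags)
{
	if (old_fs_flags & MODEL_FS_REQUIRES_DEV)
		return "kill_block_super";
	if (old_fs_flags & MODEL_FS_LITTER)
		return "kill_litter_super";
	return "kill_anon_super";
}

/* FS_LITTER itself is gone: drop it from the flags you keep. */
static unsigned int convert_fs_flags(unsigned int old_fs_flags)
{
	return old_fs_flags & ~MODEL_FS_LITTER;
}
```
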
---
[mandatory]

FS_SINGLE is gone (actually, that had happened back when ->get_sb()
went in - and hadn't been documented ;-/). Just remove it from fs_flags
(and see ->get_sb() entry for other actions).

---
[mandatory]

->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so
watch for ->i_mutex-grabbing code that might be used by your ->setattr().
Callers of notify_change() need ->i_mutex now.

---
[recommended]

New super_block field "struct export_operations *s_export_op" for
explicit support for exporting, e.g. via NFS. The structure is fully
documented at its declaration in include/linux/fs.h, and in
Documentation/filesystems/nfs/Exporting.

Briefly, it allows for the definition of decode_fh and encode_fh operations
to encode and decode filehandles, and allows the filesystem to use
a standard helper function for decode_fh, and provide filesystem-specific
support for this helper, particularly get_parent.

It is planned that this will be required for exporting once the code
settles down a bit.

---
[mandatory]

s_export_op is now required for exporting a filesystem.
isofs, ext2, ext3, reiserfs, fat
can be used as examples of very different filesystems.

---
[mandatory]

iget4() and the read_inode2 callback have been superseded by iget5_locked(),
which has the following prototype:

	struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
				   int (*test)(struct inode *, void *),
				   int (*set)(struct inode *, void *),
				   void *data);

'test' is an additional function that can be used when the inode
number is not sufficient to identify the actual file object. 'set'
should be a non-blocking function that initializes those parts of a
newly created inode that allow the test function to succeed. 'data' is
passed as an opaque value to both the test and set functions.

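Here is a toy model of how 'test', 'set' and 'data' cooperate. It assumes a fixed-size array in place of the inode hash, plus a hypothetical 64-bit object id (foo_objid) that the hash value alone cannot identify. model_iget5_locked() is illustrative only. It matches on the hash plus test(). On a miss, it calls set() on a fresh inode and returns it with a stand-in I_NEW bit.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_I_NEW	0x1	/* stand-in for I_NEW */
#define CACHE_SLOTS	8

/* Trimmed stand-in for struct inode. */
struct inode {
	unsigned long i_ino;
	unsigned long i_state;
	unsigned long long foo_objid;	/* hypothetical fs-private key */
	unsigned long model_hash;	/* model only: value we were hashed by */
	int in_use;			/* model only: slot occupied */
};

static struct inode cache[CACHE_SLOTS];	/* toy replacement for the inode hash */

/* 'test': does this cached inode refer to the object in 'data'? */
static int foo_test(struct inode *inode, void *data)
{
	return inode->foo_objid == *(unsigned long long *)data;
}

/* 'set': non-blocking; fill in what foo_test() needs (and i_ino). */
static int foo_set(struct inode *inode, void *data)
{
	inode->foo_objid = *(unsigned long long *)data;
	inode->i_ino = (unsigned long)inode->foo_objid;	/* fs's job */
	return 0;
}

static struct inode *model_iget5_locked(unsigned long hashval,
		int (*test)(struct inode *, void *),
		int (*set)(struct inode *, void *), void *data)
{
	struct inode *free_slot = NULL;
	size_t i;

	for (i = 0; i < CACHE_SLOTS; i++) {
		struct inode *inode = &cache[i];

		if (!inode->in_use) {
			if (!free_slot)
				free_slot = inode;
			continue;
		}
		if (inode->model_hash == hashval && test(inode, data))
			return inode;		/* cache hit: no I_NEW */
	}
	if (!free_slot || set(free_slot, data))
		return NULL;
	free_slot->in_use = 1;
	free_slot->model_hash = hashval;
	free_slot->i_state = MODEL_I_NEW;	/* caller must finish and unlock */
	return free_slot;
}
```

In the test, two objects share hash value 1 but still get distinct inodes, because foo_test() compares the full id.
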
When the inode has been created by iget5_locked(), it will be returned with the
I_NEW flag set and will still be locked. The filesystem then needs to finalize
the initialization. Once the inode is initialized it must be unlocked by
calling unlock_new_inode().

The filesystem is responsible for setting (and possibly testing) i_ino
when appropriate. There is also a simpler iget_locked() function that
just takes the superblock and inode number as arguments and does the
test and set for you.

e.g.
	inode = iget_locked(sb, ino);
	if (!inode)
		return -ENOMEM;
	if (inode->i_state & I_NEW) {
		err = read_inode_from_disk(inode);
		if (err < 0) {
			iget_failed(inode);
			return err;
		}
		unlock_new_inode(inode);
	}

Note that if the process of setting up a new inode fails, then iget_failed()
should be called on the inode to render it dead, and an appropriate error
should be passed back to the caller.

---
[recommended]

->getattr() is finally getting used. See instances in nfs, minix, etc.

---
[mandatory]

->revalidate() is gone. If your filesystem had it - provide ->getattr()
and let it call whatever you had as ->revalidate() + (for symlinks that
had ->revalidate()) add calls in ->follow_link()/->readlink().

---
[mandatory]

->d_parent changes are not protected by BKL anymore. Read access is safe
if at least one of the following is true:
	* filesystem has no cross-directory rename()
	* we know that parent had been locked (e.g. we are looking at
	  ->d_parent of ->lookup() argument).
	* we are called from ->rename().
	* the child's ->d_lock is held
Audit your code and add locking if needed. Notice that any place that is
not protected by the conditions above is risky even in the old tree - you
had been relying on BKL and that's prone to screwups. Old tree had quite
a few holes of that kind - unprotected access to ->d_parent leading to
anything from oops to silent memory corruption.

---
[mandatory]

FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
(see rootfs for one kind of solution and bdev/socket/pipe for another).

---
[recommended]

Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
As soon as it gets fixed, is_read_only() will die.

---
[mandatory]

->permission() is called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If
your method or its parts do not need BKL - better yet, now you can
shift lock_kernel() and unlock_kernel() so that they would protect
exactly what needs to be protected.

---
[mandatory]

->statfs() is now called without BKL held. BKL should have been
shifted into individual fs sb_op functions where it's not clear that
it's safe to remove it. If you don't need it, remove it.

---
[mandatory]

is_read_only() is gone; use bdev_read_only() instead.

---
[mandatory]

destroy_buffers() is gone; use invalidate_bdev().

---
[mandatory]

fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
deliberate; as soon as struct block_device * is propagated in a reasonable
way by that code, fixing will become trivial; until then nothing can be
done.

---
[mandatory]

Block truncation on error exit from ->write_begin and ->direct_IO has moved
from the generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to their callers. Take a look at
ext2_write_failed and its callers for an example.

---
[mandatory]

->truncate is going away. The whole truncate sequence needs to be
implemented in ->setattr, which is now mandatory for filesystems
implementing on-disk size changes. Start with a copy of the old inode_setattr
and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence
into this order: zeroing blocks using block_truncate_page or similar helpers,
then the size update, and finally the on-disk truncation, which should not
fail. inode_change_ok now includes the size checks for ATTR_SIZE and must be
called at the beginning of ->setattr unconditionally.

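The next sketch is a toy model of the size-change ordering inside ->setattr. It assumes a tiny in-memory "disk" of 8-byte blocks. model_change_ok() only stands in for the ATTR_SIZE checks in inode_change_ok(), and its error choices are hypothetical. The numbered comments follow the order given above: checks first, zero the partial tail block, update the size, then the on-disk truncation, which by then cannot fail.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define BLK		8	/* tiny block size for the model */
#define MAX_BLOCKS	4

struct foo_inode {
	long i_size;
	int nblocks;			/* allocated "on-disk" blocks */
	char data[MAX_BLOCKS][BLK];
};

/* Stand-in for the ATTR_SIZE checks; the only step allowed to fail. */
static int model_change_ok(long new_size)
{
	if (new_size < 0)
		return -EINVAL;
	if (new_size > (long)MAX_BLOCKS * BLK)
		return -EFBIG;
	return 0;
}

static int foo_setattr_size(struct foo_inode *inode, long new_size)
{
	int err = model_change_ok(new_size);	/* 1. checks, always first */
	long blocks_needed;

	if (err)
		return err;
	if (new_size < inode->i_size && new_size % BLK) {
		/* 2. zero the tail of the now-partial last block */
		long off = new_size % BLK;

		memset(&inode->data[new_size / BLK][off], 0, BLK - off);
	}
	inode->i_size = new_size;		/* 3. size update */
	blocks_needed = (new_size + BLK - 1) / BLK;
	if (inode->nblocks > blocks_needed)	/* 4. on-disk truncation */
		inode->nblocks = (int)blocks_needed;
	return 0;
}
```
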
---
[mandatory]

->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. The caller does *not* evict the pagecache or
inode-associated metadata buffers; getting rid of those is the responsibility
of the method, as it had been for ->delete_inode().

->drop_inode() returns int now; it's called on final iput() with inode_lock
held and it returns true if the filesystem wants the inode to be dropped. As
before, generic_drop_inode() is still the default and it's been updated
appropriately. generic_delete_inode() is also alive and it consists simply
of return 1. Note that all actual eviction work is done by the caller after
->drop_inode() returns.

clear_inode() is gone; use end_writeback() instead. As before, it must
be called exactly once on each call of ->evict_inode() (as it used to be for
each call of ->delete_inode()). Unlike before, if you are using
inode-associated metadata buffers (i.e. mark_buffer_dirty_inode()), it's your
responsibility to call invalidate_inode_buffers() before end_writeback().

No async writeback (and thus no calls of ->write_inode()) will happen
after end_writeback() returns, so actions that should not overlap with
->write_inode() (e.g. freeing the on-disk inode if i_nlink is 0) ought to be
done after that call.

NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out
if it's zero is not *and* *never* *had* *been* enough. Final unlink() and
iput() may happen while the inode is in the middle of ->write_inode(); e.g. if
you blindly free the on-disk inode, you may end up doing that while
->write_inode() is writing to it.

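The sketch below shows the ordering expected from ->evict_inode(). It records hypothetical step names into a trace string instead of making real kernel calls. foo_drop_inode() is a simplification, since the real generic_drop_inode() looks at more than i_nlink. The assert() in model_end_writeback() encodes the "exactly once" rule.

```c
#include <assert.h>
#include <string.h>

struct inode {
	unsigned int i_nlink;
	int has_assoc_buffers;	/* used mark_buffer_dirty_inode()? */
	int writeback_ended;	/* has end_writeback() run yet? */
};

static char trace[128];		/* order of operations, for checking */

static void step(const char *name)
{
	strcat(trace, name);
	strcat(trace, ";");
}

static void model_end_writeback(struct inode *inode)
{
	assert(inode->writeback_ended == 0);	/* exactly once per eviction */
	inode->writeback_ended = 1;
	step("end_writeback");
}

static void foo_evict_inode(struct inode *inode)
{
	step("truncate_pages");			/* pagecache is our job */
	if (inode->has_assoc_buffers)
		step("invalidate_buffers");	/* must precede end_writeback */
	model_end_writeback(inode);
	/* no ->write_inode() can overlap from here on */
	if (inode->i_nlink == 0)
		step("free_disk_inode");
}

/* ->drop_inode() model: nonzero means "evict on final iput()". */
static int foo_drop_inode(struct inode *inode)
{
	return inode->i_nlink == 0;
}
```

The second case in the test shows eviction of an inode that still has links. ->evict_inode() still runs, but the on-disk inode is left alone.
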
---
[mandatory]

.d_delete() now only advises the dcache as to whether or not to cache
unreferenced dentries, and is now only called when the dentry refcount
goes to 0. Even on a 0 refcount transition, it must be able to tolerate
being called 0, 1, or more times (i.e. it must be constant and idempotent).

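A minimal sketch of an idempotent ->d_delete() follows. It assumes a trimmed struct dentry and a hypothetical policy of not caching negative dentries. The answer is computed purely from the dentry's current state, with no counters or side effects. So it does not matter whether the dcache asks zero, one or several times.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-in for struct dentry. */
struct dentry {
	void *d_inode;		/* NULL for a negative dentry */
};

/*
 * Return 1 to ask the dcache to drop an unreferenced dentry, 0 to keep it.
 * Pure function of the dentry's state, so repeated calls agree.
 */
static int foo_d_delete(const struct dentry *dentry)
{
	return dentry->d_inode == NULL;	/* hypothetical policy */
}
```
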
---
[mandatory]

.d_compare() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.

---
[mandatory]

.d_hash() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.

---
[mandatory]

dcache_lock is gone, replaced by fine-grained locks. See fs/dcache.c
for details of what locks to replace dcache_lock with in order to protect
particular things. Most of the time, a filesystem only needs ->d_lock, which
protects *all* the dcache state of a given dentry.

---
[mandatory]

Filesystems must RCU-free their inodes, if they can have been accessed
via rcu-walk path walk (basically, if the file can have had a path name in the
vfs namespace).

i_dentry and i_rcu share storage in a union, and the vfs expects
i_dentry to be reinitialized before it is freed, so an

	INIT_LIST_HEAD(&inode->i_dentry);

must be done in the RCU callback.

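Below is a userspace model of RCU-freeing an inode whose i_dentry and i_rcu share a union. model_call_rcu() and model_grace_period() are hypothetical stand-ins for call_rcu() and the end of a grace period; no real RCU is involved. The structures are trimmed copies. The part that mirrors the text is foo_i_callback(), which re-initializes i_dentry before the free. The anonymous union needs C11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Stand-in for struct rcu_head. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *);
};

struct inode {
	union {				/* shared storage, as described above */
		struct list_head i_dentry;
		struct rcu_head i_rcu;
	};
	unsigned long i_ino;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct rcu_head *pending;	/* callbacks awaiting a grace period */
static int freed;

/* Toy call_rcu(): queue the callback, run nothing yet. */
static void model_call_rcu(struct rcu_head *head,
			   void (*func)(struct rcu_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Toy end of grace period: run every queued callback. */
static void model_grace_period(void)
{
	while (pending) {
		struct rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

static void foo_i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);

	/* i_rcu overwrote i_dentry: reinitialize it before the free */
	INIT_LIST_HEAD(&inode->i_dentry);
	free(inode);
	freed++;
}

/* ->destroy_inode() model: defer the free past rcu-walk readers. */
static void foo_destroy_inode(struct inode *inode)
{
	model_call_rcu(&inode->i_rcu, foo_i_callback);
}
```
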
---
[recommended]

vfs now tries to do path walking in "rcu-walk mode", which avoids
atomic operations and scalability hazards on dentries and inodes (see
Documentation/filesystems/path-walk.txt). d_hash and d_compare changes (above)
are examples of the changes required to support this. For more complex
filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so
no changes are required to the filesystem. However, this is costly and loses
the benefits of rcu-walk mode. We will begin to add filesystem callbacks that
are rcu-walk aware, shown below. Filesystems should take advantage of this
where possible.

---
[mandatory]

d_revalidate is a callback that is made on every path element (if
the filesystem provides it), which requires dropping out of rcu-walk mode. It
may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be
returned if the filesystem cannot handle rcu-walk. See
Documentation/filesystems/vfs.txt for more details.

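A sketch of an rcu-walk aware ->d_revalidate() follows. It assumes a hypothetical stand-in for LOOKUP_RCU and a filesystem whose revalidation sometimes needs a blocking round trip (needs_server_check). Cheap checks are answered directly. Anything that might block returns -ECHILD in rcu-walk mode, so the VFS can retry in ref-walk mode.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_LOOKUP_RCU 0x1	/* stand-in for LOOKUP_RCU in nd->flags */

struct nameidata { unsigned int flags; };
struct dentry { int needs_server_check; int valid; };

/* Returns 1 if valid, 0 if not, -ECHILD to leave rcu-walk and retry. */
static int foo_d_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	if (!dentry->needs_server_check)
		return dentry->valid;	/* cheap and non-blocking: fine under RCU */
	if (nd->flags & MODEL_LOOKUP_RCU)
		return -ECHILD;		/* would block: bail out of rcu-walk */
	dentry->needs_server_check = 0;	/* hypothetical blocking check */
	return dentry->valid;
}
```
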
permission and check_acl are inode permission checks that are called
on many or all directory inodes on the way down a path walk (to check for
exec permission). These must now be rcu-walk aware (flags & IPERM_RCU). See
Documentation/filesystems/vfs.txt for more details.

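The same pattern can be sketched for ->permission(). The sketch assumes stand-in MODEL_IPERM_RCU/MODEL_MAY_EXEC values and a hypothetical filesystem that may have to sleep to load an ACL. If the data is already in memory, the check completes under rcu-walk. Otherwise it returns -ECHILD and takes the slow path when called again without the flag.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_IPERM_RCU	0x1	/* stand-in for IPERM_RCU */
#define MODEL_MAY_EXEC	0x1	/* stand-in for MAY_EXEC */

struct inode {
	unsigned int i_mode;	/* only the "other" exec bit is modelled */
	int acl_cached;		/* ACL already in memory? */
};

/* Hypothetical slow path: fetching the ACL may sleep. */
static void foo_load_acl(struct inode *inode)
{
	inode->acl_cached = 1;
}

static int foo_permission(struct inode *inode, int mask, unsigned int flags)
{
	if (!inode->acl_cached) {
		if (flags & MODEL_IPERM_RCU)
			return -ECHILD;		/* cannot sleep in rcu-walk */
		foo_load_acl(inode);
	}
	if ((mask & MODEL_MAY_EXEC) && !(inode->i_mode & 01))
		return -EACCES;
	return 0;
}
```
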