RCU-based dcache locking model
==============================

On many workloads, the most common operation on dcache is to look up a
dentry, given a parent dentry and the name of the child. Typically,
for every open(), stat() etc., the dentry corresponding to the
pathname will be looked up by walking the tree starting with the first
component of the pathname and using that dentry along with the next
component to look up the next level and so on. Since it is a frequent
operation for workloads like multiuser environments and web servers,
it is important to optimize this path.

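As a rough illustration of that per-component walk (not the kernel's
actual path-walk code, which lives in link_path_walk() in fs/namei.c),
the loop below uses d_lookup() to resolve one component at a time.
The next_component() tokenizer is hypothetical and error handling is
reduced to the bare minimum:

	#include <linux/dcache.h>
	#include <linux/err.h>
	#include <linux/errno.h>

	/*
	 * Hypothetical helper (not a real kernel function): advances
	 * *path past one component and fills in this->name, this->len
	 * and this->hash (e.g. via full_name_hash()); returns 0 at the
	 * end of the path.
	 */
	extern int next_component(const char **path, struct qstr *this);

	static struct dentry *sketch_walk(struct dentry *root, const char *path)
	{
		struct dentry *parent = dget(root);
		struct qstr this;

		while (next_component(&path, &this)) {
			/* Find the child of 'parent' with this name in the
			 * dcache; d_lookup() returns it with a reference
			 * held, or NULL on a miss. */
			struct dentry *child = d_lookup(parent, &this);

			if (!child) {
				/* real code would fall back to the
				 * filesystem's ->lookup() here */
				dput(parent);
				return ERR_PTR(-ENOENT);
			}
			dput(parent);
			parent = child;
		}
		return parent;		/* caller must dput() the result */
	}
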
Prior to 2.5.10, dcache_lock was acquired in d_lookup() and thus for
every component during path look-up. From 2.5.10 onwards, the
fast-walk algorithm changed this by taking the dcache_lock at the
beginning and walking as many cached path component dentries as
possible under it. This significantly decreased the number of
dcache_lock acquisitions. However, it also increased the lock hold
time significantly and hurt performance on large SMP machines. Since
the 2.5.62 kernel, dcache has been using a new locking model that
uses RCU to make dcache look-up lock-free.

The current dcache locking model is not very different from the one
it replaced. Prior to the 2.5.62 kernel, dcache_lock protected the
hash chain, the d_child, d_alias and d_lru lists, as well as d_inode
and several other things like mount look-up. The RCU-based changes
affect only the way the hash chain is protected. For everything else
the dcache_lock must still be taken for both traversing and updating.
Hash chain updates also take the dcache_lock. The significant change
is the way d_lookup() traverses the hash chain: it doesn't acquire
the dcache_lock for this and relies on RCU to ensure that the dentry
has not been *freed*.
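
As a minimal sketch of what "hash chain updates take the dcache_lock"
means in practice (not the actual fs/dcache.c code; see __d_drop() and
d_rehash() there, and note that header locations and details differ
across kernel versions), a hash-chain insert and remove might look
like this:

	#include <linux/dcache.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	/* Writers serialize on dcache_lock (plus the per-dentry d_lock),
	 * while readers walk the chain with no lock at all; the _rcu
	 * list helpers provide the ordering the readers depend on. */

	static void sketch_hash(struct dentry *dentry,
				struct hlist_head *bucket)
	{
		spin_lock(&dcache_lock);
		spin_lock(&dentry->d_lock);
		dentry->d_flags &= ~DCACHE_UNHASHED;
		/* all earlier stores to *dentry become visible before
		 * the dentry is linked into the chain */
		hlist_add_head_rcu(&dentry->d_hash, bucket);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&dcache_lock);
	}

	static void sketch_unhash(struct dentry *dentry)
	{
		spin_lock(&dcache_lock);
		spin_lock(&dentry->d_lock);
		dentry->d_flags |= DCACHE_UNHASHED;
		/* hlist_del_rcu() leaves d_hash.next intact, so a
		 * lock-free walker currently at this dentry can still
		 * step to the rest of the chain */
		hlist_del_rcu(&dentry->d_hash);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&dcache_lock);
	}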


Dcache locking details
======================

For many multi-user workloads, open() and stat() on files are very
frequent operations. Both involve walking path names to find the
dentry corresponding to the file in question. In the 2.4 kernel,
dcache_lock was held during look-up of each path component.
Contention and cache-line bouncing of this global lock caused
significant scalability problems. With the introduction of RCU into
the Linux kernel, this was worked around by making the look-up of
path components during path walking lock-free.


Safe lock-free look-up of dcache hash table
===========================================

Dcache is a complex data structure with the hash table entries also
linked together in other lists. In the 2.4 kernel, dcache_lock
protected all the lists. RCU is applied only to hash chain walking;
the rest of the lists are still protected by dcache_lock. Some of the
important changes are:

1. Deletion from the hash chain is done using the hlist_del_rcu()
   macro, which doesn't reinitialize the next pointer of the deleted
   dentry; this allows lock-free walkers to proceed safely while a
   deletion is happening.

2. Insertion of a dentry into the hash table is done using
   hlist_add_head_rcu(), which takes care of ordering the writes -
   the writes to the dentry must be visible before the dentry is
   inserted. This works in conjunction with hlist_for_each_rcu()
   while walking the hash chain. The only requirement is that all
   initialization of the dentry must be done before
   hlist_add_head_rcu(), since we don't have dcache_lock protection
   while traversing the hash chain. This isn't different from the
   pre-existing code.

3. A dentry looked up without holding dcache_lock cannot be returned
   for walking if it is unhashed. It may then have a NULL d_inode or
   other bogosity, since RCU doesn't protect the other fields in the
   dentry. We therefore use the flag DCACHE_UNHASHED to indicate
   unhashed dentries and use it in conjunction with a per-dentry lock
   (d_lock). Once looked up without the dcache_lock, we acquire the
   per-dentry lock (d_lock) and check whether the dentry is unhashed.
   If so, the look-up fails. If not, the reference count of the
   dentry is increased and the dentry is returned (see the first
   sketch after this list).

4. Once a dentry has been looked up, it must be ensured that it does
   not go away for the duration of the path walk over that component.
   In pre-2.5.10 code, this was done by holding a reference to the
   dentry. dcache_rcu does the same; in some sense, dcache_rcu path
   walking looks like the pre-2.5.10 version.

5. All dentry hash chain updates must take the dcache_lock as well as
   the per-dentry lock, in that order. dput() does this to ensure
   that a dentry that has just been looked up on another CPU doesn't
   get deleted before dget() can be done on it.

6. There are several ways to do reference counting of RCU-protected
   objects. One example is the ipv4 route cache, where deferred
   freeing (using call_rcu()) is done as soon as the reference count
   goes to zero. This cannot be done in the case of dentries, because
   tearing down a dentry requires blocking (dentry_iput()), which
   isn't allowed from RCU callbacks. Instead, tear-down of a dentry
   happens synchronously in dput(), but the actual freeing happens
   later, once the RCU grace period is over (see the second sketch
   after this list). This allows safe lock-free walking of the hash
   chains, but a matched dentry may have been partially torn down.
   Checking the DCACHE_UNHASHED flag with d_lock held detects such
   dentries and prevents them from being returned from look-up.
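
Putting points 1-3 and 6 together, the look-up side has roughly the
following shape. This is a condensed, illustrative version of
__d_lookup() in fs/dcache.c, not the real thing: it assumes the field
names of kernels of that era (d_hash, an atomic_t d_count,
DCACHE_UNHASHED) and the four-argument hlist_for_each_entry_rcu() of
that vintage, and it skips several details of the real function:

	#include <linux/dcache.h>
	#include <linux/list.h>
	#include <linux/rcupdate.h>
	#include <linux/string.h>

	static struct dentry *sketch_lookup(struct dentry *parent,
					    struct qstr *name,
					    struct hlist_head *bucket)
	{
		struct dentry *found = NULL;
		struct dentry *dentry;
		struct hlist_node *node;

		rcu_read_lock();	/* no dcache_lock taken */
		hlist_for_each_entry_rcu(dentry, node, bucket, d_hash) {
			if (dentry->d_name.hash != name->hash)
				continue;
			if (dentry->d_parent != parent)
				continue;

			/* take d_lock before trusting anything else
			 * in the dentry */
			spin_lock(&dentry->d_lock);

			/* recheck under the lock: a rename may have
			 * moved the dentry meanwhile */
			if (dentry->d_parent != parent)
				goto next;
			/* the name is compared under d_lock, so a
			 * concurrent rename cannot change it while
			 * memcmp() runs */
			if (dentry->d_name.len != name->len ||
			    memcmp(dentry->d_name.name, name->name,
				   name->len))
				goto next;
			/* skip dentries already being torn down */
			if (dentry->d_flags & DCACHE_UNHASHED)
				goto next;

			/* grab a reference while still holding d_lock */
			atomic_inc(&dentry->d_count);
			found = dentry;
			spin_unlock(&dentry->d_lock);
			break;
	next:
			spin_unlock(&dentry->d_lock);
		}
		rcu_read_unlock();

		return found;
	}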


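The other half of point 6, the split between synchronous tear-down in
dput() and RCU-deferred freeing, might be sketched as below. This is
illustrative only: the real code is in dput()/d_free()/dentry_iput()
in fs/dcache.c, removal from the d_alias, d_lru and d_child lists is
omitted, dentry_cache is private to fs/dcache.c, and both the d_rcu
field placement and the call_rcu() signature (two arguments here,
three in kernels of that era) have changed across versions:

	#include <linux/dcache.h>
	#include <linux/fs.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	static void sketch_free_rcu(struct rcu_head *head)
	{
		/* runs only after every CPU has left its RCU
		 * read-side critical section */
		struct dentry *dentry =
			container_of(head, struct dentry, d_rcu);

		kmem_cache_free(dentry_cache, dentry);
	}

	static void sketch_kill(struct dentry *dentry)
	{
		struct inode *inode;

		spin_lock(&dcache_lock);
		spin_lock(&dentry->d_lock);
		dentry->d_flags |= DCACHE_UNHASHED; /* fail look-ups */
		hlist_del_rcu(&dentry->d_hash);
		inode = dentry->d_inode;
		dentry->d_inode = NULL;
		spin_unlock(&dentry->d_lock);
		spin_unlock(&dcache_lock);

		/* dropping the inode may block, which is why it is
		 * done here and not in the RCU callback (this is the
		 * dentry_iput() step from the text) */
		if (inode)
			iput(inode);

		/* memory is handed back only after a grace period, so
		 * concurrent lock-free walkers never touch freed memory */
		call_rcu(&dentry->d_rcu, sketch_free_rcu);
	}
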
Maintaining POSIX rename semantics
==================================

Since look-up of dentries is lock-free, it can race against a
concurrent rename operation. For example, during a rename of file A
to B, a look-up of either A or B must succeed. So, if a look-up of B
happens after A has been removed from its hash chain but before it
has been added to the new hash chain, it may fail. Also, a name
comparison done while the name is being written concurrently by a
rename may result in false positive matches, violating rename
semantics. Races with rename are handled as described below:

1. Look-up can be done in two ways - d_lookup(), which is safe from
   simultaneous renames, and __d_lookup(), which is not. If
   __d_lookup() fails, it must be followed up by a d_lookup() to
   correctly determine whether a dentry is in the hash table or not.
   d_lookup() protects look-ups using a sequence lock (rename_lock);
   see the sketch after this list.

2. The name associated with a dentry (d_name) may be changed if a
   rename is allowed to happen simultaneously. To avoid the memcmp()
   in __d_lookup() going out of bounds due to a rename, and to avoid
   false positive comparisons, the name comparison is done while
   holding the per-dentry lock. This prevents concurrent renames
   during this operation.

3. A hash-chain walk during look-up may wander into a different
   bucket if the current dentry is moved there by a rename. But the
   dcache hash table uses hlists, which are NULL-terminated, so even
   if a dentry moves to a different bucket, the hash chain walk will
   terminate. [With a list_head-based list it might not, since
   termination there means reaching the list_head of the original
   bucket.] Since we redo the d_parent check and compare the name
   while holding d_lock, a lock-free look-up will not race against
   d_move().

4. There is a theoretical race when a dentry keeps coming back to its
   original bucket due to double moves. Because of this, the look-up
   may consider that the dentry has never moved and can end up in an
   infinite loop. But this is no worse than the theoretical livelocks
   we already have in the kernel.
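
The retry described in point 1 is compact enough to show directly.
The sketch below follows the shape of d_lookup() in fs/dcache.c
(simplified, with no claim to match any particular kernel version
line for line): the sequence count of rename_lock is sampled before
the lock-free walk and checked afterwards, and the walk is redone if
a rename ran in between.

	#include <linux/dcache.h>
	#include <linux/seqlock.h>

	static struct dentry *sketch_d_lookup(struct dentry *parent,
					      struct qstr *name)
	{
		struct dentry *dentry;
		unsigned seq;

		do {
			/* snapshot the rename sequence count */
			seq = read_seqbegin(&rename_lock);
			/* lock-free hash chain walk */
			dentry = __d_lookup(parent, name);
			if (dentry)
				break;	/* positive hit was verified
					 * under d_lock, trust it */
		} while (read_seqretry(&rename_lock, seq));

		return dentry;
	}

A miss from __d_lookup() alone cannot be trusted, because the dentry
may have been between hash chains at that moment; a hit can, because
the match was verified under d_lock.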


Important guidelines for filesystem developers related to dcache_rcu
====================================================================

1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
   don't change. Only the dcache-internal implementation changes.
   However, filesystems *must not* delete from the dentry hash chains
   directly using the list macros, as was allowed earlier. They must
   use dcache APIs like d_drop() or __d_drop(), depending on the
   situation (see the sketch after this list).

2. d_flags is now protected by a per-dentry lock (d_lock). All access
   to d_flags must be protected by it.

3. For a hashed dentry, checking of d_count needs to be protected by
   d_lock.
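
As a short sketch of how a filesystem might follow these guidelines,
consider the kind of check a filesystem could perform from its
->d_revalidate() method. Everything prefixed myfs_ is hypothetical;
the point is only which locks are held where:

	#include <linux/dcache.h>
	#include <linux/spinlock.h>

	/* Hypothetical filesystem-private staleness check. */
	extern int myfs_dentry_is_stale(struct dentry *dentry);

	static int myfs_check_dentry(struct dentry *dentry)
	{
		if (!myfs_dentry_is_stale(dentry))
			return 1;	/* still valid, nothing to do */

		/* Guideline 3: look at d_count only under d_lock; a
		 * lock-free look-up on another CPU may be taking a
		 * new reference at this very moment. */
		spin_lock(&dentry->d_lock);
		if (atomic_read(&dentry->d_count) > 1) {
			spin_unlock(&dentry->d_lock);
			return 0;	/* busy elsewhere, don't touch */
		}
		spin_unlock(&dentry->d_lock);

		/* Guidelines 1 and 2: unhash through the dcache API.
		 * d_drop() takes the locks it needs, sets
		 * DCACHE_UNHASHED and unlinks with hlist_del_rcu();
		 * never open-code this with raw list macros. */
		d_drop(dentry);
		return 0;
	}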


Papers and other documentation on dcache locking
================================================

1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).

2. http://lse.sourceforge.net/locking/dcache/dcache.html