Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g. a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g. an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool. Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used by the thin-provisioning
target.

The decision as to what data to migrate and when is left to a plug-in
policy module. Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g. a VM image server).

Glossary
========

  Migration  - Movement of the primary copy of a logical block from one
               device to the other.
  Promotion  - Migration from slow device to fast device.
  Demotion   - Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness. This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256KB - 1024KB. The block size is given in 512-byte
sectors; it must be between 64 (32KB) and 2097152 (1GB) and a multiple
of 64 (32KB).

Having a fixed block size simplifies the target a lot, but it is
something of a compromise. For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space, and small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

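The block size limits above can be checked before a table is loaded.
The following is only a sketch of that arithmetic, not code from the
target; the function names are made up for illustration, and the
512-byte sector size is inferred from "64 (32KB)".

```python
SECTOR_SIZE = 512             # bytes per sector, inferred from 64 sectors == 32KB
MIN_BLOCK_SECTORS = 64        # 32KB
MAX_BLOCK_SECTORS = 2097152   # 1GB


def check_block_size(sectors):
    """Return the block size in bytes, or raise ValueError if it is invalid."""
    if not MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS:
        raise ValueError("block size must be between 64 and 2097152 sectors")
    if sectors % MIN_BLOCK_SECTORS != 0:
        raise ValueError("block size must be a multiple of 64 sectors")
    return sectors * SECTOR_SIZE


def cache_blocks(cache_dev_sectors, block_sectors):
    """Number of whole cache blocks that fit on a cache device (hypothetical helper)."""
    check_block_size(block_sectors)
    return cache_dev_sectors // block_sectors
```

For example, a block size of 512 sectors is 256KB, and a 1GB cache
device then holds 4096 such blocks. Halving the block size doubles the
number of blocks the metadata has to describe.
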
Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices. Clean
blocks should remain clean.

If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits invalidate the
cache block. To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency. Coherency that exists is maintained, although
the cache will gradually cool as writes take place. If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm. Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache. This is useful for decommissioning a cache or
when shrinking a cache. Shrinking the cache's fast device requires all
cache blocks in the area of the cache being removed to be clean. If the
area being removed from the cache still contains dirty blocks the resize
will fail. Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean. This is of particular
importance if writeback mode is used. Writethrough and passthrough
modes already maintain a clean cache. Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.

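The difference between the three modes, and what "cleaning" means, can
be illustrated with a toy model. This is a deliberately simplified
sketch and not the kernel implementation: the class and method names
are invented, devices are plain dictionaries, promotion is done by
hand rather than by a policy, and write misses are assumed to go
straight to the origin.

```python
class ToyCache:
    """Toy model of dm-cache write handling; not the kernel implementation."""

    def __init__(self, mode="writeback"):
        if mode not in ("writeback", "writethrough", "passthrough"):
            raise ValueError("unknown mode: " + mode)
        self.mode = mode
        self.origin = {}    # logical block -> data on the slow device
        self.cache = {}     # logical block -> data on the fast device
        self.dirty = set()  # cached blocks that are newer than the origin copy

    def promote(self, block):
        """Copy a block from the origin into the cache (policy decision elided)."""
        self.cache[block] = self.origin.get(block)

    def write(self, block, data):
        cached = block in self.cache
        if self.mode == "writeback" and cached:
            # Write hit: only the cache is updated and the block becomes dirty.
            self.cache[block] = data
            self.dirty.add(block)
        elif self.mode == "writethrough" and cached:
            # Write hit: both copies are updated, so the block stays clean.
            self.cache[block] = data
            self.origin[block] = data
        else:
            # Write miss, or any write in passthrough mode: go to the origin.
            self.origin[block] = data
            if cached:
                # Only reachable in passthrough mode: invalidate the cache block.
                del self.cache[block]

    def read(self, block):
        # In passthrough mode every read is served from the origin.
        if self.mode != "passthrough" and block in self.cache:
            return self.cache[block]
        return self.origin.get(block)

    def clean(self):
        """Write back every dirty block, as the cleaner policy would."""
        for block in sorted(self.dirty):
            self.origin[block] = self.cache[block]
        self.dirty.clear()
```

In this model only writeback ever produces dirty blocks, which is why
the cache must be cleaned before shrinking it in that mode, while
writethrough and passthrough leave nothing to clean.
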
Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time. Currently we're not taking any
account of normal I/O traffic going to the devices. More work needs
doing here to avoid migrating during those peak I/O moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 204800 sectors (or 100MB).

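The threshold is expressed in sectors, so a size in megabytes has to
be converted first. A small sketch of the conversion follows; the
helper names are hypothetical, 512-byte sectors are assumed, and the
command layout follows the "dmsetup message <device> 0 <key> <value>"
form used in the Messages section. The function only builds the
argument list and does not run anything.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed)


def sectors_to_mib(sectors):
    """Convert a sector count to mebibytes."""
    return sectors * SECTOR_SIZE / (1024 * 1024)


def migration_threshold_command(device, mib):
    """Build (but do not run) a dmsetup command setting migration_threshold."""
    sectors = mib * 1024 * 1024 // SECTOR_SIZE
    return ["dmsetup", "message", device, "0",
            "migration_threshold", str(sectors)]
```

With these helpers the default of 204800 sectors works out to 100MB,
and asking for 100MB on a device named my_cache yields the message
"migration_threshold 204800".
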
Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the cache behaves like a physical disk that has a volatile write
cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block. It's up to
the policy how big this chunk is, but it should be kept small. Like the
dirty flags, this data is lost if there's a crash, so a safe fallback
value should always be possible.

For instance, the 'mq' policy, which is currently the default policy,
uses this facility to store the hit count of the cache blocks. If
there's a crash this information will be lost, which means the cache
may be less efficient until those hit counts are regenerated.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded. A prime example of this is when mkfs discards the
whole block device. We store a bitset tracking the discard state of
blocks. However, we allow this bitset to have a different block size
from the cache blocks. This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

 cache <metadata dev> <cache dev> <origin dev> <block size>
       <#feature args> [<feature arg>]*
       <policy> <#policy args> [policy args]*

 metadata dev    : fast device holding the persistent metadata
 cache dev       : fast device holding cached data blocks
 origin dev      : slow device holding original data blocks
 block size      : cache unit size in sectors

 #feature args   : number of feature arguments passed
 feature args    : writethrough or passthrough (The default is writeback.)

 policy          : the replacement policy to use
 #policy args    : an even number of arguments corresponding to
                   key/value pairs passed to the policy
 policy args     : key/value pairs passed to the policy
                   E.g. 'sequential_threshold 1024'
                   See cache-policies.txt for details.

Optional feature arguments are:
   writethrough  : write through caching that prohibits cache block
                   content from being different from origin block content.
                   Without this argument, the default behaviour is to write
                   back cache block contents later for performance reasons,
                   so they may differ from the corresponding origin blocks.

   passthrough   : a degraded mode useful for various cache coherency
                   situations (e.g., rolling back snapshots of
                   underlying storage). Reads and writes always go to
                   the origin. If a write goes to a cached origin
                   block, then the cache block is invalidated.
                   To enable passthrough mode the cache must be clean.

A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.

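Because both the feature arguments and the policy arguments are
preceded by a count, it is easy to get a table line wrong by hand. The
sketch below assembles one from its parts. The function name and
parameters are invented for illustration, and the leading
"<start> <length>" fields are taken from the dmsetup examples at the
end of this document rather than from the constructor itself.

```python
def cache_table(length, metadata_dev, cache_dev, origin_dev, block_size,
                features=(), policy="default", policy_args=None):
    """Build a dm-cache table line for a device of `length` sectors."""
    policy_args = policy_args or {}
    fields = [0, length, "cache",
              metadata_dev, cache_dev, origin_dev, block_size,
              len(features), *features,
              policy, 2 * len(policy_args)]   # count is keys plus values
    for key, value in policy_args.items():
        fields += [key, value]
    return " ".join(str(field) for field in fields)
```

Passing policy arguments as a dictionary guarantees that #policy args
is even, and taking the feature count from the sequence keeps
#feature args in step with the arguments that follow it.
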
Status
------

<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
<cache metadata mode> <needs_check>

metadata block size    : Fixed block size for each metadata block in
                         sectors
#used metadata blocks  : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
cache block size       : Configurable block size for the cache device
                         in sectors
#used cache blocks     : Number of blocks resident in the cache
#total cache blocks    : Total number of cache blocks
#read hits             : Number of times a READ bio has been mapped
                         to the cache
#read misses           : Number of times a READ bio has been mapped
                         to the origin
#write hits            : Number of times a WRITE bio has been mapped
                         to the cache
#write misses          : Number of times a WRITE bio has been
                         mapped to the origin
#demotions             : Number of times a block has been removed
                         from the cache
#promotions            : Number of times a block has been moved to
                         the cache
#dirty                 : Number of blocks in the cache that differ
                         from the origin
#features              : Number of feature args to follow
features               : 'writethrough' (optional)
#core args             : Number of core arguments (must be even)
core args              : Key/value pairs for tuning the core
                         e.g. migration_threshold
policy name            : Name of the policy
#policy args           : Number of policy arguments to follow (must be even)
policy args            : Key/value pairs e.g. sequential_threshold
cache metadata mode    : ro if read-only, rw if read-write
        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'. The userspace recovery tools
        should then be used.
needs_check            : 'needs_check' if set, '-' if not set
        A metadata operation has failed, resulting in the needs_check
        flag being set in the metadata's superblock. The metadata
        device must be deactivated and checked/repaired before the
        cache can be made fully operational again. '-' indicates
        needs_check is not set.

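Since three of the status fields are counted lists, a status line has
to be read left to right rather than by fixed position. The following
parser is a sketch based only on the field order documented above. The
function name and dictionary keys are invented, it expects just the
target-specific part of the line, and it assumes needs_check is the
last field and may be absent.

```python
def parse_cache_status(params):
    """Parse the cache-specific part of a status line into a dictionary."""
    tokens = params.split()
    if tokens == ["Fail"]:
        return {"failed": True}
    pos = 0

    def take(count=1):
        nonlocal pos
        chunk = tokens[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("status line is truncated")
        pos += count
        return chunk

    def take_counted():
        # A count followed by that many arguments.
        return take(int(take()[0]))

    def pairs(items):
        return dict(zip(items[0::2], items[1::2]))

    status = {"failed": False}
    status["metadata_block_size"] = int(take()[0])
    used, total = take()[0].split("/")
    status["metadata_blocks"] = (int(used), int(total))
    status["cache_block_size"] = int(take()[0])
    used, total = take()[0].split("/")
    status["cache_blocks"] = (int(used), int(total))
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(take()[0])
    status["features"] = take_counted()
    status["core_args"] = pairs(take_counted())
    status["policy"] = take()[0]
    status["policy_args"] = pairs(take_counted())
    status["metadata_mode"] = take()[0]
    # Assumption: needs_check, when present, is the final field.
    status["needs_check"] = pos < len(tokens) and tokens[pos] == "needs_check"
    return status
```

A caller would typically check "failed" first, then compare "dirty"
against the used cache block count, or derive a hit ratio from the
read and write counters.
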
Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.)

The message format is:

   <key> <value>

E.g.
   dmsetup message my_cache 0 sequential_threshold 1024


Invalidation is removing an entry from the cache without writing it
back. Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges. Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9. Each cblock must be expressed as a decimal
value. In the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used.

   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.
   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

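The "one past the end" convention is easy to get wrong by one. The
sketch below expands invalidate_cblocks arguments into the individual
cblocks they cover. The function name is invented, and rejecting empty
or reversed ranges is a choice made for this sketch, not documented
behaviour of the target.

```python
def expand_cblock_ranges(args):
    """Expand invalidate_cblocks arguments into a sorted list of cblocks."""
    cblocks = set()
    for arg in args:
        if "-" in arg:
            begin, end = (int(part) for part in arg.split("-"))
            if begin >= end:
                raise ValueError("empty or reversed range: " + arg)
            # The end value is one past the last cblock in the range.
            cblocks.update(range(begin, end))
        else:
            cblocks.add(int(arg))
    return sorted(cblocks)
```

Expanding "5-10" this way gives cblocks 5 to 9, matching the
description above. The dmsetup example therefore invalidates one
single cblock plus two ranges of 1111 cblocks each.
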
Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0"
dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
        mq 4 sequential_threshold 1024 random_threshold 8"