David Howells | 9ae326a | 2009-04-03 16:42:41 +0100 | [diff] [blame] | 1 | =============================================== |
| 2 | CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM |
| 3 | =============================================== |
| 4 | |
| 5 | Contents: |
| 6 | |
| 7 | (*) Overview. |
| 8 | |
| 9 | (*) Requirements. |
| 10 | |
| 11 | (*) Configuration. |
| 12 | |
| 13 | (*) Starting the cache. |
| 14 | |
| 15 | (*) Things to avoid. |
| 16 | |
| 17 | (*) Cache culling. |
| 18 | |
| 19 | (*) Cache structure. |
| 20 | |
| 21 | (*) Security model and SELinux. |
| 22 | |
| 23 | (*) A note on security. |
| 24 | |
| 25 | (*) Statistical information. |
| 26 | |
| 27 | (*) Debugging. |
| 28 | |
| 29 | |
| 30 | ======== |
| 31 | OVERVIEW |
| 32 | ======== |
| 33 | |
| 34 | CacheFiles is a caching backend that is meant to use a directory on an already |
| 35 | mounted filesystem of a local type (such as Ext3) as a cache. |
| 36 | |
| 37 | CacheFiles uses a userspace daemon to do some of the cache management - such as |
| 38 | reaping stale nodes and culling. This is called cachefilesd and lives in |
| 39 | /sbin. |
| 40 | |
| 41 | The filesystem and data integrity of the cache are only as good as those of the |
| 42 | filesystem providing the backing services. Note that CacheFiles does not |
| 43 | attempt to journal anything since the journalling interfaces of the various |
| 44 | filesystems are very specific in nature. |
| 45 | |
| 46 | CacheFiles creates a misc character device - "/dev/cachefiles" - that is used |
| 47 | to communicate with the daemon. Only one thing may have this open at once, |
| 48 | and whilst it is open, a cache is at least partially in existence. The daemon |
| 49 | opens this and sends commands down it to control the cache. |
| 50 | |
| 51 | CacheFiles is currently limited to a single cache. |
| 52 | |
| 53 | CacheFiles attempts to maintain at least a certain percentage of free space on |
| 54 | the filesystem, shrinking the cache by culling the objects it contains to make |
| 55 | space if necessary - see the "Cache Culling" section. This means it can be |
| 56 | placed on the same medium as a live set of data, and will expand to make use of |
| 57 | spare space and automatically contract when the set of data requires more |
| 58 | space. |
| 59 | |
| 60 | |
| 61 | ============ |
| 62 | REQUIREMENTS |
| 63 | ============ |
| 64 | |
| 65 | The use of CacheFiles and its daemon requires the following features to be |
| 66 | available in the system and in the cache filesystem: |
| 67 | |
| 68 | - dnotify. |
| 69 | |
| 70 | - extended attributes (xattrs). |
| 71 | |
| 72 | - openat() and friends. |
| 73 | |
| 74 | - bmap() support on files in the filesystem (FIBMAP ioctl). |
| 75 | |
| 76 | - The use of bmap() to detect a partial page at the end of the file. |
| 77 | |
| 78 | It is strongly recommended that the "dir_index" option is enabled on Ext3 |
| 79 | filesystems being used as a cache. |
| 80 | |
| 81 | |
| 82 | ============= |
| 83 | CONFIGURATION |
| 84 | ============= |
| 85 | |
| 86 | The cache is configured by a script in /etc/cachefilesd.conf. These commands |
| 87 | set up the cache ready for use. The following script commands are available: |
| 88 | |
| 89 | (*) brun <N>% |
| 90 | (*) bcull <N>% |
| 91 | (*) bstop <N>% |
| 92 | (*) frun <N>% |
| 93 | (*) fcull <N>% |
| 94 | (*) fstop <N>% |
| 95 | |
| 96 | Configure the culling limits. Optional. See the section on culling. |
| 97 | The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. |
| 98 | |
| 99 | The commands beginning with a 'b' are file space (block) limits, those |
| 100 | beginning with an 'f' are file count limits. |
| 101 | |
| 102 | (*) dir <path> |
| 103 | |
| 104 | Specify the directory containing the root of the cache. Mandatory. |
| 105 | |
| 106 | (*) tag <name> |
| 107 | |
| 108 | Specify a tag for FS-Cache to use in distinguishing multiple caches. |
| 109 | Optional. The default is "CacheFiles". |
| 110 | |
| 111 | (*) debug <mask> |
| 112 | |
| 113 | Specify a numeric bitmask to control debugging in the kernel module. |
| 114 | Optional. The default is zero (all off). The following values can be |
| 115 | OR'd into the mask to collect various information: |
| 116 | |
| 117 | 1 Turn on trace of function entry (_enter() macros) |
| 118 | 2 Turn on trace of function exit (_leave() macros) |
| 119 | 4 Turn on trace of internal debug points (_debug()) |
| 120 | |
| 121 | This mask can also be set through sysfs, e.g.: |
| 122 | |
| 123 | echo 5 >/sys/module/cachefiles/parameters/debug |
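As a rough illustration of the commands listed in this section, the following
Python sketch parses text in that format and checks that the mandatory dir
command is present. parse_config() is a hypothetical helper, not part of
cachefilesd, and the sample tag name is invented for the example. The
percentages are simply the documented defaults written out, and the directory
is the default cache location mentioned in the SELinux section.

```python
# Hypothetical helper: parse cachefilesd.conf-style text into a dict.
# Only the commands listed in this section are recognised.
LIMITS = ("brun", "bcull", "bstop", "frun", "fcull", "fstop")
KNOWN = LIMITS + ("dir", "tag", "debug")

# Sample input; the tag name "mycache" is made up for illustration.
SAMPLE_CONF = """\
dir /var/fscache
tag mycache
brun 7%
bcull 5%
bstop 1%
frun 7%
fcull 5%
fstop 1%
"""

def parse_config(text):
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] not in KNOWN:
            raise ValueError("bad config line: %r" % raw)
        cmd, arg = parts
        if cmd in LIMITS:
            if not arg.endswith("%"):
                raise ValueError("%s needs a percentage" % cmd)
            conf[cmd] = int(arg[:-1])
        elif cmd == "debug":
            conf[cmd] = int(arg, 0)       # accept decimal or 0x... masks
        else:
            conf[cmd] = arg
    if "dir" not in conf:
        raise ValueError("the dir command is mandatory")
    conf.setdefault("tag", "CacheFiles")  # documented default tag
    conf.setdefault("debug", 0)           # documented default mask
    return conf

conf = parse_config(SAMPLE_CONF)
print(conf["dir"], conf["tag"], conf["brun"])
```

The sketch only mirrors the syntax described here; the daemon itself may accept
or reject input differently.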
| 124 | |
| 125 | |
| 126 | ================== |
| 127 | STARTING THE CACHE |
| 128 | ================== |
| 129 | |
| 130 | The cache is started by running the daemon. The daemon opens the cache device, |
| 131 | configures the cache and tells it to begin caching. At that point the cache |
| 132 | binds to fscache and the cache becomes live. |
| 133 | |
| 134 | The daemon is run as follows: |
| 135 | |
| 136 | /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] |
| 137 | |
| 138 | The flags are: |
| 139 | |
| 140 | (*) -d |
| 141 | |
| 142 | Increase the debugging level. This can be specified multiple times and |
| 143 | is cumulative with itself. |
| 144 | |
| 145 | (*) -s |
| 146 | |
| 147 | Send messages to stderr instead of syslog. |
| 148 | |
| 149 | (*) -n |
| 150 | |
| 151 | Don't daemonise and go into background. |
| 152 | |
| 153 | (*) -f <configfile> |
| 154 | |
| 155 | Use an alternative configuration file rather than the default one. |
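For scripted start-up, the flags above can be assembled programmatically. The
Python sketch below only builds the argument list and does not run anything.
cachefilesd_argv() and the configuration path are hypothetical names chosen for
the example.

```python
# Hypothetical helper: assemble a cachefilesd command line from the
# flags documented above.  Nothing is executed here.
def cachefilesd_argv(debug_level=0, to_stderr=False, foreground=False,
                     config=None):
    argv = ["/sbin/cachefilesd"]
    argv += ["-d"] * debug_level      # -d may be given several times
    if to_stderr:
        argv.append("-s")             # messages to stderr, not syslog
    if foreground:
        argv.append("-n")             # do not daemonise
    if config is not None:
        argv += ["-f", config]        # alternative configuration file
    return argv

# The configuration path below is an invented example.
argv = cachefilesd_argv(2, True, True, "/tmp/test-cachefilesd.conf")
print(" ".join(argv))
```

On a system where the daemon is installed, the resulting list could be handed
to subprocess.run() by a suitably privileged user.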
| 156 | |
| 157 | |
| 158 | =============== |
| 159 | THINGS TO AVOID |
| 160 | =============== |
| 161 | |
| 162 | Do not mount other things within the cache as this will cause problems. The |
| 163 | kernel module contains its own very cut-down path walking facility that ignores |
| 164 | mountpoints, but the daemon can't avoid them. |
| 165 | |
| 166 | Do not create, rename or unlink files and directories in the cache whilst the |
| 167 | cache is active, as this may cause the state to become uncertain. |
| 168 | |
| 169 | Renaming files in the cache might make objects appear to be other objects (the |
| 170 | filename is part of the lookup key). |
| 171 | |
| 172 | Do not change or remove the extended attributes attached to cache files by the |
| 173 | cache as this will cause the cache state management to get confused. |
| 174 | |
| 175 | Do not create files or directories in the cache, lest the cache get confused or |
| 176 | serve incorrect data. |
| 177 | |
| 178 | Do not chmod files in the cache. The module creates things with minimal |
| 179 | permissions to prevent random users being able to access them directly. |
| 180 | |
| 181 | |
| 182 | ============= |
| 183 | CACHE CULLING |
| 184 | ============= |
| 185 | |
| 186 | The cache may need culling occasionally to make space. This involves |
| 187 | discarding objects from the cache that have been used less recently than |
| 188 | anything else. Culling is based on the access time of data objects. Empty |
| 189 | directories are culled if not in use. |
| 190 | |
| 191 | Cache culling is done on the basis of the percentage of blocks and the |
| 192 | percentage of files available in the underlying filesystem. There are six |
| 193 | "limits": |
| 194 | |
| 195 | (*) brun |
| 196 | (*) frun |
| 197 | |
| 198 | If the amount of available space and the number of available files in the |
| 199 | cache rise above both these limits, then culling is turned off. |
| 200 | |
| 201 | (*) bcull |
| 202 | (*) fcull |
| 203 | |
| 204 | If the amount of available space or the number of available files in the |
| 205 | cache falls below either of these limits, then culling is started. |
| 206 | |
| 207 | (*) bstop |
| 208 | (*) fstop |
| 209 | |
| 210 | If the amount of available space or the number of available files in the |
| 211 | cache falls below either of these limits, then no further allocation of |
| 212 | disk space or files is permitted until culling has raised things above |
| 213 | these limits again. |
| 214 | |
| 215 | These must be configured as follows: |
| 216 | |
| 217 | 0 <= bstop < bcull < brun < 100 |
| 218 | 0 <= fstop < fcull < frun < 100 |
| 219 | |
| 220 | Note that these are percentages of available space and available files, and do |
| 221 | _not_ appear as 100 minus the percentage displayed by the "df" program. |
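The rules above can be modelled in a few lines. The Python sketch below checks
the ordering constraint and then applies the three pairs of limits to a pair of
available percentages. It is an illustration only: the function names are
hypothetical, and keeping the previous culling state while the values sit
between the cull and run limits is an interpretation of the text rather than
something it states outright.

```python
# Hypothetical model of the six limits; all values are percentages of
# available blocks (b*) or available files (f*).
DEFAULTS = {"brun": 7, "bcull": 5, "bstop": 1,
            "frun": 7, "fcull": 5, "fstop": 1}

def check_limits(lim):
    """Enforce 0 <= stop < cull < run < 100 for both limit families."""
    for p in "bf":
        stop, cull, run = lim[p + "stop"], lim[p + "cull"], lim[p + "run"]
        if not (0 <= stop < cull < run < 100):
            raise ValueError("%sstop < %scull < %srun violated" % (p, p, p))

def update(culling, b_avail, f_avail, lim):
    """Return (culling, may_allocate) for the given available percentages."""
    if b_avail < lim["bcull"] or f_avail < lim["fcull"]:
        culling = True                      # below either cull limit
    elif b_avail > lim["brun"] and f_avail > lim["frun"]:
        culling = False                     # above both run limits
    # Otherwise the previous state is kept (assumed hysteresis).
    may_allocate = not (b_avail < lim["bstop"] or f_avail < lim["fstop"])
    return culling, may_allocate

check_limits(DEFAULTS)
state = False
for b, f in ((50, 50), (4, 50), (6, 50), (8, 8), (0.5, 50)):
    state, ok = update(state, b, f, DEFAULTS)
    print(b, f, "culling" if state else "idle",
          "alloc ok" if ok else "alloc stopped")
```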
| 222 | |
| 223 | The userspace daemon scans the cache to build up a table of cullable objects. |
| 224 | These are then culled in least recently used order. A new scan of the cache is |
| 225 | started as soon as space is made in the table. Objects will be skipped if |
| 226 | their atimes have changed or if the kernel module says it is still using them. |
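The scan-then-cull behaviour described above can be sketched as follows. This
Python example is not the daemon's code: it builds a table of (atime, path)
pairs, sorts it oldest first, and skips any entry whose atime has moved since
the scan. The check with the kernel module for objects still in use is left
out, and the file names and timestamps are invented for the demonstration.

```python
import os
import tempfile

def build_cull_table(root):
    """Return [(atime_ns, path)] for regular files under root, oldest first."""
    table = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            table.append((os.stat(path).st_atime_ns, path))
    table.sort()
    return table

def next_victim(table):
    """Pop entries until one is found whose atime is unchanged."""
    while table:
        atime_ns, path = table.pop(0)
        if os.stat(path).st_atime_ns == atime_ns:
            return path
        # atime moved since the scan: the object was used, so skip it
    return None

with tempfile.TemporaryDirectory() as root:
    # Three files with artificial access times (seconds since the epoch).
    for name, atime in (("old", 1000), ("mid", 2000), ("new", 3000)):
        path = os.path.join(root, name)
        open(path, "w").close()
        os.utime(path, (atime, atime))
    table = build_cull_table(root)
    os.utime(os.path.join(root, "old"), (4000, 4000))  # "old" gets used
    victim = os.path.basename(next_victim(table))
    print(victim)
```

Here "old" would have been first in line, but its atime changed after the scan,
so the next least recently used entry is chosen instead.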
| 227 | |
| 228 | |
| 229 | =============== |
| 230 | CACHE STRUCTURE |
| 231 | =============== |
| 232 | |
| 233 | The CacheFiles module will create two directories in the directory it was |
| 234 | given: |
| 235 | |
| 236 | (*) cache/ |
| 237 | |
| 238 | (*) graveyard/ |
| 239 | |
| 240 | The active cache objects all reside in the first directory. The CacheFiles |
| 241 | kernel module moves any retired or culled objects that it can't simply unlink |
| 242 | to the graveyard from which the daemon will actually delete them. |
| 243 | |
| 244 | The daemon uses dnotify to monitor the graveyard directory, and will delete |
| 245 | anything that appears therein. |
| 246 | |
| 247 | |
| 248 | The module represents index objects as directories with the filename "I..." or |
| 249 | "J...". Note that the "cache/" directory is itself a special index. |
| 250 | |
| 251 | Data objects are represented as files if they have no children, or directories |
| 252 | if they do. Their filenames all begin "D..." or "E...". If represented as a |
| 253 | directory, data objects will have a file in the directory called "data" that |
| 254 | actually holds the data. |
| 255 | |
| 256 | Special objects are similar to data objects, except their filenames begin |
| 257 | "S..." or "T...". |
| 258 | |
| 259 | |
| 260 | If an object has children, then it will be represented as a directory. |
| 261 | Immediately in the representative directory is a collection of directories |
| 262 | named for hash values of the child object keys with an '@' prepended. Into |
| 263 | these directories, if possible, will be placed the representations of the child |
| 264 | objects: |
| 265 | |
| 266 | INDEX     INDEX      INDEX                             DATA FILES |
| 267 | ========= ========== ================================= ================ |
| 268 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 |
| 269 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry |
| 270 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry |
| 271 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry |
| 272 | |
| 273 | |
| 274 | If the key is so long that it exceeds NAME_MAX with the decorations added on to |
| 275 | it, then it will be cut into pieces, the first few of which will be used to |
| 276 | make a nest of directories, and the last one of which will be the object |
| 277 | inside the last directory. The names of the intermediate directories will have |
| 278 | '+' prepended: |
| 279 | |
| 280 | J1223/@23/+xy...z/+kl...m/Epqr |
| 281 | |
| 282 | |
| 283 | Note that keys are raw data, and not only may they exceed NAME_MAX in size, |
| 284 | they may also contain things like '/' and NUL characters, and so they may not |
| 285 | be suitable for turning directly into a filename. |
| 286 | |
| 287 | To handle this, CacheFiles will use a suitably printable filename directly and |
| 288 | "base-64" encode ones that aren't directly suitable. The two versions of |
| 289 | object filenames indicate the encoding: |
| 290 | |
| 291 | OBJECT TYPE     PRINTABLE       ENCODED |
| 292 | =============== =============== =============== |
| 293 | Index           "I..."          "J..." |
| 294 | Data            "D..."          "E..." |
| 295 | Special         "S..."          "T..." |
| 296 | |
| 297 | Intermediate directories are always "@" or "+" as appropriate. |
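The naming scheme lends itself to a simple lookup on the first character of
each path component. The Python sketch below encodes the prefix table from the
text. classify() is a hypothetical helper, and the labels used for the two
kinds of intermediate directory ("hash" and "continuation") are names chosen
here, not terms used by CacheFiles.

```python
# Prefix table taken from the text above; classify() is a hypothetical helper.
PREFIXES = {
    "I": ("index", "printable"), "J": ("index", "encoded"),
    "D": ("data", "printable"), "E": ("data", "encoded"),
    "S": ("special", "printable"), "T": ("special", "encoded"),
    "@": ("intermediate", "hash"),          # hash of child object key
    "+": ("intermediate", "continuation"),  # piece of an over-long key
}

def classify(name):
    """Map a cache filename to (object type, encoding) by its first char."""
    try:
        return PREFIXES[name[0]]
    except (KeyError, IndexError):
        return ("unknown", None)

# Example path from the listing above, minus the leading "cache/" index.
path = "cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400"
for part in path.split("/")[1:]:
    print(part, classify(part))
```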
| 298 | |
| 299 | |
| 300 | Each object in the cache has an extended attribute label that holds the object |
| 301 | type ID (required to distinguish special objects) and the auxiliary data from |
| 302 | the netfs. The latter is used to detect stale objects in the cache and update |
| 303 | or retire them. |
| 304 | |
| 305 | |
| 306 | Note that CacheFiles will erase from the cache any file it doesn't recognise or |
| 307 | any file of an incorrect type (such as a FIFO file or a device file). |
| 308 | |
| 309 | |
| 310 | ========================== |
| 311 | SECURITY MODEL AND SELINUX |
| 312 | ========================== |
| 313 | |
| 314 | CacheFiles is implemented to deal properly with the LSM security features of |
| 315 | the Linux kernel and the SELinux facility. |
| 316 | |
| 317 | One of the problems that CacheFiles faces is that it is generally acting on |
| 318 | behalf of a process, and running in that process's context, and that includes a |
| 319 | security context that is not appropriate for accessing the cache - either |
| 320 | because the files in the cache are inaccessible to that process, or because if |
| 321 | the process creates a file in the cache, that file may be inaccessible to other |
| 322 | processes. |
| 323 | |
| 324 | The way CacheFiles works is to temporarily change the security context (fsuid, |
| 325 | fsgid and actor security label) that the process acts as - without changing the |
| 326 | security context of the process when it is the target of an operation performed by |
| 327 | some other process (so signalling and suchlike still work correctly). |
| 328 | |
| 329 | |
| 330 | When the CacheFiles module is asked to bind to its cache, it: |
| 331 | |
| 332 | (1) Finds the security label attached to the root cache directory and uses |
| 333 | that as the security label with which it will create files. By default, |
| 334 | this is: |
| 335 | |
| 336 | cachefiles_var_t |
| 337 | |
| 338 | (2) Finds the security label of the process which issued the bind request |
| 339 | (presumed to be the cachefilesd daemon), which by default will be: |
| 340 | |
| 341 | cachefilesd_t |
| 342 | |
| 343 | and asks LSM to supply a security ID as which it should act given the |
| 344 | daemon's label. By default, this will be: |
| 345 | |
| 346 | cachefiles_kernel_t |
| 347 | |
| 348 | SELinux transitions the daemon's security ID to the module's security ID |
| 349 | based on a rule of this form in the policy: |
| 350 | |
| 351 | type_transition <daemon's-ID> kernel_t : process <module's-ID>; |
| 352 | |
| 353 | For instance: |
| 354 | |
| 355 | type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; |
| 356 | |
| 357 | |
| 358 | The module's security ID gives it permission to create, move and remove files |
| 359 | and directories in the cache, to find and access directories and files in the |
| 360 | cache, to set and access extended attributes on cache objects, and to read and |
| 361 | write files in the cache. |
| 362 | |
| 363 | The daemon's security ID gives it only a very restricted set of permissions: it |
| 364 | may scan directories, stat files and erase files and directories. It may |
| 365 | not read or write files in the cache, and so it is precluded from accessing the |
| 366 | data cached therein; nor is it permitted to create new files in the cache. |
| 367 | |
| 368 | |
| 369 | There are policy source files available in: |
| 370 | |
| 371 | http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 |
| 372 | |
| 373 | and later versions. In that tarball, see the files: |
| 374 | |
| 375 | cachefilesd.te |
| 376 | cachefilesd.fc |
| 377 | cachefilesd.if |
| 378 | |
| 379 | They are built and installed directly by the RPM. |
| 380 | |
| 381 | If a non-RPM based system is being used, then copy the above files to their own |
| 382 | directory and run: |
| 383 | |
| 384 | make -f /usr/share/selinux/devel/Makefile |
| 385 | semodule -i cachefilesd.pp |
| 386 | |
| 387 | You will need checkpolicy and selinux-policy-devel installed prior to the |
| 388 | build. |
| 389 | |
| 390 | |
| 391 | By default, the cache is located in /var/fscache, but if it is desirable that |
| 392 | it should be elsewhere, then either the above policy files must be altered, or |
| 393 | an auxiliary policy must be installed to label the alternate location of the |
| 394 | cache. |
| 395 | |
| 396 | For instructions on how to add an auxiliary policy to enable the cache to be |
| 397 | located elsewhere when SELinux is in enforcing mode, please see: |
| 398 | |
| 399 | /usr/share/doc/cachefilesd-*/move-cache.txt |
| 400 | |
| 401 | when the cachefilesd rpm is installed; alternatively, the document can be found |
| 402 | in the sources. |
| 403 | |
| 404 | |
| 405 | ================== |
| 406 | A NOTE ON SECURITY |
| 407 | ================== |
| 408 | |
| 409 | CacheFiles makes use of the split security in the task_struct. It allocates |
| 410 | its own task_security structure, and redirects current->cred to point to it |
| 411 | when it acts on behalf of another process, in that process's context. |
| 412 | |
| 413 | The reason it does this is that it calls vfs_mkdir() and suchlike rather than |
| 414 | bypassing security and calling inode ops directly. Therefore the VFS and LSM |
| 415 | may deny the CacheFiles access to the cache data because under some |
| 416 | circumstances the caching code is running in the security context of whatever |
| 417 | process issued the original syscall on the netfs. |
| 418 | |
| 419 | Furthermore, should CacheFiles create a file or directory, the security |
| 420 | parameters with which that object is created (UID, GID, security label) would be |
| 421 | derived from the process that issued the system call, thus potentially |
| 422 | preventing other processes from accessing the cache - including CacheFiles's |
| 423 | cache management daemon (cachefilesd). |
| 424 | |
| 425 | What is required is to temporarily override the security of the process that |
| 426 | issued the system call. We can't, however, just do an in-place change of the |
| 427 | security data as that affects the process as an object, not just as a subject. |
| 428 | This means it may lose signals or ptrace events for example, and affects what |
| 429 | the process looks like in /proc. |
| 430 | |
| 431 | So CacheFiles makes use of a logical split in the security between the |
| 432 | objective security (task->real_cred) and the subjective security (task->cred). |
| 433 | The objective security holds the intrinsic security properties of a process and |
| 434 | is never overridden. This is what appears in /proc, and is what is used when a |
| 435 | process is the target of an operation by some other process (SIGKILL for |
| 436 | example). |
| 437 | |
| 438 | The subjective security holds the active security properties of a process, and |
| 439 | may be overridden. This is not seen externally, and is used when a process |
| 440 | acts upon another object, for example SIGKILLing another process or opening a |
| 441 | file. |
| 442 | |
| 443 | LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request |
| 444 | for CacheFiles to run in a context of a specific security label, or to create |
| 445 | files and directories with another security label. |
| 446 | |
| 447 | |
| 448 | ======================= |
| 449 | STATISTICAL INFORMATION |
| 450 | ======================= |
| 451 | |
| 452 | If CacheFiles is compiled with the following option enabled: |
| 453 | |
| 454 | CONFIG_CACHEFILES_HISTOGRAM=y |
| 455 | |
| 456 | then it will gather certain statistics and display them through a proc file. |
| 457 | |
| 458 | (*) /proc/fs/cachefiles/histogram |
| 459 | |
| 460 | cat /proc/fs/cachefiles/histogram |
| 461 | JIFS SECS LOOKUPS MKDIRS CREATES |
| 462 | ===== ===== ========= ========= ========= |
| 463 | |
| 464 | This shows, for each duration between 0 jiffies and HZ-1 jiffies, the number |
| 465 | of times each of a variety of tasks took that long to run. The columns are |
| 466 | as follows: |
| 467 | |
| 468 | COLUMN  TIME MEASUREMENT |
| 469 | ======= ======================================================= |
| 470 | LOOKUPS Length of time to perform a lookup on the backing fs |
| 471 | MKDIRS  Length of time to perform a mkdir on the backing fs |
| 472 | CREATES Length of time to perform a create on the backing fs |
| 473 | |
| 474 | Each row shows the number of events that took a particular range of times. |
| 475 | Each step is 1 jiffy in size. The JIFS column indicates the particular |
| 476 | jiffy range covered, and the SECS field the equivalent number of seconds. |
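A script that post-processes the histogram might look like the Python sketch
below. The column headings come from the listing above, but the data rows are
invented for illustration (they are not measurements), the exact row layout is
an assumption, and the SECS values assume HZ=250. parse_histogram() is a
hypothetical helper.

```python
# Hypothetical sample in the column layout shown above; the counts are
# invented for illustration and are not measurements.
SAMPLE = """\
JIFS  SECS  LOOKUPS   MKDIRS    CREATES
===== ===== ========= ========= =========
    0 0.000      1200        30       410
    1 0.004        15         2        12
    2 0.008         3         0         1
"""

def parse_histogram(text):
    """Turn histogram text into a list of per-row dictionaries."""
    rows = []
    for line in text.splitlines()[2:]:        # skip the two header lines
        jifs, secs, lookups, mkdirs, creates = line.split()
        rows.append({"jifs": int(jifs), "secs": float(secs),
                     "lookups": int(lookups), "mkdirs": int(mkdirs),
                     "creates": int(creates)})
    return rows

rows = parse_histogram(SAMPLE)
# Count lookups that took one jiffy or longer.
slow = sum(r["lookups"] for r in rows if r["jifs"] >= 1)
print(slow)
```

On a real system the text would be read from /proc/fs/cachefiles/histogram
instead of the sample string.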
| 477 | |
| 478 | |
| 479 | ========= |
| 480 | DEBUGGING |
| 481 | ========= |
| 482 | |
| 483 | If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime |
| 484 | debugging enabled by adjusting the value in: |
| 485 | |
| 486 | /sys/module/cachefiles/parameters/debug |
| 487 | |
| 488 | This is a bitmask of debugging streams to enable: |
| 489 | |
| 490 | BIT     VALUE   STREAM                          POINT |
| 491 | ======= ======= =============================== ======================= |
| 492 | 0       1       General                         Function entry trace |
| 493 | 1       2                                       Function exit trace |
| 494 | 2       4                                       General |
| 495 | |
| 496 | The appropriate set of values should be OR'd together and the result written to |
| 497 | the control file. For example: |
| 498 | |
| 499 | echo $((1|4)) >/sys/module/cachefiles/parameters/debug |
| 500 | |
| 501 | will turn on function entry tracing and the general debug points. |
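The mask arithmetic can also be done in a script. The Python sketch below ORs
together the bit values from the table above. The constant names ENTER, LEAVE
and DEBUG are labels chosen for this example and do not appear in the kernel
interface.

```python
# Bit values from the table above; the names are hypothetical labels.
ENTER, LEAVE, DEBUG = 1, 2, 4

def debug_mask(*streams):
    """OR the selected stream values into a single mask."""
    mask = 0
    for bit in streams:
        mask |= bit
    return mask

mask = debug_mask(ENTER, DEBUG)
print(mask)        # value to write to the control file
```

Writing the result to /sys/module/cachefiles/parameters/debug requires
appropriate privileges and a kernel built with CONFIG_CACHEFILES_DEBUG.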