Documentation for /proc/sys/fs/*        kernel version 2.2.10
        (c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
        (c) 2009,        Shen Feng <shen@cn.fujitsu.com>

For general info and legal blurb, please look in README.

==============================================================

This file contains documentation for the sysctl files in
/proc/sys/fs/ and is valid for Linux kernel version 2.2.

The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
kernel. Since some of the files _can_ be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.

1. /proc/sys/fs
----------------------------------------------------------

Currently, these files are in /proc/sys/fs:
- aio-max-nr
- aio-nr
- dentry-state
- dquot-max
- dquot-nr
- file-max
- file-nr
- inode-max
- inode-nr
- inode-state
- mount-max
- nr_open
- overflowuid
- overflowgid
- pipe-user-pages-hard
- pipe-user-pages-soft
- protected_fifos
- protected_hardlinks
- protected_regular
- protected_symlinks
- suid_dumpable
- super-max
- super-nr

==============================================================

aio-nr & aio-max-nr:

aio-nr is the running total of the number of events specified on the
io_setup system call for all currently active aio contexts. If aio-nr
reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
raising aio-max-nr does not result in the pre-allocation or re-sizing
of any kernel data structures.
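
The admission rule above can be modelled in a few lines of C. This is a
simplified sketch, not kernel code: aio_setup_check() is a hypothetical
name, and it assumes a request is charged exactly nr_events against
aio-nr, whereas the kernel may round the requested count up internally
before charging it.

```c
#include <errno.h>

/* Hypothetical model of the aio-nr / aio-max-nr admission rule.
 * Returns 0 if an io_setup() asking for nr_events would be admitted,
 * or EAGAIN if the running total would exceed aio-max-nr.
 * Assumes nr_events is charged as-is; the kernel may round it up. */
int aio_setup_check(unsigned long aio_nr, unsigned long aio_max_nr,
                    unsigned long nr_events)
{
    unsigned long total = aio_nr + nr_events;

    if (total < aio_nr)          /* unsigned overflow: treat as over limit */
        return EAGAIN;
    return total > aio_max_nr ? EAGAIN : 0;
}
```

For example, with aio-max-nr at 65536 and aio-nr at 65500, a request for
128 events is refused with EAGAIN under this model, while the same
request at aio-nr 65408 lands exactly on the limit and is admitted.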

==============================================================

dentry-state:

From linux/fs/dcache.c:
--------------------------------------------------------------
struct {
        int nr_dentry;
        int nr_unused;
        int age_limit;         /* age in seconds */
        int want_pages;        /* pages requested by system */
        int dummy[2];
} dentry_stat = {0, 0, 45, 0,};
--------------------------------------------------------------

Dentries are dynamically allocated and deallocated, and
nr_dentry seems to be 0 all the time. Hence it's safe to
assume that only nr_unused, age_limit and want_pages are
used. Nr_unused is exactly what its name says: the number
of dentries that are currently unused. Age_limit is the age
in seconds after which dcache entries can be reclaimed when
memory is short, and want_pages is nonzero when
shrink_dcache_pages() has been called and the dcache isn't
pruned yet.
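
To inspect these counters from a program, the integers in
/proc/sys/fs/dentry-state can be read in the same order as the struct
above. A minimal sketch, assuming the file holds six whitespace-separated
integers; struct dentry_state and parse_dentry_state() are hypothetical
names, and newer kernels may give the later fields different meanings.

```c
#include <stdio.h>

/* Fields in the order they appear in the dentry_stat struct above. */
struct dentry_state {
    long nr_dentry;
    long nr_unused;
    long age_limit;     /* seconds */
    long want_pages;
    long dummy[2];
};

/* Parse the text of /proc/sys/fs/dentry-state (six whitespace-separated
 * integers). Returns 0 on success, -1 if fewer than six were found. */
int parse_dentry_state(const char *text, struct dentry_state *ds)
{
    int n = sscanf(text, "%ld %ld %ld %ld %ld %ld",
                   &ds->nr_dentry, &ds->nr_unused, &ds->age_limit,
                   &ds->want_pages, &ds->dummy[0], &ds->dummy[1]);
    return n == 6 ? 0 : -1;
}
```

The text would typically come from reading the file with fgets(); tabs
and spaces are both accepted because sscanf() skips any whitespace
before each %ld.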

==============================================================

dquot-max & dquot-nr:

The file dquot-max shows the maximum number of cached disk
quota entries.

The file dquot-nr shows the number of allocated disk quota
entries and the number of free disk quota entries.

If the number of free cached disk quotas is very low and
you have a very large number of simultaneous system users,
you might want to raise the limit.

==============================================================

file-max & file-nr:

The value in file-max denotes the maximum number of file
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.

Historically, the kernel was able to allocate file handles
dynamically, but not to free them again. The three values in
file-nr denote the number of allocated file handles, the number
of allocated but unused file handles, and the maximum number of
file handles. Linux 2.6 always reports 0 as the number of free
file handles. This is not an error; it just means that the
number of allocated file handles exactly matches the number of
used file handles.

Attempts to allocate more file descriptors than file-max are
reported with printk; look for "VFS: file-max limit <number>
reached".
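
A monitoring tool can read file-nr and work out how much headroom is
left before file-max is hit. A minimal sketch, assuming the file contains
three whitespace-separated unsigned integers in the order described
above; struct file_nr, parse_file_nr(), read_file_nr() and
file_handles_left() are hypothetical names.

```c
#include <stdio.h>

/* The three values of /proc/sys/fs/file-nr, in file order. */
struct file_nr {
    unsigned long allocated;  /* allocated file handles */
    unsigned long unused;     /* allocated but unused (0 on Linux 2.6) */
    unsigned long max;        /* maximum, i.e. file-max */
};

/* Parse "allocated unused max"; returns 0 on success, -1 otherwise. */
int parse_file_nr(const char *text, struct file_nr *fn)
{
    return sscanf(text, "%lu %lu %lu",
                  &fn->allocated, &fn->unused, &fn->max) == 3 ? 0 : -1;
}

/* Read and parse the live file; returns 0 on success, -1 on error. */
int read_file_nr(struct file_nr *fn)
{
    char buf[128];
    FILE *f = fopen("/proc/sys/fs/file-nr", "r");

    if (f == NULL)
        return -1;
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line != NULL ? parse_file_nr(line, fn) : -1;
}

/* Handles that can still be allocated before file-max is reached. */
unsigned long file_handles_left(const struct file_nr *fn)
{
    return fn->max > fn->allocated ? fn->max - fn->allocated : 0;
}
```

With a file-nr line of "4032 0 815360", file_handles_left() reports
811328 handles of headroom.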

==============================================================

nr_open:

This denotes the maximum number of file handles a process can
allocate. The default value is 1024*1024 (1048576), which should
be enough for most machines. The actual limit for a given process
depends on its RLIMIT_NOFILE resource limit.
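
Because the kernel refuses to set an RLIMIT_NOFILE hard limit above
nr_open, a process that needs many descriptors usually raises its soft
limit up to its existing hard limit. A minimal sketch using the standard
getrlimit()/setrlimit() calls; the helper name
raise_nofile_to_hard_limit() is hypothetical.

```c
#include <sys/resource.h>

/* Raise this process's soft RLIMIT_NOFILE to its hard limit.
 * Returns the new soft limit, or 0 on failure. Raising the hard limit
 * itself needs privilege and can never exceed nr_open. */
rlim_t raise_nofile_to_hard_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    rl.rlim_cur = rl.rlim_max;           /* soft limit may go up to hard */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    return rl.rlim_cur;
}
```

Raising the soft limit up to the hard limit needs no privilege, which
is why this is the usual first step before asking an administrator to
raise the hard limit or nr_open itself.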

==============================================================

inode-max, inode-nr & inode-state:

As with file handles, the kernel allocates the inode structures
dynamically, but can't free them yet.

The value in inode-max denotes the maximum number of inode
structures. This value should be 3-4 times larger than the value
in file-max, since stdin, stdout and network sockets also
need an inode struct to handle them. When you regularly run
out of inodes, you need to increase this value.

The file inode-nr contains the first two items from
inode-state, so we'll skip to that file...

Inode-state contains three actual numbers and four dummies.
The actual numbers are, in order of appearance, nr_inodes,
nr_free_inodes and preshrink.

Nr_inodes stands for the number of inodes the system has
allocated; this can be slightly more than inode-max because
Linux allocates them one pageful at a time.

Nr_free_inodes represents the number of free inodes (?) and
preshrink is nonzero when nr_inodes > inode-max and the
system needs to prune the inode list instead of allocating
more.
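
The layout just described (three meaningful values followed by four
dummies, with inode-nr repeating the first two) can be parsed as shown
in this sketch. It assumes whitespace-separated integers; the struct
and function names are hypothetical.

```c
#include <stdio.h>

/* inode-state: three meaningful fields followed by four dummies. */
struct inode_state {
    long nr_inodes;
    long nr_free_inodes;
    long preshrink;      /* nonzero when nr_inodes > inode-max */
    long dummy[4];
};

/* Parse all seven values of inode-state; returns 0 on success. */
int parse_inode_state(const char *text, struct inode_state *is)
{
    int n = sscanf(text, "%ld %ld %ld %ld %ld %ld %ld",
                   &is->nr_inodes, &is->nr_free_inodes, &is->preshrink,
                   &is->dummy[0], &is->dummy[1], &is->dummy[2],
                   &is->dummy[3]);
    return n == 7 ? 0 : -1;
}

/* inode-nr holds only the first two fields of inode-state. */
int parse_inode_nr(const char *text, struct inode_state *is)
{
    return sscanf(text, "%ld %ld", &is->nr_inodes,
                  &is->nr_free_inodes) == 2 ? 0 : -1;
}
```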

==============================================================

overflowgid & overflowuid:

Some filesystems only support 16-bit UIDs and GIDs, although in Linux
UIDs and GIDs are 32 bits. When one of these filesystems is mounted
with writes enabled, any UID or GID that would exceed 65535 is translated
to a fixed value before being written to disk.

These sysctls allow you to change the value of the fixed UID and GID.
The default is 65534.
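
The translation rule amounts to a single comparison. This sketch follows
the same idea as the kernel's high2lowuid() macro, but to_16bit_id() is
a hypothetical name, and the overflow value is assumed to be whatever
overflowuid or overflowgid currently holds.

```c
#include <stdint.h>

/* Map a 32-bit UID or GID to the 16-bit value a 16-bit-ID filesystem
 * stores. IDs above 65535 are replaced by `overflow`, i.e. the value
 * of /proc/sys/fs/overflowuid (or overflowgid). */
uint16_t to_16bit_id(uint32_t id, uint16_t overflow)
{
    return id > 65535u ? overflow : (uint16_t)id;
}
```

With the default of 65534, UID 100000 would be stored as 65534, while
UID 1000 is written unchanged.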

==============================================================

pipe-user-pages-hard:

Maximum total number of pages a non-privileged user may allocate for pipes.
Once this limit is reached, no new pipes may be allocated until usage goes
below the limit again. When set to 0, no limit is applied, which is the default
setting.

==============================================================

pipe-user-pages-soft:

Maximum total number of pages a non-privileged user may allocate for pipes
before the pipe size gets limited to a single page. Once this limit is reached,
new pipes will be limited to a single page in size for this user in order to
limit total memory usage, and trying to increase them using fcntl() will be
denied until usage goes below the limit again. The default value allows a user
to allocate up to 1024 pipes at their default size. When set to 0, no limit is
applied.
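
The interplay of the two limits can be sketched as a decision function.
This is a simplified model of the text above, not the kernel's exact
accounting. new_pipe_outcome() and the constants are hypothetical, and
the 16-page default pipe size (64 KiB with 4 KiB pages) is an assumption
that gives a soft limit of 16 * 1024 = 16384 pages for "1024 pipes at
their default size".

```c
/* Outcome for a new pipe created by a non-privileged user. */
enum pipe_outcome { PIPE_REFUSED, PIPE_ONE_PAGE, PIPE_DEFAULT_SIZE };

/* Assumed default pipe size of 16 pages; a soft limit that allows 1024
 * such pipes is therefore 16 * 1024 pages. */
enum { DEFAULT_PIPE_PAGES = 16,
       DEFAULT_SOFT_LIMIT = DEFAULT_PIPE_PAGES * 1024 };

/* Hypothetical model of pipe-user-pages-hard/-soft. `user_pages` is the
 * number of pipe pages the user already holds; a limit of 0 = no limit. */
enum pipe_outcome new_pipe_outcome(unsigned long user_pages,
                                   unsigned long soft, unsigned long hard)
{
    if (hard != 0 && user_pages >= hard)
        return PIPE_REFUSED;       /* hard limit reached: no new pipe */
    if (soft != 0 && user_pages >= soft)
        return PIPE_ONE_PAGE;      /* soft limit reached: one page only */
    return PIPE_DEFAULT_SIZE;
}
```

Checking the hard limit first matters: once both limits are exceeded
the pipe is refused outright rather than shrunk to a single page.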

==============================================================

protected_fifos:

The intent of this protection is to avoid unintentional writes to
an attacker-controlled FIFO, where a program expected to create a regular
file.

When set to "0", writing to FIFOs is unrestricted.

When set to "1", don't allow O_CREAT open on FIFOs that we don't own
in world-writable sticky directories, unless they are owned by the
owner of the directory.

When set to "2", it also applies to group-writable sticky directories.

This protection is based on the restrictions in Openwall.

==============================================================

protected_hardlinks:

A long-standing class of security issues is the hardlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given hardlink (i.e. a
root process follows a hardlink created by another user). Additionally,
on systems without separated partitions, this stops unauthorized users
from "pinning" vulnerable setuid/setgid files against being upgraded by
the administrator, or linking to special files.

When set to "0", hardlink creation behavior is unrestricted.

When set to "1", hardlinks cannot be created by users unless they
already own the source file or have read/write access to it.

This protection is based on the restrictions in Openwall and grsecurity.
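
The rule for protected_hardlinks = 1 can be written as a small
predicate. This is a hypothetical model of the documented behaviour
only: may_hardlink() is an illustrative name, and the in-kernel check is
stricter (for example, it also refuses setuid sources and non-regular
files on the read/write path, and lets CAP_FOWNER bypass the check).

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical model of protected_hardlinks. Returns true if `user`
 * may create a hardlink to a source file owned by `source_owner`. */
bool may_hardlink(int protected_hardlinks, uid_t user, uid_t source_owner,
                  bool can_read, bool can_write)
{
    if (protected_hardlinks == 0)
        return true;               /* "0": unrestricted */
    if (user == source_owner)
        return true;               /* owners may always link */
    return can_read && can_write;  /* otherwise need read and write */
}
```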

==============================================================

protected_regular:

This protection is similar to protected_fifos, but it
avoids writes to an attacker-controlled regular file, where a program
expected to create one.

When set to "0", writing to regular files is unrestricted.

When set to "1", don't allow O_CREAT open on regular files that we
don't own in world-writable sticky directories, unless they are
owned by the owner of the directory.

When set to "2", it also applies to group-writable sticky directories.

==============================================================

protected_symlinks:

A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

When set to "0", symlink following behavior is unrestricted.

When set to "1", symlinks are permitted to be followed only when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.

This protection is based on the restrictions in Openwall and grsecurity.

==============================================================

suid_dumpable:

This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are

0 - (default) - traditional behaviour. Any process which has changed
        privilege levels or is execute-only will not be dumped.
1 - (debug) - all processes dump core when possible. The core dump is
        owned by the current user and no security is applied. This is
        intended for system debugging situations only. Ptrace is unchecked.
        This is insecure as it allows regular users to examine the memory
        contents of privileged processes.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
        anyway, but only if the "core_pattern" kernel sysctl is set to
        either a pipe handler or a fully qualified path. (For more details
        on this limitation, see CVE-2006-2451.) This mode is appropriate
        when administrators are attempting to debug problems in a normal
        environment, and either have a core dump pipe handler that knows
        to treat privileged core dumps with care, or a specific directory
        defined for catching core dumps. If a core dump happens without
        a pipe handler or fully qualified path, a message will be emitted
        to syslog warning about the lack of a correct setting.
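
The three modes imply a simple decision for a process that has changed
privilege levels. This is a hypothetical model, not kernel code:
privileged_core_dumped() is an illustrative name, it assumes the caller
passes the current contents of /proc/sys/kernel/core_pattern, and it
treats a pattern starting with '|' as a pipe handler and one starting
with '/' as a fully qualified path.

```c
#include <stdbool.h>

/* Hypothetical model: would a process that changed privilege levels
 * dump core under the given suid_dumpable mode and core_pattern? */
bool privileged_core_dumped(int suid_dumpable, const char *core_pattern)
{
    switch (suid_dumpable) {
    case 1:                     /* debug: dump whenever possible */
        return true;
    case 2:                     /* suidsafe: pipe handler or absolute path */
        return core_pattern[0] == '|' || core_pattern[0] == '/';
    default:                    /* 0: traditional, no dump */
        return false;
    }
}
```

Under mode 2 a relative pattern such as "core" produces no dump (and,
per the text above, a syslog warning), whereas an absolute path or a
pipe handler is accepted.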

==============================================================

super-max & super-nr:

These numbers control the maximum number of superblocks, and
thus the maximum number of mounted filesystems the kernel
can have. You only need to increase super-max if you need to
mount more filesystems than the current value in super-max
allows you to.

==============================================================

aio-nr & aio-max-nr:

aio-nr shows the current system-wide number of asynchronous I/O
requests. aio-max-nr allows you to change the maximum value
aio-nr can grow to.

==============================================================

mount-max:

This denotes the maximum number of mounts that may exist
in a mount namespace.

==============================================================


2. /proc/sys/fs/binfmt_misc
----------------------------------------------------------

Documentation for the files in /proc/sys/fs/binfmt_misc is
in Documentation/binfmt_misc.txt.


3. /proc/sys/fs/mqueue - POSIX message queues filesystem
----------------------------------------------------------

The "mqueue" filesystem provides the necessary kernel features to enable the
creation of a user space library that implements the POSIX message queues
API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
Interfaces specification).

The "mqueue" filesystem contains values for determining/setting the amount of
resources used by the file system.

/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
maximum number of message queues allowed on the system.

/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
maximum number of messages in a queue. It is the upper bound for the
per-queue limit that a user sets in the mq_open() call: that attribute of
a queue must be less than or equal to msg_max.

/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
maximum message size. Every message queue's message size attribute, set
during its creation, must not exceed this value.

/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
default number of messages in a queue if the attr parameter of mq_open(2) is
NULL. If it exceeds msg_max, the default value is initialized to msg_max.

/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
the default message size if the attr parameter of mq_open(2) is NULL. If it
exceeds msgsize_max, the default value is initialized to msgsize_max.
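
The capping of the defaults, and the limit that applies to an explicit
attr, reduce to simple comparisons. A sketch under stated assumptions:
the helper names are hypothetical, and the limit check ignores processes
with CAP_SYS_RESOURCE, which the mq_open(3) manual page documents as
being allowed to exceed msg_max and msgsize_max.

```c
/* Effective default used when mq_open() is called with attr == NULL:
 * the configured default (msg_default or msgsize_default), capped at
 * the corresponding maximum (msg_max or msgsize_max). */
long mq_effective_default(long dflt, long max)
{
    return dflt > max ? max : dflt;
}

/* Hypothetical check for an explicit mq_attr field (mq_maxmsg or
 * mq_msgsize) in an unprivileged process: it must be positive and
 * must not exceed the corresponding maximum. */
int mq_attr_within_limit(long requested, long max)
{
    return requested > 0 && requested <= max;
}
```

In practice, mq_open(name, O_CREAT | O_RDWR, 0600, NULL) picks up the
effective defaults, while an unprivileged call whose mq_maxmsg exceeds
msg_max fails with EINVAL.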

4. /proc/sys/fs/epoll - Configuration options for the epoll interface
--------------------------------------------------------

This directory contains configuration options for the epoll(7) interface.

max_user_watches
----------------

Every epoll file descriptor can store a number of files to be monitored
for event readiness. Each one of these monitored files constitutes a "watch".
This configuration option sets the maximum number of "watches" that are
allowed for each user.
Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes
on a 64-bit one.
The current default value for max_user_watches is 1/32 of the available
low memory, divided by the "watch" cost in bytes.
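
The default formula above is plain integer arithmetic. This sketch
follows the 1/32 rule as stated. default_max_user_watches() is a
hypothetical name, and both the low-memory size and the per-watch cost
are inputs the caller must supply (for example 160 bytes on a 64-bit
kernel).

```c
#include <stdint.h>

/* Default max_user_watches per the rule above: 1/32 of low memory,
 * divided by the per-watch cost in bytes (integer division). */
uint64_t default_max_user_watches(uint64_t lowmem_bytes, uint64_t watch_cost)
{
    return (lowmem_bytes / 32) / watch_cost;
}
```

With 1 GiB of low memory and 160 bytes per watch this gives 209715
watches per user. Later kernels changed the fraction to 4% of low
memory (dividing by 25 instead of 32), so the divisor is best treated as
version-dependent.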