Kees Cook | 9f80366 | 2016-05-16 19:27:28 -0700 | [diff] [blame] | 1 | # Kernel Self-Protection |
| 2 | |
| 3 | Kernel self-protection is the design and implementation of systems and |
| 4 | structures within the Linux kernel to protect against security flaws in |
| 5 | the kernel itself. This covers a wide range of issues, including removing |
| 6 | entire classes of bugs, blocking security flaw exploitation methods, |
| 7 | and actively detecting attack attempts. Not all topics are explored in |
| 8 | this document, but it should serve as a reasonable starting point and |
| 9 | answer any frequently asked questions. (Patches welcome, of course!) |
| 10 | |
| 11 | In the worst-case scenario, we assume an unprivileged local attacker |
| 12 | has arbitrary read and write access to the kernel's memory. In many |
| 13 | cases, bugs being exploited will not provide this level of access, |
| 14 | but with systems in place that defend against the worst case we'll |
| 15 | cover the more limited cases as well. A higher bar, and one that should |
| 16 | still be kept in mind, is protecting the kernel against a _privileged_ |
| 17 | local attacker, since the root user has access to a vastly increased |
| 18 | attack surface. (Especially when they have the ability to load arbitrary |
| 19 | kernel modules.) |
| 20 | |
| 21 | The goals for successful self-protection systems would be that they |
| 22 | are effective, on by default, require no opt-in by developers, have no |
| 23 | performance impact, do not impede kernel debugging, and have tests. It |
| 24 | is uncommon that all these goals can be met, but it is worth explicitly |
| 25 | mentioning them, since these aspects need to be explored, dealt with, |
| 26 | and/or accepted. |
| 27 | |
| 28 | |
| 29 | ## Attack Surface Reduction |
| 30 | |
| 31 | The most fundamental defense against security exploits is to reduce the |
| 32 | areas of the kernel that can be used to redirect execution. This ranges |
| 33 | from limiting the exposed APIs available to userspace and making in-kernel |
| 34 | APIs hard to use incorrectly to minimizing the areas of writable kernel |
| 35 | memory. |
| 36 | |
| 37 | ### Strict kernel memory permissions |
| 38 | |
| 39 | When all of kernel memory is writable, it becomes trivial for attacks |
| 40 | to redirect execution flow. To reduce the availability of these targets |
| 41 | the kernel needs to protect its memory with a tight set of permissions. |
| 42 | |
| 43 | #### Executable code and read-only data must not be writable |
| 44 | |
| 45 | Any areas of the kernel with executable memory must not be writable. |
| 46 | While this obviously includes the kernel text itself, we must consider |
| 47 | all additional places too: kernel modules, JIT memory, etc. (There are |
| 48 | temporary exceptions to this rule to support things like instruction |
| 49 | alternatives, breakpoints, kprobes, etc. If these must exist in a |
| 50 | kernel, they are implemented in a way where the memory is temporarily |
| 51 | made writable during the update, and then returned to the original |
| 52 | permissions.) |
| 53 | |
| 54 | In support of this are (the poorly named) CONFIG_DEBUG_RODATA and |
| 55 | CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not |
| 56 | writable, data is not executable, and read-only data is neither writable |
| 57 | nor executable. |
| 58 | |
| 59 | #### Function pointers and sensitive variables must not be writable |
| 60 | |
| 61 | Vast areas of kernel memory contain function pointers that are looked |
| 62 | up by the kernel and used to continue execution (e.g. descriptor/vector |
| 63 | tables, file/network/etc operation structures, etc). The number of these |
| 64 | variables must be reduced to an absolute minimum. |
| 65 | |
| 66 | Many such variables can be made read-only by setting them "const" |
| 67 | so that they live in the .rodata section instead of the .data section |
| 68 | of the kernel, gaining the protection of the kernel's strict memory |
| 69 | permissions as described above. |
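
As a minimal userspace sketch of the "const" approach described above: the structure, function names, and table below are hypothetical and only loosely modeled on kernel operation structures. Whether the table actually lands in a read-only section depends on the toolchain and on the kernel enforcing strict memory permissions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modeled on kernel *_operations. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "const" allows the toolchain to place the table in a read-only
 * section (typically .rodata) instead of .data, so the function
 * pointers are covered by read-only memory permissions.
 */
static const struct demo_ops demo_file_ops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers receive a pointer-to-const and cannot modify the table. */
static int demo_call_open(const struct demo_ops *ops, int id)
{
	assert(ops != NULL && ops->open != NULL);
	return ops->open(id);
}
```

Calling `demo_call_open(&demo_file_ops, 41)` dispatches through the read-only table; an attempt to assign to `demo_file_ops.open` is rejected at compile time.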
| 70 | |
| 71 | Variables that are initialized once at __init time can |
| 72 | be marked with the (new and under development) __ro_after_init |
| 73 | attribute. |
| 74 | |
| 75 | What remains are variables that are updated rarely (e.g. GDT). These |
| 76 | will need another infrastructure (similar to the temporary exceptions |
| 77 | made to kernel code mentioned above) that allows them to spend the rest |
| 78 | of their lifetime read-only. (For example, when being updated, only the |
| 79 | CPU thread performing the update would be given uninterruptible write |
| 80 | access to the memory.) |
| 81 | |
| 82 | #### Segregation of kernel memory from userspace memory |
| 83 | |
| 84 | The kernel must never execute userspace memory. The kernel must also never |
| 85 | access userspace memory without explicit expectation to do so. These |
| 86 | rules can be enforced either by support of hardware-based restrictions |
| 87 | (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains). |
| 88 | By blocking userspace memory in this way, execution and data parsing |
| 89 | cannot be passed to trivially-controlled userspace memory, forcing |
| 90 | attacks to operate entirely in kernel memory. |
| 91 | |
| 92 | ### Reduced access to syscalls |
| 93 | |
| 94 | One trivial way to eliminate many syscalls for 64-bit systems is building |
| 95 | without CONFIG_COMPAT. However, this is rarely a feasible scenario. |
| 96 | |
| 97 | The "seccomp" system provides an opt-in feature made available to |
| 98 | userspace, which provides a way to reduce the number of kernel entry |
| 99 | points available to a running process. This limits the breadth of kernel |
| 100 | code that can be reached, possibly reducing the availability of a given |
| 101 | bug to an attack. |
| 102 | |
| 103 | An area of improvement would be creating viable ways to keep access to |
| 104 | things like compat, user namespaces, BPF creation, and perf limited only |
| 105 | to trusted processes. This would keep the scope of kernel entry points |
| 106 | restricted to the more regular set normally available to unprivileged |
| 107 | userspace. |
| 108 | |
| 109 | ### Restricting access to kernel modules |
| 110 | |
| 111 | The kernel should never allow an unprivileged user the ability to |
| 112 | load specific kernel modules, since that would provide a facility to |
| 113 | unexpectedly extend the available attack surface. (The on-demand loading |
| 114 | of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is |
| 115 | considered "expected" here, though additional consideration should be |
| 116 | given even to these.) For example, loading a filesystem module via an |
| 117 | unprivileged socket API is nonsense: only the root or physically local |
| 118 | user should trigger filesystem module loading. (And even this can be up |
| 119 | for debate in some scenarios.) |
| 120 | |
| 121 | To protect against even privileged users, systems may need to either |
| 122 | disable module loading entirely (e.g. monolithic kernel builds or |
| 123 | modules_disabled sysctl), or provide signed modules (e.g. |
| 124 | CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having |
| 125 | root load arbitrary kernel code via the module loader interface. |
| 126 | |
| 127 | |
| 128 | ## Memory integrity |
| 129 | |
| 130 | There are many memory structures in the kernel that are regularly abused |
| 131 | to gain execution control during an attack. By far the most commonly |
| 132 | understood is that of the stack buffer overflow in which the return |
| 133 | address stored on the stack is overwritten. Many other examples of this |
| 134 | kind of attack exist, and protections exist to defend against them. |
| 135 | |
| 136 | ### Stack buffer overflow |
| 137 | |
| 138 | The classic stack buffer overflow involves writing past the expected end |
| 139 | of a variable stored on the stack, ultimately writing a controlled value |
| 140 | to the stack frame's stored return address. The most widely used defense |
| 141 | is the presence of a stack canary between the stack variables and the |
| 142 | return address (CONFIG_CC_STACKPROTECTOR), which is verified just before |
| 143 | the function returns. Other defenses include things like shadow stacks. |
| 144 | |
| 145 | ### Stack depth overflow |
| 146 | |
| 147 | A less well understood attack is using a bug that triggers the |
| 148 | kernel to consume stack memory with deep function calls or large stack |
| 149 | allocations. With this attack it is possible to write beyond the end of |
| 150 | the kernel's preallocated stack space and into sensitive structures. Two |
| 151 | important changes need to be made for better protections: moving the |
| 152 | sensitive thread_info structure elsewhere, and adding a faulting memory |
| 153 | hole at the bottom of the stack to catch these overflows. |
| 154 | |
| 155 | ### Heap memory integrity |
| 156 | |
| 157 | The structures used to track heap free lists can be sanity-checked during |
| 158 | allocation and freeing to make sure they aren't being used to manipulate |
| 159 | other memory areas. |
| 160 | |
| 161 | ### Counter integrity |
| 162 | |
| 163 | Many places in the kernel use atomic counters to track object references |
| 164 | or perform similar lifetime management. When these counters can be made |
| 165 | to wrap (over or under), this traditionally exposes a use-after-free |
| 166 | flaw. By trapping atomic wrapping, this class of bug vanishes. |
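
One way to picture trapping a wrap is a counter that saturates instead of overflowing. The sketch below is a userspace illustration using C11 atomics; `struct demo_ref` and its helpers are hypothetical names, not a kernel API. A saturated counter is pinned forever, so the worst case becomes a memory leak rather than a use-after-free.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating reference counter (userspace sketch). */
struct demo_ref {
	atomic_uint count;
};

#define DEMO_REF_SATURATED UINT_MAX

static void demo_ref_init(struct demo_ref *r, unsigned int n)
{
	assert(n > 0);          /* a live object holds at least one reference */
	atomic_init(&r->count, n);
}

/* Take a reference; fails if the object is dead (0) or saturated. */
static bool demo_ref_get(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;   /* refuse to resurrect or to wrap */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

	return true;
}

/* Drop a reference; returns true only when the last one was released. */
static bool demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;   /* underflow or pinned: never free */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));

	return old == 1;
}
```

A caller frees the object only when `demo_ref_put()` returns true; both the underflow path and the saturated path return false, so neither can trigger a premature free.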
| 167 | |
| 168 | ### Size calculation overflow detection |
| 169 | |
| 170 | Similar to counter overflow, integer overflows (usually size calculations) |
| 171 | need to be detected at runtime to kill this class of bug, which |
| 172 | traditionally leads to being able to write past the end of kernel buffers. |
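
A small userspace sketch of runtime detection for a size calculation follows. The helper names are hypothetical, and `__builtin_mul_overflow` is a GCC/Clang builtin rather than standard C, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: compute n * size, reporting whether the
 * multiplication wrapped around SIZE_MAX.
 */
static bool demo_array_bytes(size_t n, size_t size, size_t *out)
{
	assert(out != NULL);
	return !__builtin_mul_overflow(n, size, out);
}

/* Allocate an array, failing instead of silently under-allocating. */
static void *demo_alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (!demo_array_bytes(n, size, &bytes))
		return NULL;    /* the size calculation overflowed */

	return malloc(bytes);
}
```

Without the check, a request such as `SIZE_MAX` elements of 2 bytes would wrap to a tiny allocation, and later writes sized by the element count would run past the end of the buffer.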
| 173 | |
| 174 | |
| 175 | ## Statistical defenses |
| 176 | |
| 177 | While many protections can be considered deterministic (e.g. read-only |
| 178 | memory cannot be written to), some protections provide only statistical |
| 179 | defense, in that an attack must gather enough information about a |
| 180 | running system to overcome the defense. While not perfect, these do |
| 181 | provide meaningful defenses. |
| 182 | |
| 183 | ### Canaries, blinding, and other secrets |
| 184 | |
| 185 | It should be noted that things like the stack canary discussed earlier |
Kees Cook | c9de4a8 | 2016-05-18 06:37:47 -0700 | [diff] [blame] | 186 | are technically statistical defenses, since they rely on a secret value, |
| 187 | and such values may become discoverable through an information exposure |
| 188 | flaw. |
| 189 | |
| 190 | Blinding literal values for things like JITs, where the executable |
| 191 | contents may be partially under the control of userspace, needs a similar |
| 192 | secret value. |
| 193 | |
| 194 | It is critical that the secret values used be separate (e.g. |
| 195 | different canary per stack) and high entropy (e.g. is the RNG actually |
| 196 | working?) in order to maximize their success. |
| 197 | |
| 198 | ### Kernel Address Space Layout Randomization (KASLR) |
| 199 | |
| 200 | Since the location of kernel memory is almost always instrumental in |
| 201 | mounting a successful attack, making the location non-deterministic |
| 202 | raises the difficulty of an exploit. (Note that this in turn makes |
| 203 | the value of information exposures higher, since they may be used to |
| 204 | discover desired memory locations.) |
| 205 | |
| 206 | #### Text and module base |
| 207 | |
| 208 | By relocating the physical and virtual base address of the kernel at |
| 209 | boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be |
| 210 | frustrated. Additionally, offsetting the module loading base address |
| 211 | means that even systems that load the same set of modules in the same |
| 212 | order every boot will not share a common base address with the rest of |
| 213 | the kernel text. |
| 214 | |
| 215 | #### Stack base |
| 216 | |
| 217 | If the base address of the kernel stack is not the same between processes, |
| 218 | or even not the same between syscalls, targets on or beyond the stack |
| 219 | become more difficult to locate. |
| 220 | |
| 221 | #### Dynamic memory base |
| 222 | |
| 223 | Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up |
| 224 | being relatively deterministic in layout due to the order of early-boot |
| 225 | initializations. If the base address of these areas is not the same |
| 226 | between boots, targeting them is frustrated, requiring an information |
| 227 | exposure specific to the region. |
| 228 | |
| 229 | #### Structure layout |
| 230 | |
| 231 | By performing a per-build randomization of the layout of sensitive |
| 232 | structures, attacks must either be tuned to known kernel builds or expose |
| 233 | enough kernel memory to determine structure layouts before manipulating |
| 234 | them. |
| 235 | |
| 236 | |
| 237 | ## Preventing Information Exposures |
| 238 | |
| 239 | Since the locations of sensitive structures are the primary target for |
| 240 | attacks, it is important to defend against exposure of both kernel memory |
| 241 | addresses and kernel memory contents (since they may contain kernel |
| 242 | addresses or other sensitive things like canary values). |
| 243 | |
| 244 | ### Unique identifiers |
| 245 | |
| 246 | Kernel memory addresses must never be used as identifiers exposed to |
| 247 | userspace. Instead, use an atomic counter, an idr, or similar unique |
| 248 | identifier. |
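
As an illustration of the atomic-counter option, here is a userspace sketch using C11 atomics. `struct demo_obj` and `demo_next_id` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object that must be identifiable from userspace. */
struct demo_obj {
	uint64_t id;    /* opaque handle: reveals nothing about memory layout */
	int payload;
};

/* Monotonic source of identifiers, shared by all objects. */
static atomic_uint_fast64_t demo_next_id = 1;

/* Assign an identifier from the counter instead of using the address. */
static void demo_obj_init(struct demo_obj *obj, int payload)
{
	assert(obj != NULL);
	obj->id = atomic_fetch_add(&demo_next_id, 1);
	obj->payload = payload;
}
```

The identifier stays unique for the lifetime of the counter, and exposing it tells an attacker nothing about where the object lives in memory.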
| 249 | |
| 250 | ### Memory initialization |
| 251 | |
| 252 | Memory copied to userspace must always be fully initialized. If not |
| 253 | explicitly memset(), this will require changes to the compiler to make |
| 254 | sure structure holes are cleared. |
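
The explicit memset() case can be sketched in userspace as follows. `struct demo_reply` is a hypothetical structure chosen because most ABIs insert padding after its first member; the sketch assumes, as kernel code does in practice, that member stores leave the already-cleared padding untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reply structure; most ABIs pad after "flag". */
struct demo_reply {
	char flag;
	long value;
};

/*
 * Prepare a reply that will be copied out as a whole. Clearing the
 * entire object first zeroes the padding bytes too; assigning the
 * members alone would leave stale data in the structure holes.
 */
static void demo_fill_reply(struct demo_reply *out, char flag, long value)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	out->flag = flag;
	out->value = value;
}
```

Every byte between `flag` and `value` is zero after the call, so copying `sizeof(struct demo_reply)` bytes to another domain exposes no leftover contents.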
| 255 | |
| 256 | ### Memory poisoning |
| 257 | |
| 258 | When releasing memory, it is best to poison the contents (clear stack on |
| 259 | syscall return, wipe heap memory on a free), to avoid reuse attacks that |
| 260 | rely on the old contents of memory. This frustrates many uninitialized |
| 261 | variable attacks, stack content exposures, heap content exposures, and |
| 262 | use-after-free attacks. |
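
A userspace sketch of wiping heap memory on free is shown below. The wrapper names and the poison pattern are choices made for this illustration; the caller is assumed to know the allocation size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON_BYTE 0x6b   /* pattern chosen for this sketch */

/*
 * Overwrite a buffer through a volatile pointer so the compiler
 * cannot discard the stores as dead writes before a free.
 */
static void demo_poison(void *buf, size_t len)
{
	volatile unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len--)
		*p++ = DEMO_POISON_BYTE;
}

/* Hypothetical free wrapper: wipe the old contents, then release. */
static void demo_free_poisoned(void *buf, size_t len)
{
	if (buf == NULL)
		return;
	demo_poison(buf, len);
	free(buf);
}
```

A plain `memset()` immediately before `free()` may be optimized away, which is why the sketch writes through a volatile pointer.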
| 263 | |
| 264 | ### Destination tracking |
| 265 | |
| 266 | To help kill classes of bugs that result in kernel addresses being |
| 267 | written to userspace, the destination of writes needs to be tracked. If |
| 268 | the buffer is destined for userspace (e.g. seq_file backed /proc files), |
| 269 | it should automatically censor sensitive values. |
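
One possible shape for this idea, sketched in userspace: a formatter that is told where its output is headed and censors addresses accordingly. The function name, the boolean flag, and the placeholder string are all assumptions made for illustration; real destination tracking would derive the destination from the buffer rather than from a caller-supplied flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical formatter: emits the real address only when the
 * output stays inside the trusted domain, and a fixed placeholder
 * when the buffer is destined for userspace.
 */
static int demo_format_ptr(char *buf, size_t len, const void *ptr,
			   bool to_userspace)
{
	assert(buf != NULL && len > 0);

	if (to_userspace)
		return snprintf(buf, len, "(censored)");

	return snprintf(buf, len, "%p", ptr);
}
```

Centralizing the decision in the formatter means individual callers cannot forget to censor a sensitive value.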