# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

Variables that are initialized once at __init time can be marked with
the (new and under development) __ro_after_init attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
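
As a minimal sketch of the "const" approach described above, the following userspace C shows an operations table whose function pointers end up in read-only data. The structure `demo_ops` and every function name here are hypothetical stand-ins, loosely modeled on kernel operation structures.

```c
#include <stddef.h>

/* Hypothetical operations structure, loosely modeled on kernel ops tables. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "const" lets the toolchain place the table in .rodata rather than .data,
 * so its function pointers cannot be overwritten once .rodata is enforced
 * read-only.
 */
static const struct demo_ops demo_fops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only ever see a pointer-to-const. */
static int demo_call_open(const struct demo_ops *ops, int id)
{
    return ops->open ? ops->open(id) : -1;
}
```

For a variable that has to be written during boot, the in-kernel equivalent would be a declaration along the lines of `static unsigned long demo_limit __ro_after_init;` (with `demo_limit` again being a made-up name).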

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.
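
To make the seccomp opt-in concrete, here is a small Linux-only sketch using strict mode, which leaves a process with only `read`, `write`, `exit`, and `sigreturn`. The function name `seccomp_strict_blocks_getpid` is hypothetical, and the experiment runs in a forked child so the caller is unaffected. Real deployments would more likely install a BPF filter than use strict mode.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the kernel killed the child for making a syscall outside the
 * strict-mode allowlist, 0 if the child survived, and -1 if the experiment
 * could not be run (fork failed or seccomp is unavailable).
 */
static int seccomp_strict_blocks_getpid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);             /* seccomp not available here */
        syscall(SYS_getpid);      /* not on the allowlist: SIGKILL */
        syscall(SYS_exit, 0);     /* only reached if not enforced */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The raw `syscall(SYS_exit, 0)` is used in the child because the C library's `_exit()` typically issues `exit_group`, which strict mode does not permit.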

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.
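
The canary check can be modeled by hand to show what the compiler does behind the scenes. This is a sketch under stated assumptions: `struct demo_frame` is a hypothetical stand-in for a stack frame, the secret is passed in explicitly, and the "overflow" is kept inside the structure so the demo itself stays well defined.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame layout: the canary sits just past the local buffer. */
struct demo_frame {
    char     buf[16];
    uint64_t canary;
};

/* Function prologue: place the secret value behind the buffer. */
static void frame_enter(struct demo_frame *f, uint64_t secret)
{
    memset(f->buf, 0, sizeof(f->buf));
    f->canary = secret;
}

/* Deliberately unchecked copy, standing in for the buggy code path. */
static void frame_copy_unchecked(struct demo_frame *f, const void *src, size_t n)
{
    if (n > sizeof(*f))
        n = sizeof(*f);    /* keep the demo itself within the object */
    memcpy(f, src, n);
}

/* The check a compiler-inserted epilogue would perform before returning. */
static bool frame_exit_ok(const struct demo_frame *f, uint64_t secret)
{
    return f->canary == secret;
}
```

A copy of 16 bytes leaves the canary intact, while anything longer clobbers it and is caught by `frame_exit_ok()` before the (modeled) return.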

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
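
One common form of such a sanity check is verifying a doubly linked node's neighbours before unlinking it. The sketch below uses a hypothetical `struct demo_node`; it is not the kernel's allocator code, only an illustration of the idea.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical doubly linked free-list node. */
struct demo_node {
    struct demo_node *prev;
    struct demo_node *next;
};

static void demo_list_init(struct demo_node *head)
{
    head->prev = head;
    head->next = head;
}

/* Insert n directly after head. */
static void demo_list_add(struct demo_node *head, struct demo_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Unlink only if both neighbours still point back at the node. A corrupted
 * node would otherwise turn the two pointer stores below into a write to an
 * attacker-chosen address.
 */
static bool demo_list_del_checked(struct demo_node *n)
{
    if (n->prev == NULL || n->next == NULL)
        return false;
    if (n->prev->next != n || n->next->prev != n)
        return false;
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->prev = n->next = NULL;
    return true;
}
```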

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
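
A minimal sketch of a wrap-resistant reference counter, written with C11 atomics, might look like the following. The type `struct demo_ref` and its helpers are hypothetical; the policy shown (refuse to move a counter that is 0 or pinned at `UINT_MAX`) is one possible choice, trading a memory leak for the absence of a use-after-free.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter that saturates instead of wrapping. */
struct demo_ref {
    atomic_uint count;
};

/* Take a reference; refuses if the count is 0 (object gone) or saturated. */
static bool demo_ref_get(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool demo_ref_put(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;    /* underflow or saturated: leak, never free */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```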

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
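
As an illustration of runtime detection, the sketch below checks an array size calculation before allocating. It assumes a GCC or Clang toolchain for `__builtin_mul_overflow`, and `demo_alloc_array` is a hypothetical helper rather than a kernel API.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate an array of n elements of the given size, failing cleanly when
 * n * size does not fit in size_t. __builtin_mul_overflow is a GCC/Clang
 * builtin that returns true when the multiplication overflowed.
 */
static void *demo_alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

Without the check, a wrapped product would yield a small allocation that later code, still believing it holds `n` elements, would write past.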


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used are separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.
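
The idea behind blinding a literal can be sketched in a few lines: the JIT embeds the value XORed with a secret, and the emitted code undoes the XOR at run time, so userspace never gets to choose the exact bytes that land in executable memory. The helper names are hypothetical, and the key is passed in here only for illustration; a real implementation would draw it from a high-entropy source.

```c
#include <stdint.h>

/* What the JIT would embed in the instruction stream instead of the raw value. */
static uint32_t demo_blind(uint32_t imm, uint32_t key)
{
    return imm ^ key;
}

/* What the emitted code would compute at run time to recover the value. */
static uint32_t demo_unblind(uint32_t blinded, uint32_t key)
{
    return blinded ^ key;
}
```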

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
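
A minimal sketch of the atomic counter option, using C11 atomics in place of kernel primitives, could look like this. `struct demo_obj` and `demo_next_id` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

struct demo_obj {
    uint64_t id;      /* safe to report to userspace */
    int payload;
};

/* Shared counter; each object gets the next value. */
static atomic_uint_fast64_t demo_next_id = 1;

/* Tag the object with the next counter value rather than its address. */
static void demo_obj_init(struct demo_obj *obj, int payload)
{
    obj->id = atomic_fetch_add(&demo_next_id, 1);
    obj->payload = payload;
}
```

The identifier is still unique per object, but it reveals nothing about where the object lives in memory.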

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
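
The structure hole problem can be seen with a small, hypothetical reply structure: on most ABIs the compiler inserts padding after the one-byte member, and assigning the members individually leaves that padding holding whatever was in memory before. The sketch below shows the explicit memset() approach.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reply structure; most ABIs insert padding after "flag". */
struct demo_reply {
    uint8_t  flag;
    uint64_t value;
};

/* Clear the whole object first so padding never carries stale bytes. */
static void demo_fill_reply(struct demo_reply *r, uint8_t flag, uint64_t value)
{
    memset(r, 0, sizeof(*r));
    r->flag = flag;
    r->value = value;
}
```

Note that a designated initializer alone is not a substitute here, since it is not guaranteed to clear padding bytes.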

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
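
Wiping heap memory on free can be sketched in userspace as below. The helper names are hypothetical; the poison byte 0x6b matches the value the kernel uses for freed slab objects. The volatile pointer is there because an optimizing compiler may otherwise discard stores to memory that is about to be freed.

```c
#include <stdlib.h>

#define DEMO_POISON_FREE 0x6b   /* same byte value the kernel uses for freed objects */

/* Overwrite an object; volatile keeps the stores from being optimized away. */
static void demo_poison(void *obj, size_t size)
{
    volatile unsigned char *p = obj;

    while (size--)
        *p++ = DEMO_POISON_FREE;
}

/* Poison first, then hand the memory back to the allocator. */
static void demo_free_poisoned(void *obj, size_t size)
{
    if (obj == NULL)
        return;
    demo_poison(obj, size);
    free(obj);
}
```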

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
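
One way to picture destination tracking is an output sink that remembers where its bytes will end up and censors pointers accordingly. Everything in this sketch is hypothetical (`struct demo_sink`, `demo_emit_ptr`, and the placeholder string); it only illustrates the decision point, not any existing kernel interface.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical output sink that records where its bytes will end up. */
struct demo_sink {
    char buf[64];
    bool to_userspace;
};

/* Format a pointer, replacing it with a fixed placeholder for userspace sinks. */
static int demo_emit_ptr(struct demo_sink *s, const void *ptr)
{
    if (s->to_userspace)
        return snprintf(s->buf, sizeof(s->buf), "(censored)");
    return snprintf(s->buf, sizeof(s->buf), "%p", ptr);
}
```

Because the decision lives in the formatting helper, individual callers cannot forget to censor an address.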