Blame - Documentation/security/self-protection.txt - kernel/msm-4.9

# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.
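
As a rough illustration of the "const" approach, the sketch below declares
an operations table as `static const` so that the toolchain can place it in
read-only data. The `demo_*` names are hypothetical and exist only for this
example; it is written as plain userspace C, so `__ro_after_init` appears
only in a comment.

```c
/* Hypothetical operations table, loosely modeled on kernel "ops" structures. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "const" lets the toolchain place the table in .rodata rather than .data,
 * so the function pointers are covered by read-only memory permissions.
 * In kernel code, a table that has to be filled in during boot could
 * instead be marked __ro_after_init (not available in userspace).
 */
static const struct demo_ops demo_fops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers only ever see a pointer-to-const, so writes fail to compile. */
const struct demo_ops *demo_get_ops(void)
{
	return &demo_fops;
}
```

An assignment such as `demo_get_ops()->open = other_fn;` is rejected at
compile time, which is the cheap half of the protection; the read-only
mapping is what stops a write that bypasses the type system.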

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
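
A userspace analogy may help make the "read-only except during updates"
idea concrete. The sketch below keeps a hypothetical table on its own page
and uses POSIX `mmap()` and `mprotect()` to open a short write window. It is
only an analogy: `mprotect()` changes permissions for the whole process,
whereas the text above asks for write access limited to the updating CPU
thread. All `table_*` names are assumptions made for this example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical rarely-updated table kept on its own page. */
static unsigned long *table;
static size_t table_len;	/* size of the mapping in bytes */

/* Map one page, store the initial value, then drop write permission. */
int table_init(unsigned long initial)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return -1;
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	table = p;
	table_len = (size_t)page;
	table[0] = initial;
	return mprotect(table, table_len, PROT_READ);
}

/* Open a write window, perform the update, and close the window again. */
int table_update(unsigned long value)
{
	if (!table)
		return -1;
	if (mprotect(table, table_len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	table[0] = value;
	return mprotect(table, table_len, PROT_READ);
}

unsigned long table_read(void)
{
	return table[0];
}
```

Outside `table_update()`, a direct store to `table[0]` faults, so an
attacker needs more than a plain write primitive to change the value.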

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.
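
The canary check can be modeled without touching a real stack. The sketch
below uses a hypothetical `struct fake_frame` that lays out a buffer, a
canary, and a return address in that order, performs an unchecked copy, and
then verifies the canary the way a function epilogue would. The fixed secret
is an assumption for repeatability; a real canary is a random value chosen
at runtime, and the compiler emits the check automatically.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 16

/* Hypothetical model of a frame: locals, then canary, then return address. */
struct fake_frame {
	unsigned char buf[BUF_SIZE];
	uintptr_t canary;
	uintptr_t ret_addr;
};

/*
 * Copy len attacker-controlled bytes into the frame with no check against
 * BUF_SIZE, then verify the canary. Returns 0 if the canary is intact and
 * -1 if the copy clobbered it.
 */
int simulate_copy(size_t len)
{
	struct fake_frame f;
	unsigned char input[sizeof(f)];
	const uintptr_t secret = (uintptr_t)0x5eedc0deUL; /* fixed for the demo */

	if (len > sizeof(input))
		len = sizeof(input);	/* stay inside the model object */
	memset(input, 'A', sizeof(input));
	memset(&f, 0, sizeof(f));
	f.canary = secret;

	/* Unchecked copy: anything beyond BUF_SIZE spills into the canary. */
	memcpy((unsigned char *)&f, input, len);

	return f.canary == secret ? 0 : -1;
}
```

A copy of up to `BUF_SIZE` bytes leaves the canary untouched, while a copy
that is even one byte longer is detected before the (modeled) return
address would be used.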

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
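
One common form of such a sanity check is to confirm, before unlinking an
entry from a doubly linked free list, that its neighbours still point back
at it. The sketch below shows that idea with hypothetical `free_list_*`
helpers; it is not the kernel's allocator code, only a minimal model of the
check.

```c
#include <stddef.h>

struct free_node {
	struct free_node *next, *prev;
};

void free_list_init(struct free_node *head)
{
	head->next = head->prev = head;
}

/* Insert node directly after head. */
void free_list_add(struct free_node *head, struct free_node *node)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/*
 * Unlink node only if both neighbours still point back at it. A corrupted
 * next/prev pointer would otherwise turn the unlink into a write to an
 * attacker-chosen address. Returns 0 on success, -1 if the check fails.
 */
int free_list_del_checked(struct free_node *node)
{
	if (!node->next || !node->prev)
		return -1;
	if (node->prev->next != node || node->next->prev != node)
		return -1;
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = NULL;
	return 0;
}

/* Build a two-entry list, optionally corrupt one link, then try to unlink. */
int free_list_demo(int corrupt)
{
	struct free_node head, a, b, target;

	free_list_init(&head);
	free_list_init(&target);
	free_list_add(&head, &a);
	free_list_add(&head, &b);
	if (corrupt)
		a.next = &target;	/* simulated attacker overwrite */
	return free_list_del_checked(&a);
}
```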

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
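
A minimal sketch of a wrap-resistant reference counter follows, using C11
atomics. The `demo_ref_*` names are hypothetical, and this version refuses
the operation and reports failure instead of trapping, which is an
assumption made to keep the example testable in userspace.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_ref {
	atomic_uint count;
};

/*
 * Take a reference. Fails, leaving the count unchanged, if the object has
 * no references left or if the increment would wrap around to zero.
 */
bool demo_ref_get(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
	return true;
}

/*
 * Drop a reference. Returns 1 if this was the last reference, 0 if others
 * remain, and -1 if the count was already zero (underflow attempt).
 */
int demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0)
			return -1;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
	return old == 1 ? 1 : 0;
}
```

Because the count can never pass through zero by wrapping, an attacker who
can trigger many increments cannot force an early free of a live object.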

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
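
The typical case is an allocation sized as `n * size`, where a wrapped
product yields a buffer far smaller than the caller believes. The sketch
below checks the multiplication before allocating. The helper names are
hypothetical, and `malloc()` stands in for a kernel allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Store n * size in *out and return 0, or return -1 without touching *out
 * if the product does not fit in a size_t.
 */
int checked_mul_size(size_t n, size_t size, size_t *out)
{
	if (size != 0 && n > SIZE_MAX / size)
		return -1;
	*out = n * size;
	return 0;
}

/* Allocate an array of n elements, refusing requests whose size wraps. */
void *alloc_array_checked(size_t n, size_t size)
{
	size_t bytes;

	if (checked_mul_size(n, size, &bytes) != 0)
		return NULL;
	return malloc(bytes);
}
```

Without the check, `alloc_array_checked(SIZE_MAX / 2 + 1, 2)` would compute
a size of zero and hand back a tiny buffer that later writes overrun.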


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
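
Constant blinding can be sketched as follows: rather than emitting a
userspace-supplied immediate directly into executable memory, the JIT emits
the immediate XORed with a secret, followed by an instruction that XORs the
secret back in at runtime. The structure and function names below are
hypothetical, and the key is passed in explicitly for the demo; in practice
it would need to come from a working RNG and be fresh per immediate.

```c
#include <stdint.h>

/* Hypothetical two-instruction sequence standing in for JIT output. */
struct blinded_imm {
	uint32_t load_value;	/* constant in the "load": imm ^ key */
	uint32_t xor_value;	/* constant in the following "xor": key */
};

/* Split an immediate so that it never appears literally in the output. */
struct blinded_imm blind_immediate(uint32_t imm, uint32_t key)
{
	struct blinded_imm b = { .load_value = imm ^ key, .xor_value = key };

	return b;
}

/* What the emitted pair computes at runtime: the original immediate. */
uint32_t run_blinded(struct blinded_imm b)
{
	return b.load_value ^ b.xor_value;
}
```

With a non-zero key, neither emitted constant equals the value userspace
chose, so the attacker cannot rely on their bytes landing verbatim in
executable memory, yet the program still computes the same result.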

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
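
The atomic-counter variant might look like the sketch below, in which each
object receives its identifier from a shared counter at initialization and
only that identifier is ever reported. The `demo_object` type and its
helpers are hypothetical names used for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

struct demo_object {
	uint64_t id;	/* safe to expose: reveals nothing about layout */
	int payload;
};

/* Shared counter; every new object takes the next value. */
static atomic_uint_fast64_t next_object_id = 1;

void demo_object_init(struct demo_object *obj, int payload)
{
	obj->id = atomic_fetch_add(&next_object_id, 1);
	obj->payload = payload;
}

/* The value handed to userspace is the counter-derived id, never &obj. */
uint64_t demo_object_user_id(const struct demo_object *obj)
{
	return obj->id;
}
```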

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
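
The explicit memset() pattern is sketched below with a hypothetical
`struct demo_reply` that has a padding hole between its two fields on most
ABIs. Clearing the whole structure before assigning fields keeps stale
bytes out of the hole. The inspection helper is included only so the effect
can be observed; strictly speaking the C standard leaves padding contents
unspecified after a member store, so this relies on common compiler
behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply; most ABIs insert padding between flag and value. */
struct demo_reply {
	uint8_t  flag;
	uint64_t value;
};

/* Prepare a reply that will later be copied out as raw bytes. */
void demo_reply_fill(struct demo_reply *r, uint8_t flag, uint64_t value)
{
	memset(r, 0, sizeof(*r));	/* clears the hole as well as the fields */
	r->flag = flag;
	r->value = value;
}

/* Count non-zero bytes in the hole between flag and value. */
size_t demo_reply_dirty_padding(const struct demo_reply *r)
{
	const unsigned char *raw = (const unsigned char *)r;
	size_t dirty = 0;
	size_t i;

	for (i = offsetof(struct demo_reply, flag) + sizeof(r->flag);
	     i < offsetof(struct demo_reply, value); i++)
		dirty += raw[i] != 0;
	return dirty;
}
```

Assigning only `flag` and `value` to an uninitialized structure would leave
whatever was previously in that memory inside the hole, and a byte-wise
copy to userspace would expose it.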

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
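
Poison-on-free can be sketched with a tiny fixed-slot pool, as below. The
pool and its `slot_*` functions are hypothetical and much simpler than a
real allocator; the poison byte 0x6b matches the value the kernel defines
as POISON_FREE.

```c
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE   32
#define SLOT_COUNT  4
#define POISON_FREE 0x6b	/* byte written over freed objects */

static unsigned char pool[SLOT_COUNT][SLOT_SIZE];
static int in_use[SLOT_COUNT];

/* Hand out the first unused slot, or NULL if the pool is exhausted. */
void *slot_alloc(void)
{
	int i;

	for (i = 0; i < SLOT_COUNT; i++) {
		if (!in_use[i]) {
			in_use[i] = 1;
			return pool[i];
		}
	}
	return NULL;
}

/* Overwrite the slot with the poison pattern before marking it free. */
void slot_free(void *p)
{
	int i;

	for (i = 0; i < SLOT_COUNT; i++) {
		if (p == (void *)pool[i] && in_use[i]) {
			memset(pool[i], POISON_FREE, SLOT_SIZE);
			in_use[i] = 0;
			return;
		}
	}
}
```

The next caller to receive the slot sees only the poison pattern, so
neither an uninitialized read nor a stale pointer recovers the previous
owner's data.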

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
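
One way to picture this is an output buffer that records where its contents
will end up, so that the formatting helper can censor pointers by itself.
The sketch below is a userspace model under that assumption; `demo_out` and
`demo_out_put_ptr()` are hypothetical names and do not correspond to an
existing kernel interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum demo_dest { DEST_KERNEL_LOG, DEST_USERSPACE };

/* Hypothetical output buffer that remembers where its contents will go. */
struct demo_out {
	enum demo_dest dest;
	char buf[64];
	size_t len;
};

/*
 * Append a pointer value to the buffer. If the buffer is destined for
 * userspace the address is replaced with zero. Returns 0 on success and
 * -1 if the text does not fit.
 */
int demo_out_put_ptr(struct demo_out *o, const void *ptr)
{
	unsigned long long shown = (o->dest == DEST_USERSPACE)
		? 0 : (unsigned long long)(uintptr_t)ptr;
	size_t room = sizeof(o->buf) - o->len;
	int n = snprintf(o->buf + o->len, room, "0x%llx", shown);

	if (n < 0 || (size_t)n >= room)
		return -1;
	o->len += (size_t)n;
	return 0;
}
```

Because the decision lives in the helper rather than at each call site, a
new caller that forgets about the exposure risk still produces censored
output for userspace-bound buffers.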