Blame - Documentation/security/self-protection.rst - kernel/msm-4.19

Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	1	======================
				2	Kernel Self-Protection
				3	======================
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	4
				5	Kernel self-protection is the design and implementation of systems and
				6	structures within the Linux kernel to protect against security flaws in
				7	the kernel itself. This covers a wide range of issues, including removing
				8	entire classes of bugs, blocking security flaw exploitation methods,
				9	and actively detecting attack attempts. Not all topics are explored in
				10	this document, but it should serve as a reasonable starting point and
				11	answer any frequently asked questions. (Patches welcome, of course!)
				12
				13	In the worst-case scenario, we assume an unprivileged local attacker
				14	has arbitrary read and write access to the kernel's memory. In many
				15	cases, bugs being exploited will not provide this level of access,
				16	but with systems in place that defend against the worst case we'll
				17	cover the more limited cases as well. A higher bar, and one that should
				18	still be kept in mind, is protecting the kernel against a _privileged_
				19	local attacker, since the root user has access to a vastly increased
				20	attack surface. (Especially when they have the ability to load arbitrary
				21	kernel modules.)
				22
				23	The goals for successful self-protection systems would be that they
				24	are effective, on by default, require no opt-in by developers, have no
				25	performance impact, do not impede kernel debugging, and have tests. It
				26	is uncommon that all these goals can be met, but it is worth explicitly
				27	mentioning them, since these aspects need to be explored, dealt with,
				28	and/or accepted.
				29
				30
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	31	Attack Surface Reduction
				32	========================
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	33
				34	The most fundamental defense against security exploits is to reduce the
				35	areas of the kernel that can be used to redirect execution. This ranges
				36	from limiting the exposed APIs available to userspace, to making in-kernel
				37	APIs hard to use incorrectly, to minimizing the areas of writable kernel
				38	memory.
				39
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	40	Strict kernel memory permissions
				41	--------------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	42
				43	When all of kernel memory is writable, it becomes trivial for attacks
				44	to redirect execution flow. To reduce the availability of these targets
				45	the kernel needs to protect its memory with a tight set of permissions.
				46
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	47	Executable code and read-only data must not be writable
				48	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	49
				50	Any areas of the kernel with executable memory must not be writable.
				51	While this obviously includes the kernel text itself, we must consider
				52	all additional places too: kernel modules, JIT memory, etc. (There are
				53	temporary exceptions to this rule to support things like instruction
				54	alternatives, breakpoints, kprobes, etc. If these must exist in a
				55	kernel, they are implemented in a way where the memory is temporarily
				56	made writable during the update, and then returned to the original
				57	permissions.)
				58
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	59	In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
				60	``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	61	writable, data is not executable, and read-only data is neither writable
				62	nor executable.
				63
Laura Abbott	ad21fc4	2017-02-06 16:31:57 -0800	[diff] [blame]	64	Most architectures have these options on by default and not user selectable.
				65	For some architectures like arm that wish to have these be selectable,
				66	the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	67	a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
Laura Abbott	ad21fc4	2017-02-06 16:31:57 -0800	[diff] [blame]	68	the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.
				69
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	70	Function pointers and sensitive variables must not be writable
				71	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	72
				73	Vast areas of kernel memory contain function pointers that are looked
				74	up by the kernel and used to continue execution (e.g. descriptor/vector
				75	tables, file/network/etc operation structures, etc). The number of these
				76	variables must be reduced to an absolute minimum.
				77
				78	Many such variables can be made read-only by setting them "const"
				79	so that they live in the .rodata section instead of the .data section
				80	of the kernel, gaining the protection of the kernel's strict memory
				81	permissions as described above.
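
As a rough userspace illustration of the "const" approach, the sketch below
declares an operations table as const so the toolchain can place it in
read-only data. The structure, the function names, and the dispatch helper
are all hypothetical and are not taken from any kernel subsystem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations structure, loosely modeled on the kind of
 * function pointer tables described above. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* "const" allows the table to be placed in .rodata instead of .data,
 * so the pointers are covered by read-only memory permissions. */
static const struct demo_ops demo_fops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers only ever receive a pointer-to-const table. */
int demo_dispatch_open(const struct demo_ops *ops, int id)
{
	assert(ops != NULL && ops->open != NULL);
	return ops->open(id);
}
```

Any attempt to assign to a member of ``demo_fops`` is rejected at compile
time, and a stray runtime write would fault once the section is mapped
read-only.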
				82
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	83	For variables that are initialized once at ``__init`` time, these can
				84	be marked with the ``__ro_after_init``
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	85	attribute.
				86
				87	What remains are variables that are updated rarely (e.g. GDT). These
				88	will need another infrastructure (similar to the temporary exceptions
				89	made to kernel code mentioned above) that allows them to spend the rest
				90	of their lifetime read-only. (For example, when being updated, only the
				91	CPU thread performing the update would be given uninterruptible write
				92	access to the memory.)
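
The idea of a rarely updated variable that is otherwise read-only can be
approximated in userspace with mprotect(). This is only an analogy under
stated assumptions: the helper names are hypothetical, the variable gets a
whole page to itself, and nothing here provides the per-CPU exclusivity
described above (every thread in the process sees the page as writable
during the update window).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical "write-rarely" variable living in its own page. */
static long *rare_value;
static size_t rare_len;

/* Map one page, store the initial value, then drop write permission. */
int rare_init(long initial)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return -1;
	rare_len = (size_t)page;
	p = mmap(NULL, rare_len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	rare_value = p;
	*rare_value = initial;
	return mprotect(rare_value, rare_len, PROT_READ);
}

/* Open a short write window, update, and restore read-only access. */
int rare_update(long v)
{
	assert(rare_value != NULL);
	if (mprotect(rare_value, rare_len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	*rare_value = v;
	return mprotect(rare_value, rare_len, PROT_READ);
}

long rare_read(void)
{
	assert(rare_value != NULL);
	return *rare_value;
}
```

Outside of rare_update(), a write through ``rare_value`` faults, which is
the property the text asks for over the variable's lifetime.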
				93
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	94	Segregation of kernel memory from userspace memory
				95	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	96
				97	The kernel must never execute userspace memory. The kernel must also never
				98	access userspace memory without explicit expectation to do so. These
				99	rules can be enforced either by support of hardware-based restrictions
				100	(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
				101	By blocking userspace memory in this way, execution and data parsing
				102	cannot be passed to trivially-controlled userspace memory, forcing
				103	attacks to operate entirely in kernel memory.
				104
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	105	Reduced access to syscalls
				106	--------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	107
				108	One trivial way to eliminate many syscalls for 64-bit systems is building
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	109	without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	110
				111	The "seccomp" system provides an opt-in feature made available to
				112	userspace, which provides a way to reduce the number of kernel entry
				113	points available to a running process. This limits the breadth of kernel
				114	code that can be reached, possibly reducing the availability of a given
				115	bug to an attack.
				116
				117	An area of improvement would be creating viable ways to keep access to
				118	things like compat, user namespaces, BPF creation, and perf limited only
				119	to trusted processes. This would keep the scope of kernel entry points
				120	restricted to the more regular set normally available to unprivileged
				121	userspace.
				122
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	123	Restricting access to kernel modules
				124	------------------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	125
				126	The kernel should never allow an unprivileged user the ability to
				127	load specific kernel modules, since that would provide a facility to
				128	unexpectedly extend the available attack surface. (The on-demand loading
				129	of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
				130	considered "expected" here, though additional consideration should be
				131	given even to these.) For example, loading a filesystem module via an
				132	unprivileged socket API is nonsense: only the root or physically local
				133	user should trigger filesystem module loading. (And even this can be up
				134	for debate in some scenarios.)
				135
				136	To protect against even privileged users, systems may need to either
				137	disable module loading entirely (e.g. monolithic kernel builds or
				138	modules_disabled sysctl), or provide signed modules (e.g.
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	139	``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	140	root load arbitrary kernel code via the module loader interface.
				141
				142
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	143	Memory integrity
				144	================
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	145
				146	There are many memory structures in the kernel that are regularly abused
				147	to gain execution control during an attack. By far the most commonly
				148	understood is that of the stack buffer overflow in which the return
				149	address stored on the stack is overwritten. Many other examples of this
				150	kind of attack exist, and protections exist to defend against them.
				151
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	152	Stack buffer overflow
				153	---------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	154
				155	The classic stack buffer overflow involves writing past the expected end
				156	of a variable stored on the stack, ultimately writing a controlled value
				157	to the stack frame's stored return address. The most widely used defense
				158	is the presence of a stack canary between the stack variables and the
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	159	return address (``CONFIG_STACKPROTECTOR``), which is verified just before
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	160	the function returns. Other defenses include things like shadow stacks.
				161
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	162	Stack depth overflow
				163	--------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	164
				165	A less well understood attack is using a bug that triggers the
				166	kernel to consume stack memory with deep function calls or large stack
				167	allocations. With this attack it is possible to write beyond the end of
				168	the kernel's preallocated stack space and into sensitive structures. Two
				169	important changes need to be made for better protections: moving the
				170	sensitive thread_info structure elsewhere, and adding a faulting memory
				171	hole at the bottom of the stack to catch these overflows.
				172
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	173	Heap memory integrity
				174	---------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	175
				176	The structures used to track heap free lists can be sanity-checked during
				177	allocation and freeing to make sure they aren't being used to manipulate
				178	other memory areas.
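
One way to picture such a sanity check is a doubly linked list whose unlink
operation first verifies that the neighbours still point back at the entry.
The sketch below is a generic list with hypothetical names, not the free
list of any real allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
};

/* An empty list is a head that points at itself. */
void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* Insert n directly after head. */
void list_add(struct node *head, struct node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n only if both neighbours still point back at it. A corrupted
 * or already-unlinked entry is refused instead of being used to write
 * through attacker-influenced pointers. */
bool list_del_checked(struct node *n)
{
	assert(n != NULL);
	if (n->prev == NULL || n->next == NULL)
		return false;
	if (n->prev->next != n || n->next->prev != n)
		return false;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
	return true;
}
```

Without the check, unlinking an entry whose pointers were overwritten would
perform two writes to locations chosen by whoever corrupted the entry.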
				179
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	180	Counter integrity
				181	-----------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	182
				183	Many places in the kernel use atomic counters to track object references
				184	or perform similar lifetime management. When these counters can be made
				185	to wrap (over or under), this traditionally exposes a use-after-free
				186	flaw. By trapping atomic wrapping, this class of bug vanishes.
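
A minimal sketch of a counter that refuses to wrap, written with C11
atomics. The type and function names are hypothetical. This sketch saturates
at UINT_MAX rather than trapping, which is an assumption about one possible
policy: a saturated counter leaks the object instead of freeing it early.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ref {
	atomic_uint count;
};

void ref_init(struct ref *r, unsigned int n)
{
	assert(r != NULL);
	atomic_init(&r->count, n);
}

/* Take a reference. Fails if the count is 0 (object already released)
 * or pinned at UINT_MAX (saturated), so it can never wrap to 0. */
bool ref_get(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
	return true;
}

/* Drop a reference. Returns true only for the caller that released the
 * last one. Fails on 0 (underflow) and on a saturated counter. */
bool ref_put(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
	return old == 1;
}
```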
				187
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	188	Size calculation overflow detection
				189	-----------------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	190
				191	Similar to counter overflow, integer overflows (usually size calculations)
				192	need to be detected at runtime to kill this class of bug, which
				193	traditionally leads to being able to write past the end of kernel buffers.
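
For size calculations, a runtime check can be as small as the sketch below.
It assumes a GCC or Clang toolchain (for ``__builtin_mul_overflow``), and the
helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Store a * b in *out and return true, or return false if the product
 * does not fit in size_t. */
bool size_mul_checked(size_t a, size_t b, size_t *out)
{
	assert(out != NULL);
	return !__builtin_mul_overflow(a, b, out);
}

/* Allocate n elements of elem_size bytes, refusing the request when the
 * byte count would wrap instead of allocating a too-small buffer. */
void *alloc_array_checked(size_t n, size_t elem_size)
{
	size_t bytes;

	if (!size_mul_checked(n, elem_size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

The unchecked form, ``malloc(n * elem_size)``, silently returns a short
buffer when the multiplication wraps, and later writes run past its end.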
				194
				195
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	196	Probabilistic defenses
				197	======================
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	198
				199	While many protections can be considered deterministic (e.g. read-only
				200	memory cannot be written to), some protections provide only statistical
				201	defense, in that an attack must gather enough information about a
				202	running system to overcome the defense. While not perfect, these do
				203	provide meaningful defenses.
				204
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	205	Canaries, blinding, and other secrets
				206	-------------------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	207
				208	It should be noted that things like the stack canary discussed earlier
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	209	are technically statistical defenses, since they rely on a secret value,
				210	and such values may become discoverable through an information exposure
				211	flaw.
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	212
				213	Blinding literal values for things like JITs, where the executable
				214	contents may be partially under the control of userspace, needs a similar
				215	secret value.
				216
				217	It is critical that the secret values used are separate (e.g.
				218	different canary per stack) and high entropy (e.g. is the RNG actually
				219	working?) in order to maximize their success.
				220
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	221	Kernel Address Space Layout Randomization (KASLR)
				222	-------------------------------------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	223
				224	Since the location of kernel memory is almost always instrumental in
				225	mounting a successful attack, making the location non-deterministic
				226	raises the difficulty of an exploit. (Note that this in turn makes
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	227	the value of information exposures higher, since they may be used to
				228	discover desired memory locations.)
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	229
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	230	Text and module base
				231	~~~~~~~~~~~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	232
				233	By relocating the physical and virtual base address of the kernel at
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	234	boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	235	frustrated. Additionally, offsetting the module loading base address
				236	means that even systems that load the same set of modules in the same
				237	order every boot will not share a common base address with the rest of
				238	the kernel text.
				239
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	240	Stack base
				241	~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	242
				243	If the base address of the kernel stack is not the same between processes,
				244	or even not the same between syscalls, targets on or beyond the stack
				245	become more difficult to locate.
				246
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	247	Dynamic memory base
				248	~~~~~~~~~~~~~~~~~~~
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	249
				250	Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
				251	being relatively deterministic in layout due to the order of early-boot
				252	initializations. If the base address of these areas is not the same
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	253	between boots, targeting them is frustrated, requiring an information
				254	exposure specific to the region.
				255
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	256	Structure layout
				257	~~~~~~~~~~~~~~~~
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	258
				259	By performing a per-build randomization of the layout of sensitive
				260	structures, attacks must either be tuned to known kernel builds or expose
				261	enough kernel memory to determine structure layouts before manipulating
				262	them.
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	263
				264
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	265	Preventing Information Exposures
				266	================================
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	267
				268	Since the locations of sensitive structures are the primary target for
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	269	attacks, it is important to defend against exposure of both kernel memory
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	270	addresses and kernel memory contents (since they may contain kernel
				271	addresses or other sensitive things like canary values).
				272
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	273	Unique identifiers
				274	------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	275
				276	Kernel memory addresses must never be used as identifiers exposed to
				277	userspace. Instead, use an atomic counter, an idr, or similar unique
				278	identifier.
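
A small sketch of the counter-based alternative, using C11 atomics. The
``session`` structure and its helper are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object that needs an identifier visible to userspace. */
struct session {
	uint64_t id;
};

/* Monotonic counter: identifiers reveal nothing about memory layout. */
static atomic_uint_fast64_t next_session_id = 1;

void session_init(struct session *s)
{
	assert(s != NULL);
	/* Not "(uint64_t)(uintptr_t)s": that would expose an address. */
	s->id = atomic_fetch_add(&next_session_id, 1);
}
```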
				279
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	280	Memory initialization
				281	---------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	282
				283	Memory copied to userspace must always be fully initialized. If not
				284	explicitly memset(), this will require changes to the compiler to make
				285	sure structure holes are cleared.
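
The explicit memset() case can be sketched as follows. The ``reply``
structure is hypothetical; on common ABIs it has padding bytes between its
two members, and those bytes would otherwise carry whatever was previously
in that memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure destined for userspace. On common ABIs there
 * is a hole between "kind" and "value". */
struct reply {
	uint8_t  kind;
	uint32_t value;
};

/* Clear the whole object first, holes included, then set the fields. */
void reply_fill(struct reply *out, uint8_t kind, uint32_t value)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	out->kind = kind;
	out->value = value;
}
```

Assigning the members one by one, or using a designated initializer, is not
a substitute: neither is guaranteed to clear the padding.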
				286
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	287	Memory poisoning
				288	----------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	289
				290	When releasing memory, it is best to poison the contents (clear stack on
				291	syscall return, wipe heap memory on a free), to avoid reuse attacks that
				292	rely on the old contents of memory. This frustrates many uninitialized
Kees Cook	c9de4a8	2016-05-18 06:37:47 -0700	[diff] [blame]	293	variable attacks, stack content exposures, heap content exposures, and
				294	use-after-free attacks.
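
A userspace sketch of poisoning on free. The helper names and the poison
pattern are arbitrary choices for illustration. The volatile pointer is
there because a plain memset() immediately before free() may be removed by
the compiler as a dead store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Arbitrary pattern chosen for this sketch. */
#define POISON_BYTE 0x6b

/* Overwrite len bytes at p. Writing through a volatile pointer keeps
 * the compiler from discarding the stores. */
void poison(void *p, size_t len)
{
	volatile unsigned char *b = p;

	assert(p != NULL || len == 0);
	while (len--)
		*b++ = POISON_BYTE;
}

/* Wipe the old contents, then release the memory. */
void free_poisoned(void *p, size_t len)
{
	if (p == NULL)
		return;
	poison(p, len);
	free(p);
}
```

A non-zero pattern also makes a stale pointer dereference easier to
recognize in a crash dump than a zero-filled object would.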
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	295
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	296	Destination tracking
				297	--------------------
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	298
				299	To help kill classes of bugs that result in kernel addresses being
				300	written to userspace, the destination of writes needs to be tracked. If
Kees Cook	c2ed674	2017-05-13 04:51:41 -0700	[diff] [blame]	301	the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
Kees Cook	9f80366	2016-05-16 19:27:28 -0700	[diff] [blame]	302	it should automatically censor sensitive values.