Blame - Documentation/x86/pti.txt - kernel/msm-4.9

Dave Hansen	4e6c2af	2018-01-05 09:44:36 -0800	[diff] [blame]	1	Overview
				2	========
				3
				4	Page Table Isolation (pti, previously known as KAISER[1]) is a
				5	countermeasure against attacks on the shared user/kernel address
				6	space such as the "Meltdown" approach[2].
				7
				8	To mitigate this class of attacks, we create an independent set of
				9	page tables for use only when running userspace applications. When
				10	the kernel is entered via syscalls, interrupts or exceptions, the
				11	page tables are switched to the full "kernel" copy. When the system
				12	switches back to user mode, the user copy is used again.
				13
				14	The userspace page tables contain only a minimal amount of kernel
				15	data: only what is needed to enter/exit the kernel such as the
				16	entry/exit functions themselves and the interrupt descriptor table
				17	(IDT). There are a few strictly unnecessary things that get mapped
				18	such as the first C function when entering an interrupt (see
				19	comments in pti.c).
				20
				21	This approach helps to ensure that side-channel attacks leveraging
				22	the paging structures do not function when PTI is enabled. It can be
				23	enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
				24	Once enabled at compile-time, it can be disabled at boot with the
				25	'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
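
As a rough illustration of the boot-time switches described above, the
sketch below scans a kernel command line for 'nopti' and 'pti='. The
helper name pti_boot_request() is hypothetical, the accepted values
(on, off, auto) are assumed from kernel-parameters.txt, and the "last
token wins" rule is a simplification of this sketch rather than a
statement about how the kernel resolves conflicting parameters.

```python
def pti_boot_request(cmdline: str) -> str:
    """Return 'on', 'off' or 'auto' for a kernel command line string.

    Hypothetical helper: it only reports what was requested at boot,
    not whether the running kernel actually has PTI enabled.
    """
    request = "auto"  # assumed default when neither parameter is given
    for token in cmdline.split():
        if token == "nopti":
            request = "off"
        elif token.startswith("pti="):
            value = token[len("pti="):]
            # Values assumed from kernel-parameters.txt; ignore others.
            if value in ("on", "off", "auto"):
                request = value
    return request
```

On a running system the input would typically be the contents of
/proc/cmdline, for example pti_boot_request(open("/proc/cmdline").read()).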
				26
				27	Page Table Management
				28	=====================
				29
				30	When PTI is enabled, the kernel manages two sets of page tables.
				31	The first set is very similar to the single set which is present in
				32	kernels without PTI. This includes a complete mapping of userspace
				33	that the kernel can use for things like copy_to_user().
				34
				35	Although _complete_, the user portion of the kernel page tables is
				36	crippled by setting the NX bit in the top level. This ensures
				37	that any missed kernel->user CR3 switch will immediately crash
				38	userspace upon executing its first instruction.
				39
				40	The userspace page tables map only the kernel data needed to enter
				41	and exit the kernel. This data is entirely contained in the 'struct
				42	cpu_entry_area' structure which is placed in the fixmap which gives
				43	each CPU's copy of the area a compile-time-fixed virtual address.
				44
				45	For new userspace mappings, the kernel makes the entries in its
				46	page tables like normal. The only difference is when the kernel
				47	makes entries in the top (PGD) level. In addition to setting the
				48	entry in the main kernel PGD, a copy of the entry is made in the
				49	userspace page tables' PGD.
				50
				51	This sharing at the PGD level also inherently shares all the lower
				52	layers of the page tables. This leaves a single, shared set of
				53	userspace page tables to manage. One PTE to lock, one set of
				54	accessed bits, dirty bits, etc...
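
The PGD handling described in this section can be sketched with a toy
model. This is not kernel code: the class PtiMm, the use of a Python
dict as a stand-in for a lower-level table, and the 256/256 split of
the PGD between userspace and kernel are assumptions made for
illustration. The model also leaves out the cpu_entry_area mapping
that the real userspace page tables carry.

```python
NX = 1 << 63          # x86-64 no-execute bit in a page table entry
PGD_ENTRIES = 512     # entries in one top-level table
USER_ENTRIES = 256    # assumed: lower half of the PGD maps userspace


class PtiMm:
    """Toy model of one process's kernel and userspace PGDs."""

    def __init__(self):
        self.kernel_pgd = [None] * PGD_ENTRIES
        self.user_pgd = [None] * PGD_ENTRIES

    def set_pgd(self, index, lower_table, flags=0):
        """Install a top-level entry pointing at lower_table."""
        if index < USER_ENTRIES:
            # Userspace mapping: the user copy gets the entry as-is,
            # the kernel copy gets it with NX set ("crippled").
            self.user_pgd[index] = (lower_table, flags)
            self.kernel_pgd[index] = (lower_table, flags | NX)
        else:
            # Kernel mapping: in this toy model only the kernel
            # page tables see it.
            self.kernel_pgd[index] = (lower_table, flags)
```

Because both PGD entries refer to the same lower_table object, a
change made through one set of page tables is visible through the
other, which mirrors the "one PTE to lock" property described above.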
				55
				56	Overhead
				57	========
				58
				59	Protection against side-channel attacks is important. But,
				60	this protection comes at a cost:
				61
				62	1. Increased Memory Use
				63	a. Each process now needs an order-1 PGD instead of order-0.
				64	(Consumes an additional 4k per process).
				65	b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
				66	aligned so that it can be mapped by setting a single PMD
				67	entry. This consumes nearly 2MB of RAM once the kernel
				68	is decompressed, but no space in the kernel image itself.
				69
				70	2. Runtime Cost
				71	a. CR3 manipulation to switch between the page table copies
				72	must be done at interrupt, syscall, and exception entry
				73	and exit (it can be skipped when the kernel is interrupted,
				74	though.) Moves to CR3 are on the order of a hundred
				75	cycles, and are required at every entry and exit.
				76	b. A "trampoline" must be used for SYSCALL entry. This
				77	trampoline depends on a smaller set of resources than the
				78	non-PTI SYSCALL entry code, so requires mapping fewer
				79	things into the userspace page tables. The downside is
				80	that stacks must be switched at entry time.
				81	c. Global pages are disabled for all kernel structures not
				82	mapped into both kernel and userspace page tables. This
				83	feature of the MMU allows different processes to share TLB
				84	entries mapping the kernel. Losing the feature means more
				85	TLB misses after a context switch. The actual loss of
				86	performance is very small, however, never exceeding 1%.
				87	d. Process Context IDentifiers (PCID) is a CPU feature that
				88	allows us to skip flushing the entire TLB when switching page
				89	tables by setting a special bit in CR3 when the page tables
				90	are changed. This makes switching the page tables (at context
				91	switch, or kernel entry/exit) cheaper. But, on systems with
				92	PCID support, the context switch code must flush both the user
				93	and kernel entries out of the TLB. The user PCID TLB flush is
				94	deferred until the exit to userspace, minimizing the cost.
				95	See intel.com/sdm for the gory PCID/INVPCID details.
				96	e. The userspace page tables must be populated for each new
				97	process. Even without PTI, the shared kernel mappings
				98	are created by copying top-level (PGD) entries into each
				99	new process. But, with PTI, there are now two kernel
				100	mappings: one in the kernel page tables that maps everything
				101	and one for the entry/exit structures. At fork(), we need to
				102	copy both.
				103	f. In addition to the fork()-time copying, there must also
				104	be an update to the userspace PGD any time a set_pgd() is done
				105	on a PGD used to map userspace. This ensures that the kernel
				106	and userspace copies always map the same userspace
				107	memory.
				108	g. On systems without PCID support, each CR3 write flushes
				109	the entire TLB. That means that each syscall, interrupt
				110	or exception flushes the TLB.
				111	h. INVPCID is a TLB-flushing instruction which allows flushing
				112	of TLB entries for non-current PCIDs. Some systems support
				113	PCIDs, but do not support INVPCID. On these systems, addresses
				114	can only be flushed from the TLB for the current PCID. When
				115	flushing a kernel address, we need to flush all PCIDs, so a
				116	single kernel address flush will require a TLB-flushing CR3
				117	write upon the next use of every PCID.
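
The difference between CR3 writes with and without PCID support can be
sketched with a toy TLB. This is a simplified model under stated
assumptions, not a description of real hardware: the class ToyTlb is
hypothetical, the noflush argument stands in for the "special bit in
CR3" mentioned above, PCID numbers are arbitrary, and global pages and
INVPCID are not modelled.

```python
class ToyTlb:
    """Toy TLB whose entries are tagged with the PCID that loaded them."""

    def __init__(self, pcid_supported):
        self.pcid_supported = pcid_supported
        self.current_pcid = 0
        self.entries = set()  # (pcid, virtual_address) pairs

    def touch(self, vaddr):
        """Simulate a memory access; return True on a TLB hit."""
        key = (self.current_pcid, vaddr)
        hit = key in self.entries
        self.entries.add(key)
        return hit

    def write_cr3(self, pcid=0, noflush=False):
        """Simulate a CR3 write that switches page tables."""
        if not self.pcid_supported:
            # Without PCID every CR3 write flushes the whole TLB.
            self.entries.clear()
            self.current_pcid = 0
            return
        if not noflush:
            # Only entries tagged with the target PCID are dropped.
            self.entries = {e for e in self.entries if e[0] != pcid}
        self.current_pcid = pcid
```

With pcid_supported=False, an address touched before write_cr3() always
misses afterwards. With pcid_supported=True and noflush=True, switching
away from a PCID and back again leaves its entries in place.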
				118
				119	Possible Future Work
				120	====================
				121	1. We can be more careful about not actually writing to CR3
				122	unless its value is actually changed.
				123	2. Allow PTI to be enabled/disabled at runtime in addition to the
				124	boot-time switching.
				125
				126	Testing
				127	=======
				128
				129	To test stability of PTI, the following test procedure is recommended,
				130	ideally doing all of these in parallel:
				131
				132	1. Set CONFIG_DEBUG_ENTRY=y
				133	2. Run several copies of all of the tools/testing/selftests/x86/ tests
				134	(excluding MPX and protection_keys) in a loop on multiple CPUs for
				135	several minutes. These tests frequently uncover corner cases in the
				136	kernel entry code. In general, old kernels might cause these tests
				137	themselves to crash, but they should never crash the kernel.
				138	3. Run the 'perf' tool in a mode (top or record) that generates many
				139	frequent performance monitoring non-maskable interrupts (see "NMI"
				140	in /proc/interrupts). This exercises the NMI entry/exit code which
				141	is known to trigger bugs in code paths that did not expect to be
				142	interrupted, including nested NMIs. Using "-c" boosts the rate of
				143	NMIs, and using two -c with separate counters encourages nested NMIs
				144	and less deterministic behavior.
				145
				146	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
				147
				148	4. Launch a KVM virtual machine.
				149	5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
				150	This has been a lightly-tested code path and needs extra scrutiny.
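
To confirm that the perf loop in step 3 is actually generating NMIs,
the "NMI" line of /proc/interrupts can be sampled before and after a
run. The sketch below is one possible way to do that; the helper name
nmi_total() is hypothetical, and it assumes the usual layout of a
label followed by one count per CPU and a trailing description.

```python
def nmi_total(interrupts_text: str) -> int:
    """Sum the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            total = 0
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description text
                total += int(field)
            return total
    return 0  # no NMI line found
```

Calling nmi_total(open("/proc/interrupts").read()) before and after a
perf run and subtracting the two values gives the number of NMIs taken
in between, summed over all CPUs.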
				151
				152	Debugging
				153	=========
				154
				155	Bugs in PTI cause a few different signatures of crashes
				156	that are worth noting here.
				157
				158	* Failures of the selftests/x86 code. Usually a bug in one of the
				159	more obscure corners of entry_64.S.
				160	* Crashes in early boot, especially around CPU bringup. Bugs
				161	in the trampoline code or mappings cause these.
				162	* Crashes at the first interrupt. Caused by bugs in entry_64.S,
				163	like screwing up a page table switch. Also caused by
				164	incorrectly mapping the IRQ handler entry code.
				165	* Crashes at the first NMI. The NMI code is separate from main
				166	interrupt handlers and can have bugs that do not affect
				167	normal interrupts. Also caused by incorrectly mapping NMI
				168	code. NMIs that interrupt the entry code must be very
				169	careful and can be the cause of crashes that show up when
				170	running perf.
				171	* Kernel crashes at the first exit to userspace. entry_64.S
				172	bugs, or failing to map some of the exit code.
				173	* Crashes at first interrupt that interrupts userspace. The paths
				174	in entry_64.S that return to userspace are sometimes separate
				175	from the ones that return to the kernel.
				176	* Double faults: overflowing the kernel stack because of page
				177	faults upon page faults. Caused by touching non-pti-mapped
				178	data in the entry code, or forgetting to switch to kernel
				179	CR3 before calling into C functions which are not pti-mapped.
				180	* Userspace segfaults early in boot, sometimes manifesting
				181	as mount(8) failing to mount the rootfs. These have
				182	tended to be TLB invalidation issues. Usually invalidating
				183	the wrong PCID, or otherwise missing an invalidation.
				184
				185	1. https://gruss.cc/files/kaiser.pdf
				186	2. https://meltdownattack.com/meltdown.pdf