KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

For spinlocks, kvm_lock is taken outside kvm->mmu_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.
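
The mutex ordering rules above can be written down as a small table and
checked mechanically. The following is a minimal user-space sketch, not
kernel code: the enum, the table and demo_order_ok() are hypothetical names
introduced only for illustration, and only the four mutex pairs listed
above are encoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for the mutexes named in the text. */
enum demo_lock {
	DEMO_KVM_LOCK,
	DEMO_VCPU_MUTEX,
	DEMO_SLOTS_LOCK,
	DEMO_IRQ_LOCK,
	DEMO_NR_LOCKS
};

/* demo_outside[a][b] is true when a must be taken outside (before) b. */
static const bool demo_outside[DEMO_NR_LOCKS][DEMO_NR_LOCKS] = {
	[DEMO_KVM_LOCK][DEMO_VCPU_MUTEX]  = true,
	[DEMO_KVM_LOCK][DEMO_SLOTS_LOCK]  = true,
	[DEMO_KVM_LOCK][DEMO_IRQ_LOCK]    = true,
	[DEMO_SLOTS_LOCK][DEMO_IRQ_LOCK]  = true,
};

/*
 * seq lists locks in the order they are acquired (each one nested inside
 * the previous ones). The sequence is rejected if any later lock is
 * documented as being taken outside an earlier one.
 */
static bool demo_order_ok(const enum demo_lock *seq, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(seq[i] < DEMO_NR_LOCKS);
		for (size_t j = i + 1; j < n; j++)
			if (demo_outside[seq[j]][seq[i]])
				return false;
	}
	return true;
}
```

With this table, the sequence kvm->lock, kvm->slots_lock, kvm->irq_lock is
accepted, while taking kvm->lock after kvm->irq_lock is rejected.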

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of the
   spte.

What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:
- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1, or
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because any change to these bits can be detected by cmpxchg.
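
The write-protection case can be sketched in user space with C11 atomics.
This is only an illustration under stated assumptions: the bit positions and
the demo_* names are hypothetical and do not match KVM's real spte layout,
and the precondition uses the two software bits described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define DEMO_SPTE_W              (1ull << 1)
#define DEMO_SPTE_HOST_WRITEABLE (1ull << 10)
#define DEMO_SPTE_MMU_WRITEABLE  (1ull << 11)

/*
 * Try to set the W bit without holding any lock. old_spte is the value
 * the caller read earlier. Returns true only if the spte still held
 * exactly old_spte when the compare-and-exchange ran; any concurrent
 * change to the spte makes the exchange fail and leaves it untouched.
 */
static bool demo_fast_set_w(_Atomic uint64_t *sptep, uint64_t old_spte)
{
	const uint64_t need = DEMO_SPTE_HOST_WRITEABLE | DEMO_SPTE_MMU_WRITEABLE;

	assert(sptep != NULL);

	/* Without both bits the fault must be handled under mmu-lock. */
	if ((old_spte & need) != need)
		return false;

	return atomic_compare_exchange_strong(sptep, &old_spte,
					      old_spte | DEMO_SPTE_W);
}
```

A concurrent update can be simulated in a single thread by storing a
different value into the spte between reading old_spte and calling
demo_fast_set_w(); the call then returns false and the spte is not modified.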

But we need to carefully check these cases:
1): The mapping from gfn to pfn
The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg. This is an ABA problem; for example, the case
below can happen:

At the beginning:
gpte = gfn1
gfn1 is mapped to pfn1 on host
spte is the shadow page table entry corresponding with gpte and
spte = pfn1

   VCPU 0                           VCPU 1
on fast page fault path:

   old_spte = *spte;
                                 pfn1 is swapped out:
                                    spte = 0;

                                 pfn1 is re-alloced for gfn2.

                                 gpte is changed to point to
                                 gfn2 by the guest:
                                    spte = pfn1;

   if (cmpxchg(spte, old_spte, old_spte+W) == old_spte)
        mark_page_dirty(vcpu->kvm, gfn1)
             OOPS!!!

We dirty-log for gfn1, which means the write to gfn2 is lost in the
dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
to pin gfn to pfn, because after gfn_to_pfn_atomic():
- We have held the refcount of pfn, which means the pfn can not be freed and
  be reused for another gfn.
- The pfn is writable, which means it can not be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

Currently, to simplify things, we disable fast page fault for
indirect shadow pages.

2): Dirty bit tracking
In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit can not be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating the spte, as in the case below:

At the beginning:
spte.W = 0
spte.Accessed = 1

   VCPU 0                                       VCPU 1
In mmu_spte_clear_track_bits():

   old_spte = *spte;

   /* 'if' condition is satisfied. */
   if (old_spte.Accessed == 1 &&
        old_spte.W == 0)
      spte = 0ull;
                                         on fast page fault path:
                                             spte.W = 1
                                         memory write on the spte:
                                             spte.Dirty = 1


   else
      old_spte = xchg(spte, 0ull)


   if (old_spte.Accessed == 1)
      kvm_set_pfn_accessed(spte.pfn);
   if (old_spte.Dirty == 1)
      kvm_set_pfn_dirty(spte.pfn);
      OOPS!!!

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this means
the spte is always atomically updated in this case.
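
The atomic update can be sketched with a C11 atomic exchange. This is a
user-space illustration only: the bit positions, struct demo_track and
demo_clear_track_bits() are hypothetical names, not KVM's. Because the old
value is fetched and the spte is cleared in one atomic step, Accessed and
Dirty bits set concurrently just before the clear are still observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define DEMO_SPTE_ACCESSED (1ull << 5)
#define DEMO_SPTE_DIRTY    (1ull << 6)

/* What the caller should propagate to the backing page. */
struct demo_track {
	bool accessed;
	bool dirty;
};

/*
 * Clear the spte and report the Accessed/Dirty state it had at the
 * moment of the clear. atomic_exchange() returns the previous value, so
 * there is no window between reading the spte and zeroing it.
 */
static struct demo_track demo_clear_track_bits(_Atomic uint64_t *sptep)
{
	assert(sptep != NULL);

	uint64_t old_spte = atomic_exchange(sptep, 0);
	struct demo_track t = {
		.accessed = (old_spte & DEMO_SPTE_ACCESSED) != 0,
		.dirty    = (old_spte & DEMO_SPTE_DIRTY) != 0,
	};
	return t;
}
```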

3): flush tlbs due to spte updated
If the spte is updated from writable to readonly, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make the path easy to audit, we check whether TLBs
need to be flushed for this reason in mmu_spte_update(), since this is a common
function to update spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, so the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
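
The marking and restoring steps can be sketched as plain bit manipulation.
The sketch below is an illustration under assumptions: the R/W/X positions
follow the EPT convention (bits 0 to 2) and bit 62 is the special mask named
above, but the saved-bits shift of 52 and all demo_* names are hypothetical.
It also omits the cmpxchg that would publish the restored value atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_R            (1ull << 0)
#define DEMO_W            (1ull << 1)
#define DEMO_X            (1ull << 2)
#define DEMO_RWX          (DEMO_R | DEMO_W | DEMO_X)
#define DEMO_SPECIAL_MASK (1ull << 62)
#define DEMO_SAVED_SHIFT  52	/* hypothetical location of the saved bits */
#define DEMO_SAVED_MASK   ((DEMO_R | DEMO_X) << DEMO_SAVED_SHIFT)

/* Mark a present spte for access tracking: save R/X, clear RWX. */
static uint64_t demo_mark_for_access_track(uint64_t spte)
{
	/* The saved-bit area and the special bit must be free. */
	assert((spte & (DEMO_SAVED_MASK | DEMO_SPECIAL_MASK)) == 0);

	uint64_t saved = (spte & (DEMO_R | DEMO_X)) << DEMO_SAVED_SHIFT;

	spte &= ~DEMO_RWX;	/* W is dropped and deliberately not saved */
	return spte | saved | DEMO_SPECIAL_MASK;
}

static bool demo_is_access_track(uint64_t spte)
{
	return (spte & DEMO_SPECIAL_MASK) && !(spte & DEMO_RWX);
}

/* Restore the saved R/X bits; set W only if the fault was a write. */
static uint64_t demo_restore_access_track(uint64_t spte, bool write_fault)
{
	uint64_t restored = (spte & DEMO_SAVED_MASK) >> DEMO_SAVED_SHIFT;

	spte &= ~(DEMO_SAVED_MASK | DEMO_SPECIAL_MASK);
	spte |= restored;
	if (write_fault)
		spte |= DEMO_W;
	return spte;
}
```

Restoring after a read fault yields the original spte without the W bit;
restoring after a write fault yields the original R/X bits plus W.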

3. Reference
------------

Name:           kvm_lock
Type:           spinlock_t
Arch:           any
Protects:       - vm_list

Name:           kvm_count_lock
Type:           raw_spinlock_t
Arch:           any
Protects:       - hardware virtualization enable/disable
Comment:        'raw' because hardware enabling/disabling must be atomic with
                respect to migration.

Name:           kvm_arch::tsc_write_lock
Type:           raw_spinlock_t
Arch:           x86
Protects:       - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
                - tsc offset in vmcb
Comment:        'raw' because updating the tsc offsets must not be preempted.

Name:           kvm->mmu_lock
Type:           spinlock_t
Arch:           any
Protects:       - shadow page/shadow tlb entry
Comment:        it is a spinlock since it is used in mmu notifier.

Name:           kvm->srcu
Type:           srcu lock
Arch:           any
Protects:       - kvm->memslots
                - kvm->buses
Comment:        The srcu read lock must be held while accessing memslots (e.g.
                when using gfn_to_* functions) and while accessing in-kernel
                MMIO/PIO address->device structure mapping (kvm->buses).
                The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
                if it is needed by multiple functions.

Name:           blocked_vcpu_on_cpu_lock
Type:           spinlock_t
Arch:           x86
Protects:       - blocked_vcpu_on_cpu
Comment:        This is a per-CPU lock and it is used for VT-d posted-interrupts.
                When VT-d posted-interrupts is supported and the VM has assigned
                devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
                protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
                a wakeup notification event because external interrupts from the
                assigned devices happen, we will find the vCPU on the list to
                wake up.