Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That is directly under the control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock. The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that needs
to be locked, the caller is assigned a unique reservation id/ticket from a
global counter. In case of deadlock while locking all the buffers associated
with an execbuf, the one with the lowest reservation ticket (i.e. the oldest
task) wins, and the one with the higher reservation id (i.e. the younger
task) unlocks all of the buffers that it has already locked, and then tries
again.

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, it waits for the other
transaction. Hence Wound-Wait.
Both algorithms are fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer
backoffs compared to Wait-Die, but is, on the other hand, associated with more
work than Wait-Die when recovering from a backoff. Wound-Wait is also a
preemptive algorithm in that transactions are wounded by other transactions,
and that requires a reliable way to pick up the wounded condition and preempt
the running transaction. Note that this is not the same as process preemption.
A Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
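
To make the two policies concrete, here is a purely illustrative sketch of the
decision taken on contention under each scheme. The function and enum names
are made up for this example and do not correspond to the kernel
implementation; a lower stamp simply means an older transaction:

enum contention_action { WAIT_FOR_UNLOCK, BACK_OFF_AND_DIE, WOUND_HOLDER };

static enum contention_action
on_contention(unsigned long holder_stamp, unsigned long waiter_stamp,
              bool wait_die)
{
        if (holder_stamp < waiter_stamp) {
                /* Holder is older than the contending waiter. */
                return wait_die ? BACK_OFF_AND_DIE /* Wait-Die: younger dies */
                                : WAIT_FOR_UNLOCK; /* Wound-Wait: younger waits */
        }
        /* Holder is younger than the contending waiter. */
        return wait_die ? WAIT_FOR_UNLOCK          /* Wait-Die: older waits */
                        : WOUND_HOLDER;            /* Wound-Wait: older wounds */
}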

Concepts
--------

Compared to normal mutexes, two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.
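
Putting these two concepts together, the life cycle of a transaction looks
roughly like the sketch below (this is only a skeleton, assuming a w/w class
named ww_class as in the Usage examples further down):

static void transaction_skeleton(void)
{
        struct ww_acquire_ctx ctx;

        ww_acquire_init(&ctx, &ww_class);

        /*
         * ... acquire all needed ww_mutexes with ww_mutex_lock(..., &ctx),
         * backing off and retrying on -EDEADLK as shown under Usage below ...
         */

        ww_acquire_done(&ctx);  /* optional: no further locks will be taken */

        /* ... use the resources protected by the locks ... */

        /* ... release every lock with ww_mutex_unlock() ... */

        ww_acquire_fini(&ctx);
}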

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:
  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.
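
For instance, ww_mutex_lock_interruptible() and
ww_mutex_lock_slow_interruptible() mirror the calls used in the examples below,
but additionally return -EINTR when interrupted by a signal. A minimal fragment
(not a complete function; obj and ctx are assumed to be set up as in the Usage
section):

ret = ww_mutex_lock_interruptible(&obj->lock, ctx);
if (ret == -EDEADLK) {
        /* drop all locks acquired so far, then sleep on this lock */
        ret = ww_mutex_lock_slow_interruptible(&obj->lock, ctx);
        /* now 0 on success, or -EINTR if a signal arrived while sleeping */
}
if (ret == -EINTR) {
        /* interrupted: unwind any held locks and bail out of the transaction */
}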

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you expect the number of
simultaneous competing transactions to be typically small, and you want to
reduce the number of rollbacks.
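
For example (the class names here are arbitrary, and a driver would typically
define one class per type of resource):

/* Wound-Wait class */
static DEFINE_WW_CLASS(my_resource_ww_class);

/* Wait-Die class */
static DEFINE_WD_CLASS(my_resource_wd_class);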

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
        struct ww_mutex lock;
        /* obj data */
};

struct obj_entry {
        struct list_head head;
        struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        list_for_each_entry (entry, list, head) {
                if (entry->obj == res_obj) {
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

err:
        list_for_each_entry_continue_reverse (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry (entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;

                        list_for_each_entry_continue_reverse (entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move buf to head of the list, this will point
                         * buf->next to the first unlocked entry,
                         * restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry;

        list_for_each_entry (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
}
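
A caller of methods #1 or #2 then looks roughly like the sketch below; the
obj_entry list is assumed to be populated already and execute() stands in for
whatever actually operates on the locked objects:

int submit(struct list_head *list)
{
        struct ww_acquire_ctx ctx;
        int ret;

        ret = lock_objs(list, &ctx);
        if (ret)
                return ret;     /* e.g. -EALREADY for a duplicate entry */

        execute(list);          /* hypothetical: work on the locked objects */

        unlock_objs(list, &ctx);

        return 0;
}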

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:
- They can handle lock-acquisition in any order, which allows us to start
  walking a graph from a starting point and then iteratively discovering new
  edges and locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held, there's no need for additional book-keeping to break cycles in the
  graph or keep track of which locks are already held (when using more than
  one node as a starting point).

Note that this approach differs in two important ways from the above methods:
- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand, the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 can be combined with method #3, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step, since -EALREADY is handled
internally (already-held objects are simply skipped) rather than propagated to
the caller. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here.

struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
        struct obj *entry, *temp;

        list_for_each_entry_safe (entry, temp, list, locked_list) {
                /* need to do that before unlocking, since only the current
                 * lock holder is allowed to use object */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        /* re-init loop start state */
        loop {
                /* magic code which walks over a graph and decides which objects
                 * to lock */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        __unlock_objs(list);
        ww_acquire_fini(ctx);
}

Method 4: Only lock a single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.
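
A minimal sketch, reusing the struct obj from methods #1 and #2 (the function
name is made up for this example):

int update_obj(struct obj *obj)
{
        int ret;

        /* No acquire context: this behaves exactly like a normal mutex lock. */
        ret = ww_mutex_lock(&obj->lock, NULL);
        if (ret)
                return ret;

        /* ... modify obj ... */

        ww_mutex_unlock(&obj->lock);

        return 0;
}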

Implementation Details
----------------------

Design:
  ww_mutex currently encapsulates a struct mutex, which means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wait/wound mutexes are not used.
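
  Concretely, the data structure looks roughly like the sketch below; the exact
  layout depends on the kernel version and on debug options, so treat this as
  an approximation rather than a reference:

    struct ww_mutex {
            struct mutex base;
            struct ww_acquire_ctx *ctx;
    #ifdef CONFIG_DEBUG_MUTEXES
            struct ww_class *ww_class;
    #endif
    };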

  We maintain the following invariants for the wait list:
  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  the wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.
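
  In heavily simplified pseudocode (the names below are illustrative only, not
  the kernel's internal identifiers), the check in the contended locking path
  amounts to:

    if (ctx->wounded) {
            /* An older transaction asked us to die; back off now that we
             * actually contend on a lock and a deadlock is possible. */
            ctx->wounded = false;   /* clear so that the retry can proceed */
            return -EDEADLK;        /* caller unwinds, calls lock_slow, retries */
    }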

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
  Special care has been taken to warn for as many cases of api abuse
  as possible. Some common api abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.
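
  A debug build for exercising w/w mutex users might therefore enable something
  like the following (CONFIG_DEBUG_WW_MUTEX_SLOWPATH additionally injects
  -EDEADLK backoff cases so that the retry paths of the examples above actually
  get tested):

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
    CONFIG_PROVE_LOCKING=y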

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.