Wait/Wound Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs perform operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That order is directly under the control
of userspace, and is a result of the sequence of GL calls that an
application makes, which creates the potential for deadlock. The problem
gets more complex when you consider that the kernel may need to migrate
the buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that needs
to be locked, the caller is assigned a unique reservation id/ticket from a
global counter. In case of deadlock while locking all the buffers associated
with an execbuf, the one with the lowest reservation ticket (i.e. the oldest
task) wins, and the one with the higher reservation id (i.e. the younger
task) unlocks all of the buffers that it has already locked, and then tries
again.

In the RDBMS literature this deadlock handling approach is called wait/wound:
The older task waits until it can acquire the contended lock. The younger task
needs to back off and drop all the locks it is currently holding, i.e. the
younger task is wounded.

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context.

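As a rough, illustrative sketch of how these two objects show up in code (the
class and function names here are just placeholders; the full patterns follow
in the Usage section below):

static DEFINE_WW_CLASS(slot_ww_class); /* the explicit w/w class */

void reserve_slots(void)
{
        struct ww_acquire_ctx ctx; /* holds the ticket for this acquisition */

        ww_acquire_init(&ctx, &slot_ww_class);
        /* ... lock all needed ww mutexes with &ctx, retrying on -EDEADLK ... */
        ww_acquire_done(&ctx);
        /* ... use the locked objects, then drop their locks ... */
        ww_acquire_fini(&ctx);
}
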
Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the wounded task
  after having dropped all already acquired locks. These functions have the
  _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:
  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required, but often you only want to acquire a
  single lock, in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

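For example, the signal-aware variants (ww_mutex_lock_interruptible and
ww_mutex_lock_slow_interruptible) follow the same pattern. A minimal,
illustrative sketch for just two locks; lock_pair() is not an existing helper,
and the caller is assumed to have called ww_acquire_init on ctx already:

static int lock_pair(struct ww_mutex *a, struct ww_mutex *b,
                     struct ww_acquire_ctx *ctx)
{
        struct ww_mutex *first = a, *second = b;
        int ret;

        ret = ww_mutex_lock_interruptible(first, ctx);
        if (ret)
                return ret;     /* interrupted by a signal: -EINTR */

        while ((ret = ww_mutex_lock_interruptible(second, ctx)) == -EDEADLK) {
                /* wounded: drop what we hold and sleep on the contended lock */
                ww_mutex_unlock(first);
                swap(first, second);
                ret = ww_mutex_lock_slow_interruptible(first, ctx);
                if (ret)
                        return ret;     /* -EINTR, nothing held anymore */
        }
        if (ret) {
                ww_mutex_unlock(first);
                return ret;     /* -EINTR */
        }
        return 0;
}
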
Usage
-----

There are three different ways to acquire multiple locks within the same w/w
class; the degenerate single-lock case is covered separately as method #4 at
the end. Common definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
        struct ww_mutex lock;
        /* obj data */
};

struct obj_entry {
        struct list_head head;
        struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is on the list twice. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        list_for_each_entry (entry, list, head) {
                if (entry->obj == res_obj) {
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

err:
        list_for_each_entry_continue_reverse (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. The same
duplicate-entry detection semantics via -EALREADY as in method #1 above apply,
but the list-reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry (entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;

                        list_for_each_entry_continue_reverse (entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move the contended entry to the head of the list;
                         * entry->head.next then points at the first unlocked
                         * entry, restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry;

        list_for_each_entry (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
}

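For illustration, a caller of these helpers could look like this (a sketch
only; submit_execbuf() and its buffer list are assumptions, not part of any
real driver interface):

int submit_execbuf(struct list_head *buffers)
{
        struct ww_acquire_ctx ctx;
        int ret;

        ret = lock_objs(buffers, &ctx); /* method #1 or #2 */
        if (ret)
                /* here only -EALREADY (duplicate entry) is possible; lock_objs
                 * has already dropped all locks and finished the context */
                return ret;

        /* ... queue the buffers up to the GPU ... */

        unlock_objs(buffers, &ctx);
        return 0;
}
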
Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:
- They can handle lock-acquisition in any order, which allows us to start
  walking a graph from a starting point and then iteratively discover new edges
  and lock down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or to keep track of which locks are already held (when using more than one
  node as a starting point).

Note that this approach differs in two important ways from the above methods:
- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK wound condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that method #1 or #2 can be combined with method #3, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and to then lock any additional objects affected by the operations
using method #3 below; a rough sketch of this combination follows the method #3
code. The backoff/retry procedure will be a bit more involved, since when the
dynamic locking step hits -EDEADLK we also need to unlock all the objects
acquired with the fixed list. But the w/w mutex debug checks will catch any
interface misuse for these cases.

Also, method #3 can't fail the lock acquisition step since it handles -EALREADY
internally instead of propagating it. Of course this would be different when
using the _interruptible variants, but that's outside of the scope of these
examples here.

struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
        struct obj *entry, *temp;

        list_for_each_entry_safe (entry, temp, list, locked_list) {
                /* need to do that before unlocking, since only the current
                 * lock holder is allowed to use the object */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        /* re-init loop start state */
        loop {
                /* magic code which walks over a graph and decides which objects
                 * to lock */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        __unlock_objs(list);
        ww_acquire_fini(ctx);
}

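For the combination of a fixed list with method #3 mentioned above, the backoff
could look roughly like the sketch below. This is purely illustrative and not
taken from any driver: lock_fixed_objs(), unlock_fixed_objs() and
lock_graph_objs() are hypothetical helpers built along the lines of methods #1
and #3, assumed to drop their own partial locks on failure and to store the
contended object in *contended when they return -EDEADLK.

int lock_all_objs(struct list_head *fixed, struct list_head *dynamic,
                  struct ww_acquire_ctx *ctx)
{
        struct obj *contended;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        ret = lock_fixed_objs(fixed, ctx, &contended);
        if (ret)
                goto err;

        ret = lock_graph_objs(dynamic, ctx, &contended);
        if (ret) {
                /* unlike plain method #3, the fixed list must be dropped too */
                unlock_fixed_objs(fixed);
                goto err;
        }

        ww_acquire_done(ctx);
        return 0;

err:
        if (ret == -EDEADLK) {
                /*
                 * Lost a stamp race: block on the contended lock so the older
                 * task can make progress, then retry everything from scratch.
                 * Keeping the contended lock across the retry (as methods #1
                 * and #3 do) would need extra bookkeeping to avoid locking it
                 * twice, so this sketch simply drops it again.
                 */
                ww_mutex_lock_slow(&contended->ww_mutex, ctx);
                ww_mutex_unlock(&contended->ww_mutex);
                goto retry;
        }

        ww_acquire_fini(ctx);
        return ret;
}
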
Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since with just one lock you can't produce a
deadlock within a single class. To simplify this case the w/w mutex api can be
used with a NULL context.

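For instance (a sketch using the struct obj from the method #1/#2 definitions
above; frob_obj() is just a placeholder name):

void frob_obj(struct obj *obj)
{
        /*
         * No acquire context: behaves exactly like mutex_lock().  The
         * __must_check return value is always 0 when ctx is NULL.
         */
        if (ww_mutex_lock(&obj->lock, NULL))
                return;

        /* ... modify obj ... */

        ww_mutex_unlock(&obj->lock);
}
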
Implementation Details
----------------------

Design:
ww_mutex currently encapsulates a struct mutex, which means there is no extra
overhead for normal mutex locks, which are far more common. As such there is
only a small increase in code size if wait/wound mutexes are not used.

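Roughly, the data structure looks like this (simplified; exact field names and
debug config guards vary between kernel versions, see include/linux/ww_mutex.h):

struct ww_mutex {
        struct mutex base;              /* the underlying normal mutex */
        struct ww_acquire_ctx *ctx;     /* acquire context of the current holder, if any */
#ifdef CONFIG_DEBUG_MUTEXES
        struct ww_class *ww_class;      /* used for interface-abuse checks */
#endif
};
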
We maintain the following invariants for the wait list:
(1) Waiters with an acquire context are sorted by stamp order; waiters
    without an acquire context are interspersed in FIFO order.
(2) Among waiters with contexts, only the first one can have other locks
    acquired already (ctx->acquired > 0). Note that this waiter may come
    after other waiters without contexts in the list.

In general, not much contention is expected. The locks are typically used to
serialize access to resources for devices.

Lockdep:
Special care has been taken to warn for as many cases of api abuse
as possible. Some common api abuses will be caught with
CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

Some of the errors which will be warned about:
- Forgetting to call ww_acquire_fini or ww_acquire_init.
- Attempting to lock more mutexes after ww_acquire_done.
- Attempting to lock the wrong mutex after -EDEADLK and
  unlocking all mutexes.
- Attempting to lock the right mutex after -EDEADLK,
  before unlocking all mutexes.
- Calling ww_mutex_lock_slow before -EDEADLK was returned.
- Unlocking mutexes with the wrong unlock function.
- Calling one of the ww_acquire_* twice on the same context.
- Using a different ww_class for the mutex than for the ww_acquire_ctx.
- Normal lockdep errors that can result in deadlocks.

Some of the lockdep errors that can result in deadlocks:
- Calling ww_acquire_init to initialize a second ww_acquire_ctx before
  having called ww_acquire_fini on the first.
- 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.