drm/i915: Modify error handler for per engine hang recovery This is a preparatory patch which modifies error handler to do per engine hang recovery. The actual patch which implements this sequence follows later in the series. The aim is to prepare existing recovery function to adapt to this new function where applicable (which fails at this point because core implementation is lacking) and continue recovery using legacy full gpu reset. A helper function is also added to query the availability of engine reset. A subsequent patch will add the capability to query which type of reset is present (engine -> full -> no-reset) via the get-param ioctl. It has been decided that the error events that are used to notify user of reset will only be sent in case if full chip reset. In case of just single (or multiple) engine resets, userspace won't be notified by these events. Note that this implementation of engine reset is for i915 directly submitting to the ELSP, where the driver manages the hang detection, recovery and resubmission. With GuC submission these tasks are shared between driver and firmware; i915 will still responsible for detecting a hang, and when it does it will have to request GuC to reset that Engine and remind the firmware about the outstanding submissions. This will be added in different patch. v2: rebase, advertise engine reset availability in platform definition, add note about GuC submission. v3: s/*engine_reset*/*reset_engine*/. (Chris) Handle reset as 2 level resets, by first going to engine only and fall backing to full/chip reset as needed, i.e. reset_engine will need the struct_mutex. v4: Pass the engine mask to i915_reset. (Chris) v5: Rebase, update selftests. v6: Rebase, prepare for mutex-less reset engine. v7: Pass reset_engine mask as a function parameter, and iterate over the engine mask for reset_engine. (Chris) v8: Use i915.reset >=2 in has_reset_engine; remove redundant reset logging; add a reset-engine-in-progress flag to prevent concurrent resets, and avoid dual purposing of reset-backoff. (Chris) v9: Support reset of different engines in parallel (Chris) v10: Handle reset-engine flag locking better (Chris) v11: Squash in reporting of per-engine-reset availability. Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Signed-off-by: Ian Lister <ian.lister@intel.com> Signed-off-by: Tomas Elf <tomas.elf@intel.com> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> Signed-off-by: Michel Thierry <michel.thierry@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170615201828.23144-4-michel.thierry@intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/20170620095751.13127-5-chris@chris-wilson.co.uk

commit: 142bc7d99bcfd17a9bc66b46eb1b5d1b93364549 [log] [tgz]
author: Michel Thierry <michel.thierry@intel.com> Tue Jun 20 10:57:46 2017 +0100
committer: Chris Wilson <chris@chris-wilson.co.uk> Tue Jun 20 21:00:11 2017 +0100
tree: 131dbc939cab57f583446a883e7e3c57fddadae2
parent: ed35dd7b259f10bff34e64847995f291ae8b490c [diff] [blame]
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8e9f437..f25e73f 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c

@@ -2715,6 +2715,8 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
 		       const char *fmt, ...)
 {
+	struct intel_engine_cs *engine;
+	unsigned int tmp;
 	va_list args;
 	char error_msg[80];
 
@@ -2734,9 +2736,31 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	i915_capture_error_state(dev_priv, engine_mask, error_msg);
 	i915_clear_error_registers(dev_priv);
 
+	/*
+	 * Try engine reset when available. We fall back to full reset if
+	 * single reset fails.
+	 */
+	if (intel_has_reset_engine(dev_priv)) {
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			BUILD_BUG_ON(I915_RESET_HANDOFF >= I915_RESET_ENGINE);
+			if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
+					     &dev_priv->gpu_error.flags))
+				continue;
+
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~intel_engine_flag(engine);
+
+			clear_bit(I915_RESET_ENGINE + engine->id,
+				  &dev_priv->gpu_error.flags);
+			wake_up_bit(&dev_priv->gpu_error.flags,
+				    I915_RESET_ENGINE + engine->id);
+		}
+	}
+
 	if (!engine_mask)
 		goto out;
 
+	/* Full reset needs the mutex, stop any other user trying to do so. */
 	if (test_and_set_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags)) {
 		wait_event(dev_priv->gpu_error.reset_queue,
 			   !test_bit(I915_RESET_BACKOFF,
@@ -2744,8 +2768,22 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 		goto out;
 	}
 
+	/* Prevent any other reset-engine attempt. */
+	for_each_engine(engine, dev_priv, tmp) {
+		while (test_and_set_bit(I915_RESET_ENGINE + engine->id,
+					&dev_priv->gpu_error.flags))
+			wait_on_bit(&dev_priv->gpu_error.flags,
+				    I915_RESET_ENGINE + engine->id,
+				    TASK_UNINTERRUPTIBLE);
+	}
+
 	i915_reset_device(dev_priv);
 
+	for_each_engine(engine, dev_priv, tmp) {
+		clear_bit(I915_RESET_ENGINE + engine->id,
+			  &dev_priv->gpu_error.flags);
+	}
+
 	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
 	wake_up_all(&dev_priv->gpu_error.reset_queue);
commit	142bc7d99bcfd17a9bc66b46eb1b5d1b93364549	[log] [tgz]
author	Michel Thierry <michel.thierry@intel.com>	Tue Jun 20 10:57:46 2017 +0100
committer	Chris Wilson <chris@chris-wilson.co.uk>	Tue Jun 20 21:00:11 2017 +0100
tree	131dbc939cab57f583446a883e7e3c57fddadae2
parent	ed35dd7b259f10bff34e64847995f291ae8b490c [diff] [blame]