target: use a workqueue for I/O completions

Instead of abusing the target processing thread for offloading I/O
completion in the backends to user context add a new workqueue.  This means
completions can be processed as fast as available CPU time allows it,
including in parallel with other completions and more importantly I/O
submission or QUEUE FULL retries.  This should give much better performance
especially on loaded systems.

As a fallout we can merge all the completed states into a single
one.

On the downside this change complicates lun reset handling a bit by
requiring us to cancel a work item only for those states that have it
initialized.  The alternative would be to either always initialize the work
item to a dummy handler, or always use the same handler and do a switch on
the state. The long term solution will be a flag that says that the command
has an initialized work item, but that's only going to be useful once we
have more users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c
index 532ce31..570b144 100644
--- a/drivers/target/target_core_tmr.c
+++ b/drivers/target/target_core_tmr.c
@@ -255,6 +255,16 @@
 			atomic_read(&cmd->t_transport_stop),
 			atomic_read(&cmd->t_transport_sent));
 
+		/*
+		 * If the command may be queued onto a workqueue cancel it now.
+		 *
+		 * This is equivalent to removal from the execute queue in the
+		 * loop above, but we do it down here given that
+		 * cancel_work_sync may block.
+		 */
+		if (cmd->t_state == TRANSPORT_COMPLETE)
+			cancel_work_sync(&cmd->work);
+
 		spin_lock_irqsave(&cmd->t_state_lock, flags);
 		target_stop_task(task, &flags);