block: reimplement FLUSH/FUA to support merge
The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining. As such, sequencing is done
queue-wide one flush request after another. However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.
This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually. The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue. Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.
This allows arbitrary merging of different type of flushes. How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.
This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.
* As flush requests are never put on the IO scheduler, request fields
used for flush share space with rq->rb_node. rq->completion_data is
moved out of the union. This increases the request size by one
pointer.
As rq->elevator_private* are used only by the iosched too, it is
possible to reduce the request size further. However, to do that,
we need to modify request allocation path such that iosched data is
not allocated for flush requests.
* FLUSH/FUA processing happens on insertion now instead of dispatch.
- Comments updated as per Vivek and Mike.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36ab42c..6d7e9af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,13 +99,18 @@
/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
- * completion_data share space with the rb_node.
+ * flush fields share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
- void *completion_data;
+ struct {
+ unsigned int seq;
+ struct list_head list;
+ } flush;
};
+ void *completion_data;
+
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it.
@@ -362,11 +367,12 @@
* for flush operations
*/
unsigned int flush_flags;
- unsigned int flush_seq;
- int flush_err;
+ unsigned int flush_pending_idx:1;
+ unsigned int flush_running_idx:1;
+ unsigned long flush_pending_since;
+ struct list_head flush_queue[2];
+ struct list_head flush_data_in_flight;
struct request flush_rq;
- struct request *orig_flush_rq;
- struct list_head pending_flushes;
struct mutex sysfs_lock;