.\" Copyright (C) 2020 Shuveb Hussain <shuveb@gmail.com>
.\" SPDX-License-Identifier: LGPL-2.0-or-later
.\"
.TH IO_URING 7 2020-07-26 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring \- Asynchronous I/O facility
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific API for asynchronous I/O.
It allows the user to submit one or more I/O requests,
which are processed asynchronously without blocking the calling process.
.B io_uring
gets its name from ring buffers, which are shared between user space and
kernel space.
This arrangement allows for efficient I/O,
while avoiding the overhead of copying buffers between them,
where possible.
This sets
.B io_uring
apart from other UNIX I/O APIs:
rather than communicating between kernel and user space purely through
system calls,
ring buffers are used as the main mode of communication.
This arrangement has various performance benefits, which are discussed in a
separate section below.
This man page uses the terms shared buffers, shared ring buffers and
queues interchangeably.
.PP
The general programming model you need to follow for
.B io_uring
is outlined below:
.IP \(bu
Set up shared buffers with
.BR io_uring_setup (2)
and
.BR mmap (2),
mapping into user space shared buffers for the submission queue (SQ) and the
completion queue (CQ).
You place I/O requests you want to make on the SQ,
while the kernel places the results of those operations on the CQ.
.IP \(bu
For every I/O request you need to make (such as reading a file,
writing a file,
accepting a socket connection,
etc.),
you create a submission queue entry,
or SQE,
describe the I/O operation you need to get done and add it to the tail of
the submission queue (SQ).
Each I/O operation is,
in essence,
the equivalent of a system call you would have made otherwise,
if you were not using
.BR io_uring .
You can add more than one SQE to the queue depending on the number of
operations you want to request.
.IP \(bu
After you add one or more SQEs,
you need to call
.BR io_uring_enter (2)
to tell the kernel to dequeue your I/O requests off the SQ and begin
processing them.
.IP \(bu
For each SQE you submit,
once it is done processing the request,
the kernel places a completion queue event or CQE at the tail of the
completion queue or CQ.
The kernel places exactly one matching CQE in the CQ for every SQE you
submit on the SQ.
After you retrieve a CQE,
minimally,
you might be interested in checking the
.I res
field of the CQE structure,
which corresponds to the return value of the system
call's equivalent,
had you used it directly without using
.BR io_uring .
For instance,
a read operation under
.BR io_uring ,
started with the
.B IORING_OP_READ
operation,
which issues the equivalent of the
.BR read (2)
system call,
would return as part of
.I res
what
.BR read (2)
would have returned if called directly,
without using
.BR io_uring .
.IP \(bu
Optionally,
.BR io_uring_enter (2)
can also wait for a specified number of requests to be processed by the kernel
before it returns.
If you specified a certain number of completions to wait for,
the kernel would have placed at least that many CQEs on the CQ,
which you can then readily read,
right after the return from
.BR io_uring_enter (2).
.IP \(bu
It is important to remember that I/O requests submitted to the kernel can
complete in any order.
It is not necessary for the kernel to process one request after another,
in the order you placed them.
Given that the interface is a ring,
the requests are attempted in order;
however, that does not imply any ordering of their completions.
When more than one request is in flight,
it is not possible to determine which one will complete first.
When you dequeue CQEs off the CQ,
you should always check which submitted request each one corresponds to.
The most common method for doing so is utilizing the
.I user_data
field in the request, which is passed back on the completion side.
.PP
Adding to and reading from the queues:
.IP \(bu
You add SQEs to the tail of the SQ.
The kernel reads SQEs off the head of the queue.
.IP \(bu
The kernel adds CQEs to the tail of the CQ.
You read CQEs off the head of the queue.
.SS Submission queue polling
One of the goals of
.B io_uring
is to provide a means for efficient I/O.
To this end,
.B io_uring
supports a polling mode that lets you avoid the call to
.BR io_uring_enter (2),
which you use to inform the kernel that you have queued SQEs on to the SQ.
With SQ Polling,
.B io_uring
starts a kernel thread that polls the submission queue for any I/O
requests you submit by adding SQEs.
With SQ Polling enabled,
there is no need for you to call
.BR io_uring_enter (2),
letting you avoid the overhead of system calls.
A designated kernel thread dequeues SQEs off the SQ as you add them and
dispatches them for asynchronous processing.
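.PP
The following is a minimal sketch of how SQ Polling might be requested at
ring setup time.
It assumes
.I io_uring_setup()
and
.I io_uring_enter()
wrappers around the raw system calls,
and an
.I sq_ring_flags
pointer mapped from the SQ ring at offset
.IR p.sq_off.flags ;
see the setup example below for how the ring is mapped.
Note also that using this mode may require elevated privileges,
depending on the kernel version.
.PP
.in +4n
.EX
struct io_uring_params p;

memset(&p, 0, sizeof(p));
/* Request a kernel SQ polling thread that idles after 2000 ms without I/O */
p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;
ring_fd = io_uring_setup(QUEUE_DEPTH, &p);

/*
 * Later, after queueing SQEs: while the polling thread is running, no
 * system call is needed at all. If it has gone idle, the kernel sets
 * IORING_SQ_NEED_WAKEUP in the SQ ring flags and it must be woken up.
 */
if (*sq_ring_flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
.EE
.in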
.SS Setting up io_uring
.PP
The following example function sets up
.B io_uring
with a QUEUE_DEPTH deep submission queue.
Pay attention to the two
.BR mmap (2)
calls that set up the shared submission and completion queues.
If your kernel is older than version 5.4,
three
.BR mmap (2)
calls are required.
.PP
.in +4n
.EX
int app_setup_uring(void) {
    struct io_uring_params p;
    void *sq_ptr, *cq_ptr;

    /* See io_uring_setup(2) for io_uring_params.flags you can set */
    memset(&p, 0, sizeof(p));
    ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
    if (ring_fd < 0) {
        perror("io_uring_setup");
        return 1;
    }

    /*
     * io_uring communication happens via 2 shared kernel-user space ring
     * buffers, which can be jointly mapped with a single mmap() call in
     * kernels >= 5.4.
     */

    int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

    /* Rather than check for kernel version, the recommended way is to
     * check the features field of the io_uring_params structure, which is a
     * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
     * second mmap() call to map in the completion ring separately.
     */
    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        if (cring_sz > sring_sz)
            sring_sz = cring_sz;
        cring_sz = sring_sz;
    }

    /* Map in the submission and completion queue ring buffers.
     * Kernels < 5.4 only map in the submission queue, though.
     */
    sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE,
                  ring_fd, IORING_OFF_SQ_RING);
    if (sq_ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        cq_ptr = sq_ptr;
    } else {
        /* Map in the completion queue ring buffer in older kernels separately */
        cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring_fd, IORING_OFF_CQ_RING);
        if (cq_ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }
    /* Save useful fields for later easy reference */
    sring_tail = sq_ptr + p.sq_off.tail;
    sring_mask = sq_ptr + p.sq_off.ring_mask;
    sring_array = sq_ptr + p.sq_off.array;

    /* Map in the submission queue entries array */
    sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                ring_fd, IORING_OFF_SQES);
    if (sqes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Save useful fields for later easy reference */
    cring_head = cq_ptr + p.cq_off.head;
    cring_tail = cq_ptr + p.cq_off.tail;
    cring_mask = cq_ptr + p.cq_off.ring_mask;
    cqes = cq_ptr + p.cq_off.cqes;

    return 0;
}
.EE
.in
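.PP
The examples in this page call
.I io_uring_setup()
and
.I io_uring_enter()
directly, but glibc does not currently provide wrappers for these system
calls.
A minimal sketch of such wrappers, using
.BR syscall (2),
might look like the following (the signal-mask arguments of
.BR io_uring_enter (2)
are omitted for brevity):
.PP
.in +4n
.EX
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical wrappers; programs not using liburing define their own. */
int io_uring_setup(unsigned int entries, struct io_uring_params *p)
{
    return (int) syscall(__NR_io_uring_setup, entries, p);
}

int io_uring_enter(int ring_fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags)
{
    return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                         min_complete, flags, NULL, 0);
}
.EE
.in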
.SS Submitting I/O requests
The process of submitting a request consists of describing the I/O
operation you need to get done using an
.B io_uring_sqe
structure instance.
These details describe the equivalent system call and its parameters.
Because the range of I/O operations Linux supports is very varied and the
.B io_uring_sqe
structure needs to be able to describe them,
it has several fields,
some packed into unions for space efficiency.
Here is a simplified version of struct
.B io_uring_sqe
with some of the most often used fields:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __s32   fd;         /* file descriptor to do IO on */
    __u64   off;        /* offset into file */
    __u64   addr;       /* pointer to buffer or iovecs */
    __u32   len;        /* buffer size or number of iovecs */
    __u64   user_data;  /* data to be passed back at completion time */
    __u8    flags;      /* IOSQE_ flags */
    ...
};
.EE
.in
.PP
Here is struct
.B io_uring_sqe
in full:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __u8    flags;      /* IOSQE_ flags */
    __u16   ioprio;     /* ioprio for the request */
    __s32   fd;         /* file descriptor to do IO on */
    union {
        __u64   off;    /* offset into file */
        __u64   addr2;
    };
    union {
        __u64   addr;   /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;        /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;    /* compatibility */
        __u32           poll32_events;  /* word-reversed for BE */
        __u32           sync_range_flags;
        __u32           msg_flags;
        __u32           timeout_flags;
        __u32           accept_flags;
        __u32           cancel_flags;
        __u32           open_flags;
        __u32           statx_flags;
        __u32           fadvise_advice;
        __u32           splice_flags;
    };
    __u64   user_data;  /* data to be passed back at completion time */
    union {
        struct {
            /* pack this to avoid bogus arm OABI complaints */
            union {
                /* index into fixed buffers, if used */
                __u16   buf_index;
                /* for grouped buffer selection */
                __u16   buf_group;
            } __attribute__((packed));
            /* personality to use, if used */
            __u16   personality;
            __s32   splice_fd_in;
        };
        __u64   __pad2[3];
    };
};
.EE
.in
.PP
To submit an I/O request to
.BR io_uring ,
you need to acquire a submission queue entry (SQE) from the submission
queue (SQ),
fill it up with details of the operation you want to submit and call
.BR io_uring_enter (2).
If you want to avoid calling
.BR io_uring_enter (2),
you have the option of setting up Submission Queue Polling.
.PP
SQEs are added to the tail of the submission queue.
The kernel picks up SQEs off the head of the SQ.
The general algorithm to get the next available SQE and update the tail is
as follows.
.PP
.in +4n
.EX
struct io_uring_sqe *sqe;
unsigned tail, index;
tail = *sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* fill up details about this I/O request */
describe_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
atomic_store_release(sqring->tail, tail);
.EE
.in
.PP
To get the index of an entry,
the application must mask the current tail index with the size mask of the
ring.
This holds true for both SQs and CQs.
Once the SQE is acquired,
the necessary fields are filled in,
describing the request.
While the CQ ring directly indexes the shared array of CQEs,
the submission side has an indirection array between them.
The submission side ring buffer is an index into this array,
which in turn contains the index into the array of SQEs.
.PP
The following code snippet demonstrates how a read operation,
an equivalent of a
.BR preadv2 (2)
system call,
is described by filling up an SQE with the necessary parameters.
.PP
.in +4n
.EX
struct iovec iovecs[16];
 ...
sqe->opcode = IORING_OP_READV;
sqe->fd = fd;
sqe->addr = (unsigned long) iovecs;
sqe->len = 16;
sqe->off = offset;
sqe->flags = 0;
.EE
.in
.TP
.B Memory ordering
To optimize performance,
modern compilers and CPUs freely reorder reads and writes as long as the
outcome of a single thread of execution is unaffected.
Some aspects of this need to be kept in mind on SMP systems since
.B io_uring
involves buffers shared between kernel and user space.
These buffers are both visible and modifiable from kernel and user space.
As heads and tails belonging to these shared buffers are updated by kernel
and user space,
changes need to be coherently visible on either side,
irrespective of whether a CPU switch took place after the kernel-user mode
switch happened.
We use memory barriers to enforce this coherency.
Being a significantly large subject on its own,
a full treatment of memory barriers is out of scope for this man page.
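.PP
The code snippets in this page use
.I atomic_load_acquire()
and
.I atomic_store_release()
as shorthand for loads and stores with the required barrier semantics.
These are not functions provided by any system header;
a minimal sketch of how such helpers might be defined,
using the GCC/Clang __atomic built-ins,
is shown below.
.PP
.in +4n
.EX
/* Hypothetical helpers; one possible definition of the barriers used here */
static inline unsigned atomic_load_acquire(const unsigned *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static inline void atomic_store_release(unsigned *p, unsigned v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
.EE
.in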
.TP
.B Letting the kernel know about I/O submissions
Once you place one or more SQEs on to the SQ,
you need to let the kernel know that you've done so.
You can do this by calling the
.BR io_uring_enter (2)
system call.
This system call is also capable of waiting for a specified count of
events to complete.
This way,
you can be sure to find completion events in the completion queue without
having to poll it for events later.
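.PP
As a sketch,
assuming the simplified
.I io_uring_enter()
wrapper shown earlier and a count of freshly queued SQEs in
.IR to_submit ,
the following submits them and waits for at least one of them to complete
(see
.BR io_uring_enter (2)
for the full set of arguments and flags):
.PP
.in +4n
.EX
/* Submit 'to_submit' SQEs and wait until at least one completion arrives */
int ret = io_uring_enter(ring_fd, to_submit, 1, IORING_ENTER_GETEVENTS);
if (ret < 0)
    perror("io_uring_enter");
.EE
.in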
.SS Reading completion events
Similar to the submission queue (SQ),
the completion queue (CQ) is a shared buffer between the kernel and user
space.
Whereas you placed submission queue entries on the tail of the SQ and the
kernel read off the head,
when it comes to the CQ,
the kernel places completion queue events or CQEs on the tail of the CQ and
you read off its head.
.PP
Submission is flexible (and thus a bit more complicated) since it needs to
be able to encode different types of system calls that take various
parameters.
Completion,
on the other hand,
is simpler since we're looking only for a return value
back from the kernel.
This is easily understood by looking at the completion queue event
structure,
struct
.BR io_uring_cqe :
.PP
.in +4n
.EX
struct io_uring_cqe {
    __u64   user_data;  /* sqe->data submission passed back */
    __s32   res;        /* result code for this event */
    __u32   flags;
};
.EE
.in
.PP
Here,
.I user_data
is custom data that is passed unchanged from submission to completion.
That is,
from SQEs to CQEs.
This field can be used to set context,
uniquely identifying submissions that got completed.
Given that I/O requests can complete in any order,
this field can be used to correlate a submission with a completion.
.I res
is the result from the system call that was performed as part of the
submission;
that is, its return value.
The
.I flags
field could carry request-specific metadata in the future,
but is currently unused.
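.PP
One common pattern is to store a pointer to a per-request structure of
your own in
.I user_data
at submission time and recover it from the CQE.
The sketch below assumes such a structure,
here called struct app_request,
which is purely illustrative and not part of the API:
.PP
.in +4n
.EX
struct app_request *req;    /* application-defined per-request state */
 ...
/* at submission time */
sqe->user_data = (__u64) (uintptr_t) req;
 ...
/* at completion time */
req = (struct app_request *) (uintptr_t) cqe->user_data;
if (cqe->res < 0)
    /* on failure, res holds a negated errno value */
    fprintf(stderr, "request failed: %s\en", strerror(-cqe->res));
.EE
.in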
.PP
The general sequence to read completion events off the completion queue is
as follows:
.PP
.in +4n
.EX
unsigned head;
head = *cqring->head;
if (head != atomic_load_acquire(cqring->tail)) {
    struct io_uring_cqe *cqe;
    unsigned index;
    index = head & (cqring->mask);
    cqe = &cqring->cqes[index];
    /* process completed CQE */
    process_cqe(cqe);
    /* CQE consumption complete */
    head++;
}
atomic_store_release(cqring->head, head);
.EE
.in
.PP
Keep in mind that the kernel adds CQEs to the tail of the CQ,
while you dequeue them off the head.
To get the index of an entry at the head,
the application must mask the current head index with the size mask of the
ring.
Once the CQE has been consumed or processed,
the head needs to be updated to reflect the consumption of the CQE.
Attention should be paid to the read and write barriers to ensure
successful read and update of the head.
.SS io_uring performance
Because of the shared ring buffers between kernel and user space,
.B io_uring
can be a zero-copy system.
Copying buffers to and fro becomes necessary when system calls that
transfer data between kernel and user space are involved.
But since the bulk of the communication in
.B io_uring
is via buffers shared between the kernel and user space,
this huge performance overhead is completely avoided.
.PP
While system calls may not seem like a significant overhead,
in high performance applications,
making a lot of them will begin to matter.
Ideally, the workarounds the operating system has in place to deal with
Spectre and Meltdown would not be needed,
but unfortunately,
some of these workarounds affect the system call path,
making system calls more expensive than they used to be on affected hardware.
While newer hardware should not need these workarounds,
hardware with these vulnerabilities can be expected to be in the wild for a
long time.
Whether you use the synchronous programming interfaces or the other
asynchronous programming interfaces that Linux offers,
there is at least one system call involved in the submission of each
request.
In
.BR io_uring ,
on the other hand,
you can batch several requests in one go,
simply by queueing up multiple SQEs,
each describing an I/O operation you want,
and then making a single call to
.BR io_uring_enter (2).
This is possible due to
.BR io_uring 's
shared buffers based design.
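.PP
As a sketch,
queueing a batch of eight SQEs costs a single system call,
assuming the simplified
.I io_uring_enter()
wrapper from earlier and an illustrative
.I add_read_request()
helper that performs the SQE-filling and tail-update steps shown above:
.PP
.in +4n
.EX
/* Queue up a batch of requests, then submit them all at once */
for (int i = 0; i < 8; i++)
    add_read_request(fds[i]);

io_uring_enter(ring_fd, 8, 0, 0);
.EE
.in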
.PP
While this batching in itself can avoid the overhead associated with
potentially multiple and frequent system calls,
you can reduce even this overhead further with Submission Queue Polling,
by having the kernel poll and pick up your SQEs for processing as you add
them to the submission queue. This avoids the
.BR io_uring_enter (2)
call you need to make to tell the kernel to pick SQEs up.
For high-performance applications,
this means even less system call overhead.
.SH CONFORMING TO
.B io_uring
is Linux-specific.
.SH SEE ALSO
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)