.\" Copyright (C) 2020 Shuveb Hussain <shuveb@gmail.com>
.\" SPDX-License-Identifier: LGPL-2.0-or-later
.\"
.TH IO_URING 7 2020-07-26 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring \- Asynchronous I/O facility
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific API for asynchronous I/O.
It allows the user to submit one or more I/O requests,
which are processed asynchronously without blocking the calling process.
.B io_uring
gets its name from ring buffers, which are shared between user space and
kernel space.
This arrangement allows for efficient I/O,
while avoiding the overhead of copying buffers between them,
where possible.
This sets
.B io_uring
apart from other UNIX I/O APIs:
rather than communicating between kernel and user space purely through
system calls,
ring buffers are used as the main mode of communication.
This arrangement has various performance benefits, which are discussed in a
separate section below.
This man page uses the terms shared buffers, shared ring buffers and
queues interchangeably.
.PP
The general programming model you need to follow for
.B io_uring
is outlined below:
.IP \(bu
Set up shared buffers with
.BR io_uring_setup (2)
and
.BR mmap (2),
mapping into user space shared buffers for the submission queue (SQ) and the
completion queue (CQ).
You place I/O requests you want to make on the SQ,
while the kernel places the results of those operations on the CQ.
.IP \(bu
For every I/O request you need to make (such as reading a file,
writing a file,
accepting a socket connection,
etc.),
you create a submission queue entry,
or SQE,
describe the I/O operation you need to get done and add it to the tail of
the submission queue (SQ).
Each I/O operation is,
in essence,
the equivalent of a system call you would have made otherwise,
if you were not using
.BR io_uring .
You can add more than one SQE to the queue depending on the number of
operations you want to request.
.IP \(bu
After you add one or more SQEs,
you need to call
.BR io_uring_enter (2)
to tell the kernel to dequeue your I/O requests off the SQ and begin
processing them.
.IP \(bu
For each SQE you submit,
once it is done processing the request,
the kernel places a completion queue event or CQE at the tail of the
completion queue or CQ.
The kernel places exactly one matching CQE in the CQ for every SQE you
submit on the SQ.
After you retrieve a CQE,
minimally,
you might be interested in checking the
.I res
field of the CQE structure,
which corresponds to the return value of the system
call's equivalent,
had you used it directly without using
.BR io_uring .
For instance,
a read operation under
.BR io_uring ,
started with the
.B IORING_OP_READ
operation,
which issues the equivalent of the
.BR read (2)
system call,
would return as part of
.I res
what
.BR read (2)
would have returned if called directly,
without using
.BR io_uring .
.IP \(bu
Optionally,
.BR io_uring_enter (2)
can also wait for a specified number of requests to be processed by the kernel
before it returns.
If you specified a certain number of completions to wait for,
the kernel would have placed at least that many CQEs on the CQ,
which you can then readily read,
right after the return from
.BR io_uring_enter (2).
.IP \(bu
It is important to remember that I/O requests submitted to the kernel can
complete in any order.
It is not necessary for the kernel to process one request after another,
in the order you placed them.
Given that the interface is a ring,
the requests are attempted in order;
however, that does not imply any ordering of their completions.
When more than one request is in flight,
it is not possible to determine which one will complete first.
When you dequeue CQEs off the CQ,
you should always check which submitted request each one corresponds to.
The most common method for doing so is utilizing the
.I user_data
field in the request, which is passed back on the completion side.
.PP
Adding to and reading from the queues:
.IP \(bu
You add SQEs to the tail of the SQ.
The kernel reads SQEs off the head of the queue.
.IP \(bu
The kernel adds CQEs to the tail of the CQ.
You read CQEs off the head of the queue.
.SS Submission queue polling
One of the goals of
.B io_uring
is to provide a means for efficient I/O.
To this end,
.B io_uring
supports a polling mode that lets you avoid the call to
.BR io_uring_enter (2),
which you use to inform the kernel that you have queued SQEs on to the SQ.
With SQ Polling,
.B io_uring
starts a kernel thread that polls the submission queue for any I/O
requests you submit by adding SQEs.
With SQ Polling enabled,
there is no need for you to call
.BR io_uring_enter (2),
letting you avoid the overhead of system calls.
A designated kernel thread dequeues SQEs off the SQ as you add them and
dispatches them for asynchronous processing.
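.PP
The following is a minimal sketch of how SQ Polling might be requested at
ring setup time.
It assumes
.I io_uring_setup()
and
.I io_uring_enter()
wrappers around the raw system calls,
and an
.I sq_ring_flags
pointer mapped from the SQ ring at offset
.IR p.sq_off.flags ;
see the setup example below for how the ring is mapped.
Note also that using this mode may require elevated privileges,
depending on the kernel version.
.PP
.in +4n
.EX
struct io_uring_params p;

memset(&p, 0, sizeof(p));
/* Request a kernel SQ polling thread that idles after 2000 ms without I/O */
p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;
ring_fd = io_uring_setup(QUEUE_DEPTH, &p);

/*
 * Later, after queueing SQEs: while the polling thread is running, no
 * system call is needed at all. If it has gone idle, the kernel sets
 * IORING_SQ_NEED_WAKEUP in the SQ ring flags and it must be woken up.
 */
if (*sq_ring_flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
.EE
.in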
.SS Setting up io_uring
.PP
The following example function sets up
.B io_uring
with a QUEUE_DEPTH deep submission queue.
Pay attention to the two
.BR mmap (2)
calls that set up the shared submission and completion queues.
If your kernel is older than version 5.4,
three
.BR mmap (2)
calls are required.
.PP
.in +4n
.EX
int app_setup_uring(void) {
    struct io_uring_params p;
    void *sq_ptr, *cq_ptr;

    /* See io_uring_setup(2) for io_uring_params.flags you can set */
    memset(&p, 0, sizeof(p));
    ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
    if (ring_fd < 0) {
        perror("io_uring_setup");
        return 1;
    }

    /*
     * io_uring communication happens via 2 shared kernel-user space ring
     * buffers, which can be jointly mapped with a single mmap() call in
     * kernels >= 5.4.
     */

    int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

    /* Rather than check for kernel version, the recommended way is to
     * check the features field of the io_uring_params structure, which is a
     * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
     * second mmap() call to map in the completion ring separately.
     */
    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        if (cring_sz > sring_sz)
            sring_sz = cring_sz;
        cring_sz = sring_sz;
    }

    /* Map in the submission and completion queue ring buffers.
     * Kernels < 5.4 only map in the submission queue, though.
     */
    sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE,
                  ring_fd, IORING_OFF_SQ_RING);
    if (sq_ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        cq_ptr = sq_ptr;
    } else {
        /* Map in the completion queue ring buffer in older kernels separately */
        cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring_fd, IORING_OFF_CQ_RING);
        if (cq_ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }
    /* Save useful fields for later easy reference */
    sring_tail = sq_ptr + p.sq_off.tail;
    sring_mask = sq_ptr + p.sq_off.ring_mask;
    sring_array = sq_ptr + p.sq_off.array;

    /* Map in the submission queue entries array */
    sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                ring_fd, IORING_OFF_SQES);
    if (sqes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Save useful fields for later easy reference */
    cring_head = cq_ptr + p.cq_off.head;
    cring_tail = cq_ptr + p.cq_off.tail;
    cring_mask = cq_ptr + p.cq_off.ring_mask;
    cqes = cq_ptr + p.cq_off.cqes;

    return 0;
}
.EE
.in
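.PP
The examples in this page call
.I io_uring_setup()
and
.I io_uring_enter()
directly, but glibc does not currently provide wrappers for these system
calls.
A minimal sketch of such wrappers, using
.BR syscall (2),
might look like the following (the signal-mask arguments of
.BR io_uring_enter (2)
are omitted for brevity):
.PP
.in +4n
.EX
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical wrappers; programs not using liburing define their own. */
int io_uring_setup(unsigned int entries, struct io_uring_params *p)
{
    return (int) syscall(__NR_io_uring_setup, entries, p);
}

int io_uring_enter(int ring_fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags)
{
    return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                         min_complete, flags, NULL, 0);
}
.EE
.in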
.SS Submitting I/O requests
The process of submitting a request consists of describing the I/O
operation you need to get done using an
.B io_uring_sqe
structure instance.
These details describe the equivalent system call and its parameters.
Because the range of I/O operations Linux supports is very varied and the
.B io_uring_sqe
structure needs to be able to describe them,
it has several fields,
some packed into unions for space efficiency.
Here is a simplified version of struct
.B io_uring_sqe
with some of the most often used fields:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __s32   fd;         /* file descriptor to do IO on */
    __u64   off;        /* offset into file */
    __u64   addr;       /* pointer to buffer or iovecs */
    __u32   len;        /* buffer size or number of iovecs */
    __u64   user_data;  /* data to be passed back at completion time */
    __u8    flags;      /* IOSQE_ flags */
    ...
};
.EE
.in
.PP
Here is struct
.B io_uring_sqe
in full:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __u8    flags;      /* IOSQE_ flags */
    __u16   ioprio;     /* ioprio for the request */
    __s32   fd;         /* file descriptor to do IO on */
    union {
        __u64   off;    /* offset into file */
        __u64   addr2;
    };
    union {
        __u64   addr;   /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;        /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;    /* compatibility */
        __u32           poll32_events;  /* word-reversed for BE */
        __u32           sync_range_flags;
        __u32           msg_flags;
        __u32           timeout_flags;
        __u32           accept_flags;
        __u32           cancel_flags;
        __u32           open_flags;
        __u32           statx_flags;
        __u32           fadvise_advice;
        __u32           splice_flags;
    };
    __u64   user_data;  /* data to be passed back at completion time */
    union {
        struct {
            /* pack this to avoid bogus arm OABI complaints */
            union {
                /* index into fixed buffers, if used */
                __u16   buf_index;
                /* for grouped buffer selection */
                __u16   buf_group;
            } __attribute__((packed));
            /* personality to use, if used */
            __u16   personality;
            __s32   splice_fd_in;
        };
        __u64   __pad2[3];
    };
};
.EE
.in
.PP
To submit an I/O request to
.BR io_uring ,
you need to acquire a submission queue entry (SQE) from the submission
queue (SQ),
fill it up with details of the operation you want to submit and call
.BR io_uring_enter (2).
If you want to avoid calling
.BR io_uring_enter (2),
you have the option of setting up Submission Queue Polling.
.PP
SQEs are added to the tail of the submission queue.
The kernel picks up SQEs off the head of the SQ.
The general algorithm to get the next available SQE and update the tail is
as follows.
.PP
.in +4n
.EX
struct io_uring_sqe *sqe;
unsigned tail, index;
tail = *sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* fill up details about this I/O request */
describe_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
atomic_store_release(sqring->tail, tail);
.EE
.in
.PP
To get the index of an entry,
the application must mask the current tail index with the size mask of the
ring.
This holds true for both SQs and CQs.
Once the SQE is acquired,
the necessary fields are filled in,
describing the request.
While the CQ ring directly indexes the shared array of CQEs,
the submission side has an indirection array between them.
The submission side ring buffer is an index into this array,
which in turn contains the index into the array of SQEs.
.PP
The following code snippet demonstrates how a read operation,
an equivalent of a
.BR preadv2 (2)
system call,
is described by filling up an SQE with the necessary parameters.
.PP
.in +4n
.EX
struct iovec iovecs[16];
 ...
sqe->opcode = IORING_OP_READV;
sqe->fd = fd;
sqe->addr = (unsigned long) iovecs;
sqe->len = 16;
sqe->off = offset;
sqe->flags = 0;
.EE
.in
.TP
.B Memory ordering
To optimize performance,
modern compilers and CPUs freely reorder reads and writes as long as the
outcome of a single thread of execution is unaffected.
Some aspects of this need to be kept in mind on SMP systems since
.B io_uring
involves buffers shared between kernel and user space.
These buffers are both visible and modifiable from kernel and user space.
As heads and tails belonging to these shared buffers are updated by kernel
and user space,
changes need to be coherently visible on either side,
irrespective of whether a CPU switch took place after the kernel-user mode
switch happened.
We use memory barriers to enforce this coherency.
Being a significantly large subject on its own,
a full treatment of memory barriers is out of scope for this man page.
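.PP
The code snippets in this page use
.I atomic_load_acquire()
and
.I atomic_store_release()
as shorthand for loads and stores with the required barrier semantics.
These are not functions provided by any system header;
a minimal sketch of how such helpers might be defined,
using the GCC/Clang __atomic built-ins,
is shown below.
.PP
.in +4n
.EX
/* Hypothetical helpers; one possible definition of the barriers used here */
static inline unsigned atomic_load_acquire(const unsigned *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static inline void atomic_store_release(unsigned *p, unsigned v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
.EE
.in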
.TP
.B Letting the kernel know about I/O submissions
Once you place one or more SQEs on to the SQ,
you need to let the kernel know that you've done so.
You can do this by calling the
.BR io_uring_enter (2)
system call.
This system call is also capable of waiting for a specified count of
events to complete.
This way,
you can be sure to find completion events in the completion queue without
having to poll it for events later.
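.PP
As a sketch,
assuming the simplified
.I io_uring_enter()
wrapper shown earlier and a count of freshly queued SQEs in
.IR to_submit ,
the following submits them and waits for at least one of them to complete
(see
.BR io_uring_enter (2)
for the full set of arguments and flags):
.PP
.in +4n
.EX
/* Submit 'to_submit' SQEs and wait until at least one completion arrives */
int ret = io_uring_enter(ring_fd, to_submit, 1, IORING_ENTER_GETEVENTS);
if (ret < 0)
    perror("io_uring_enter");
.EE
.in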
.SS Reading completion events
Similar to the submission queue (SQ),
the completion queue (CQ) is a shared buffer between the kernel and user
space.
Whereas you placed submission queue entries on the tail of the SQ and the
kernel read off the head,
when it comes to the CQ,
the kernel places completion queue events or CQEs on the tail of the CQ and
you read off its head.
.PP
Submission is flexible (and thus a bit more complicated) since it needs to
be able to encode different types of system calls that take various
parameters.
Completion,
on the other hand,
is simpler since we're looking only for a return value
back from the kernel.
This is easily understood by looking at the completion queue event
structure,
struct
.BR io_uring_cqe :
.PP
.in +4n
.EX
struct io_uring_cqe {
    __u64   user_data;  /* sqe->data submission passed back */
    __s32   res;        /* result code for this event */
    __u32   flags;
};
.EE
.in
.PP
Here,
.I user_data
is custom data that is passed unchanged from submission to completion.
That is,
from SQEs to CQEs.
This field can be used to set context,
uniquely identifying submissions that got completed.
Given that I/O requests can complete in any order,
this field can be used to correlate a submission with a completion.
.I res
is the result from the system call that was performed as part of the
submission;
that is, its return value.
The
.I flags
field could carry request-specific metadata in the future,
but is currently unused.
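.PP
One common pattern is to store a pointer to a per-request structure of
your own in
.I user_data
at submission time and recover it from the CQE.
The sketch below assumes such a structure,
here called struct app_request,
which is purely illustrative and not part of the API:
.PP
.in +4n
.EX
struct app_request *req;    /* application-defined per-request state */
 ...
/* at submission time */
sqe->user_data = (__u64) (uintptr_t) req;
 ...
/* at completion time */
req = (struct app_request *) (uintptr_t) cqe->user_data;
if (cqe->res < 0)
    /* on failure, res holds a negated errno value */
    fprintf(stderr, "request failed: %s\en", strerror(-cqe->res));
.EE
.in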
.PP
The general sequence to read completion events off the completion queue is
as follows:
.PP
.in +4n
.EX
unsigned head;
head = *cqring->head;
if (head != atomic_load_acquire(cqring->tail)) {
    struct io_uring_cqe *cqe;
    unsigned index;
    index = head & (cqring->mask);
    cqe = &cqring->cqes[index];
    /* process completed CQE */
    process_cqe(cqe);
    /* CQE consumption complete */
    head++;
}
atomic_store_release(cqring->head, head);
.EE
.in
.PP
Keep in mind that the kernel adds CQEs to the tail of the CQ,
while you dequeue them off the head.
To get the index of an entry at the head,
the application must mask the current head index with the size mask of the
ring.
Once the CQE has been consumed or processed,
the head needs to be updated to reflect the consumption of the CQE.
Attention should be paid to the read and write barriers to ensure
successful read and update of the head.
.SS io_uring performance
Because of the shared ring buffers between kernel and user space,
.B io_uring
can be a zero-copy system.
Copying buffers to and fro becomes necessary when system calls that
transfer data between kernel and user space are involved.
But since the bulk of the communication in
.B io_uring
is via buffers shared between the kernel and user space,
this huge performance overhead is completely avoided.
.PP
While system calls may not seem like a significant overhead,
in high performance applications,
making a lot of them will begin to matter.
Ideally, the workarounds the operating system has in place to deal with
Spectre and Meltdown would not be needed,
but unfortunately,
some of these workarounds affect the system call path,
making system calls more expensive than they used to be on affected hardware.
While newer hardware should not need these workarounds,
hardware with these vulnerabilities can be expected to be in the wild for a
long time.
Whether you use the synchronous programming interfaces or the other
asynchronous programming interfaces that Linux offers,
there is at least one system call involved in the submission of each
request.
In
.BR io_uring ,
on the other hand,
you can batch several requests in one go,
simply by queueing up multiple SQEs,
each describing an I/O operation you want,
and then making a single call to
.BR io_uring_enter (2).
This is possible due to
.BR io_uring 's
shared buffers based design.
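.PP
As a sketch,
queueing a batch of eight SQEs costs a single system call,
assuming the simplified
.I io_uring_enter()
wrapper from earlier and an illustrative
.I add_read_request()
helper that performs the SQE-filling and tail-update steps shown above:
.PP
.in +4n
.EX
/* Queue up a batch of requests, then submit them all at once */
for (int i = 0; i < 8; i++)
    add_read_request(fds[i]);

io_uring_enter(ring_fd, 8, 0, 0);
.EE
.in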
.PP
While this batching in itself can avoid the overhead associated with
potentially multiple and frequent system calls,
you can reduce even this overhead further with Submission Queue Polling,
by having the kernel poll and pick up your SQEs for processing as you add
them to the submission queue. This avoids the
.BR io_uring_enter (2)
call you need to make to tell the kernel to pick SQEs up.
For high-performance applications,
this means even less system call overhead.
.SH CONFORMING TO
.B io_uring
is Linux-specific.
.SH SEE ALSO
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)