.\" Copyright (C) 2020 Shuveb Hussain <shuveb@gmail.com>
.\" SPDX-License-Identifier: LGPL-2.0-or-later
.\"

.TH IO_URING 7 2020-07-26 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring \- Asynchronous I/O facility
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific API for asynchronous I/O.
It allows the user to submit one or more I/O requests,
which are processed asynchronously without blocking the calling process.
.B io_uring
gets its name from ring buffers which are shared between user space and
kernel space.
This arrangement allows for efficient I/O,
while avoiding the overhead of copying buffers between them,
where possible.
This interface makes
.B io_uring
different from other UNIX I/O APIs:
rather than communicating between kernel and user space solely through
system calls,
it uses these ring buffers as the main mode of communication.
This arrangement has various performance benefits, which are discussed in a
separate section below.
This man page uses the terms shared buffers, shared ring buffers and
queues interchangeably.
.PP
The general programming model you need to follow for
.B io_uring
is outlined below:
.IP \(bu
Set up shared buffers with
.BR io_uring_setup (2)
and
.BR mmap (2),
mapping into user space shared buffers for the submission queue (SQ) and the
completion queue (CQ).
You place the I/O requests you want to make on the SQ,
while the kernel places the results of those operations on the CQ.
.IP \(bu
For every I/O request you need to make (such as reading a file,
writing a file, or accepting a socket connection),
you create a submission queue entry, or SQE,
describing the I/O operation you want done,
and add it to the tail of the submission queue (SQ).
Each I/O operation is,
in essence,
the equivalent of the system call you would otherwise have made
if you were not using
.BR io_uring .
You can add more than one SQE to the queue depending on the number of
operations you want to request.
.IP \(bu
After you add one or more SQEs,
you need to call
.BR io_uring_enter (2)
to tell the kernel to dequeue your I/O requests off the SQ and begin
processing them.
.IP \(bu
For each SQE you submit,
once the kernel is done processing the request,
it places a completion queue event, or CQE, at the tail of the
completion queue, or CQ.
The kernel places exactly one matching CQE in the CQ for every SQE you
submit on the SQ.
After you retrieve a CQE,
you will minimally want to check its
.I res
field,
which corresponds to the return value of the equivalent system call,
had you used it directly without
.BR io_uring .
For instance,
a read operation under
.BR io_uring ,
started with the
.BR IORING_OP_READ
operation,
issues the equivalent of the
.BR read (2)
system call,
and returns as part of
.I res
what
.BR read (2)
would have returned if called directly,
without using
.BR io_uring .
.IP \(bu
Optionally,
.BR io_uring_enter (2)
can also wait for a specified number of requests to be processed by the kernel
before it returns.
If you specified a certain number of completions to wait for,
the kernel will have placed at least that many CQEs on the CQ,
which you can then read,
right after the return from
.BR io_uring_enter (2).
.IP \(bu
It is important to remember that I/O requests submitted to the kernel can
complete in any order.
It is not necessary for the kernel to process one request after another,
in the order you placed them.
Given that the interface is a ring, the requests are attempted in order;
however, that doesn't imply any ordering of their completion.
When more than one request is in flight, it is not possible
to determine which one will complete first.
When you dequeue CQEs off the CQ,
you should always check which submitted request each one corresponds to.
The most common way to do so is to use the
.I user_data
field in the request, which is passed back on the completion side.
.PP
Adding to and reading from the queues:
.IP \(bu
You add SQEs to the tail of the SQ.
The kernel reads SQEs off the head of the queue.
.IP \(bu
The kernel adds CQEs to the tail of the CQ.
You read CQEs off the head of the queue.
.SS Submission queue polling
One of the goals of
.B io_uring
is to provide a means for efficient I/O.
To this end,
.B io_uring
supports a polling mode that lets you avoid the call to
.BR io_uring_enter (2),
which you would otherwise use to inform the kernel that you have queued
SQEs onto the SQ.
With SQ Polling,
.B io_uring
starts a kernel thread that polls the submission queue for any I/O
requests you submit by adding SQEs.
With SQ Polling enabled,
there is no need for you to call
.BR io_uring_enter (2),
letting you avoid the overhead of system calls.
The designated kernel thread dequeues SQEs off the SQ as you add them and
dispatches them for asynchronous processing.
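.PP
A minimal sketch of enabling SQ polling follows.
It assumes that io_uring_setup() and io_uring_enter() are thin
application-defined wrappers around the corresponding system calls
(as in the examples below),
and that
.I sring_flags
was saved during setup as a pointer to the SQ ring flags field located at
offset
.IR p.sq_off.flags .
Note that enabling SQ polling may require elevated privileges,
depending on the kernel version.
.PP
.in +4n
.EX
struct io_uring_params p;

memset(&p, 0, sizeof(p));
/* Ask for a kernel SQ polling thread; it goes idle after
 * sq_thread_idle milliseconds without new SQEs and must then
 * be woken up again.
 */
p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;
ring_fd = io_uring_setup(QUEUE_DEPTH, &p);

/* ... map the rings and add SQEs as shown below ... */

/* Only enter the kernel if the SQ polling thread went to sleep.
 * Real code should read the flags with acquire semantics; see
 * the discussion of memory ordering below.
 */
if (*sring_flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
.EE
.in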
.SS Setting up io_uring
.PP
The following example function sets up
.B io_uring
with a QUEUE_DEPTH deep submission queue.
Pay attention to the two
.BR mmap (2)
calls that set up the shared submission and completion queues.
If your kernel is older than version 5.4,
three
.BR mmap (2)
calls are required.
.PP
.EX
int app_setup_uring(void) {
    struct io_uring_params p;
    void *sq_ptr, *cq_ptr;

    /* See io_uring_setup(2) for io_uring_params.flags you can set */
    memset(&p, 0, sizeof(p));
    ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
    if (ring_fd < 0) {
        perror("io_uring_setup");
        return 1;
    }

    /*
     * io_uring communication happens via 2 shared kernel-user space ring
     * buffers, which can be jointly mapped with a single mmap() call in
     * kernels >= 5.4.
     */

    int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

    /* Rather than check for kernel version, the recommended way is to
     * check the features field of the io_uring_params structure, which is a
     * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
     * second mmap() call to map in the completion ring separately.
     */
    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        if (cring_sz > sring_sz)
            sring_sz = cring_sz;
        cring_sz = sring_sz;
    }

    /* Map in the submission and completion queue ring buffers.
     * Kernels < 5.4 only map in the submission queue, though.
     */
    sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE,
                  ring_fd, IORING_OFF_SQ_RING);
    if (sq_ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        cq_ptr = sq_ptr;
    } else {
        /* Map in the completion queue ring buffer in older kernels separately */
        cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring_fd, IORING_OFF_CQ_RING);
        if (cq_ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }
    /* Save useful fields for later easy reference */
    sring_tail = sq_ptr + p.sq_off.tail;
    sring_mask = sq_ptr + p.sq_off.ring_mask;
    sring_array = sq_ptr + p.sq_off.array;

    /* Map in the submission queue entries array */
    sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                ring_fd, IORING_OFF_SQES);
    if (sqes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Save useful fields for later easy reference */
    cring_head = cq_ptr + p.cq_off.head;
    cring_tail = cq_ptr + p.cq_off.tail;
    cring_mask = cq_ptr + p.cq_off.ring_mask;
    cqes = cq_ptr + p.cq_off.cqes;

    return 0;
}
.EE
.in

.SS Submitting I/O requests
The process of submitting a request consists of describing the I/O
operation you need to get done using an
.B io_uring_sqe
structure instance.
These details describe the equivalent system call and its parameters.
Because the range of I/O operations Linux supports is very varied and the
.B io_uring_sqe
structure needs to be able to describe them,
it has several fields,
some packed into unions for space efficiency.
Here is a simplified version of struct
.B io_uring_sqe
with some of the most often used fields:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __s32   fd;         /* file descriptor to do IO on */
    __u64   off;        /* offset into file */
    __u64   addr;       /* pointer to buffer or iovecs */
    __u32   len;        /* buffer size or number of iovecs */
    __u64   user_data;  /* data to be passed back at completion time */
    __u8    flags;      /* IOSQE_ flags */
    ...
};
.EE
.in

Here is struct
.B io_uring_sqe
in full:

.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __u8    flags;      /* IOSQE_ flags */
    __u16   ioprio;     /* ioprio for the request */
    __s32   fd;         /* file descriptor to do IO on */
    union {
        __u64   off;    /* offset into file */
        __u64   addr2;
    };
    union {
        __u64   addr;   /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;        /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;    /* compatibility */
        __u32           poll32_events;  /* word-reversed for BE */
        __u32           sync_range_flags;
        __u32           msg_flags;
        __u32           timeout_flags;
        __u32           accept_flags;
        __u32           cancel_flags;
        __u32           open_flags;
        __u32           statx_flags;
        __u32           fadvise_advice;
        __u32           splice_flags;
    };
    __u64   user_data;  /* data to be passed back at completion time */
    union {
        struct {
            /* pack this to avoid bogus arm OABI complaints */
            union {
                /* index into fixed buffers, if used */
                __u16   buf_index;
                /* for grouped buffer selection */
                __u16   buf_group;
            } __attribute__((packed));
            /* personality to use, if used */
            __u16   personality;
            __s32   splice_fd_in;
        };
        __u64   __pad2[3];
    };
};
.EE
.in
.PP
To submit an I/O request to
.BR io_uring ,
you need to acquire a submission queue entry (SQE) from the submission
queue (SQ),
fill it up with details of the operation you want to submit and call
.BR io_uring_enter (2).
If you want to avoid calling
.BR io_uring_enter (2),
you have the option of setting up Submission Queue Polling.
.PP
SQEs are added to the tail of the submission queue.
The kernel picks up SQEs off the head of the SQ.
The general algorithm to get the next available SQE and update the tail is
as follows.
.PP
.in +4n
.EX
struct io_uring_sqe *sqe;
unsigned tail, index;
tail = *sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* fill up details about this I/O request */
describe_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
atomic_store_release(sqring->tail, tail);
.EE
.in
.PP
To get the index of an entry,
the application must mask the current tail index with the size mask of the
ring.
This holds true for both SQs and CQs.
Once the SQE is acquired,
the necessary fields are filled in,
describing the request.
While the CQ ring directly indexes the shared array of CQEs,
the submission side has an indirection array between them.
The submission ring buffer contains indexes into this array,
which in turn contains the indexes of the SQEs.
.PP
The following code snippet demonstrates how a read operation,
the equivalent of a
.BR preadv2 (2)
system call,
is described by filling in an SQE with the necessary parameters.
.PP
.in +4n
.EX
struct iovec iovecs[16];
 ...
sqe->opcode = IORING_OP_READV;
sqe->fd = fd;
sqe->addr = (unsigned long) iovecs;
sqe->len = 16;
sqe->off = offset;
sqe->flags = 0;
.EE
.in
.TP
.B Memory ordering
To optimize performance,
modern compilers and CPUs freely reorder reads and writes as long as this
does not affect the program's outcome.
Some aspects of this need to be kept in mind on SMP systems since
.B io_uring
involves buffers shared between kernel and user space.
These buffers are both visible and modifiable from kernel and user space.
As the heads and tails of these shared buffers are updated by both the
kernel and user space,
the changes need to be coherently visible on either side,
irrespective of which CPU the kernel or the application happens to be
running on.
We use memory barriers to enforce this coherency.
Memory barriers are a significant subject on their own and a detailed
discussion of them is out of scope for this man page.
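.PP
The atomic_load_acquire() and atomic_store_release() helpers used in the
code snippets on this page are not standard C functions.
One possible way to define them for the unsigned ring indices used here,
loosely mirroring what the liburing library does internally,
is with C11 atomics; treat this as an illustrative sketch:
.PP
.in +4n
.EX
#include <stdatomic.h>

/* Acquire load: reads issued after it cannot be reordered before it */
static inline unsigned atomic_load_acquire(const unsigned *p)
{
    /* Casting the plain pointer to an atomic type mirrors liburing */
    return atomic_load_explicit((const _Atomic unsigned *) p,
                                memory_order_acquire);
}

/* Release store: writes issued before it cannot be reordered after it */
static inline void atomic_store_release(unsigned *p, unsigned v)
{
    atomic_store_explicit((_Atomic unsigned *) p, v,
                          memory_order_release);
}
.EE
.in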
.TP
.B Letting the kernel know about I/O submissions
Once you place one or more SQEs onto the SQ,
you need to let the kernel know that you've done so.
You can do this by calling the
.BR io_uring_enter (2)
system call.
This system call is also capable of waiting for a specified count of
events to complete before it returns.
This way,
you can be sure to find completion events in the completion queue without
having to poll it for events later.
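.PP
For example,
assuming
.I ring_fd
is the ring file descriptor returned by
.BR io_uring_setup (2)
and io_uring_enter() is a thin application-defined wrapper around the
.BR io_uring_enter (2)
system call,
submitting the SQEs you have queued and waiting for at least one of them
to complete might look like this sketch:
.PP
.in +4n
.EX
/* Submit 'to_submit' queued SQEs and wait for at least one completion */
int ret = io_uring_enter(ring_fd, to_submit, 1, IORING_ENTER_GETEVENTS);
if (ret < 0) {
    perror("io_uring_enter");
    return 1;
}
/* At least one CQE is now available for reading off the CQ */
.EE
.in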
.SS Reading completion events
Similar to the submission queue (SQ),
the completion queue (CQ) is a shared buffer between the kernel and user
space.
Whereas you placed submission queue entries on the tail of the SQ and the
kernel read them off the head,
when it comes to the CQ,
the kernel places completion queue events or CQEs on the tail of the CQ and
you read them off its head.
.PP
Submission is flexible (and thus a bit more complicated) since it needs to
be able to encode different types of system calls that take various
parameters.
Completion,
on the other hand,
is simpler since we're looking only for a return value
back from the kernel.
This is easily understood by looking at the completion queue event
structure,
struct
.BR io_uring_cqe :
.PP
.in +4n
.EX
struct io_uring_cqe {
    __u64   user_data;  /* sqe->user_data value, passed back */
    __s32   res;        /* result code for this event */
    __u32   flags;
};
.EE
.in
.PP
Here,
.I user_data
is custom data that is passed unchanged from submission to completion;
that is,
from SQEs to CQEs.
This field can be used to set context,
uniquely identifying submissions that have completed.
Given that I/O requests can complete in any order,
this field can be used to correlate a submission with a completion.
.I res
is the result from the system call that was performed as part of the
submission;
its return value.
The
.I flags
field could carry request-specific metadata in the future,
but is currently unused.
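.PP
A common pattern,
sketched below,
is to store a pointer to an application-level request structure in
.I user_data
when filling in the SQE and to recover it from the CQE;
the struct app_request type and the get_next_request() helper are
hypothetical and shown purely for illustration:
.PP
.in +4n
.EX
/* Submission side: remember which application request this SQE is for */
struct app_request *req = get_next_request();
sqe->user_data = (__u64) (unsigned long) req;

/* Completion side: recover the application request for this CQE */
struct app_request *done;
done = (struct app_request *) (unsigned long) cqe->user_data;
.EE
.in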
.PP
The general sequence to read completion events off the completion queue is
as follows:
.PP
.in +4n
.EX
unsigned head;
head = *cqring->head;

if (head != atomic_load_acquire(cqring->tail)) {
    struct io_uring_cqe *cqe;
    unsigned index;
    index = head & (cqring->mask);
    cqe = &cqring->cqes[index];
    /* process completed CQE */
    process_cqe(cqe);
    /* CQE consumption complete */
    head++;
}
atomic_store_release(cqring->head, head);
.EE
.in
.PP
It helps to be reminded that the kernel adds CQEs to the tail of the CQ,
while you need to dequeue them off the head.
To get the index of an entry at the head,
the application must mask the current head index with the size mask of the
ring.
Once the CQE has been consumed or processed,
the head needs to be updated to reflect the consumption of the CQE.
Attention should be paid to the read and write barriers to ensure
successful read and update of the head.
.SS io_uring performance
Because of the shared ring buffers between kernel and user space,
.B io_uring
can be a zero-copy system.
Copying buffers to and fro becomes necessary when system calls that
transfer data between kernel and user space are involved.
But since the bulk of the communication in
.B io_uring
is via buffers shared between the kernel and user space,
this huge performance overhead is completely avoided.
.PP
While system calls may not seem like a significant overhead,
in high performance applications,
making a lot of them begins to matter.
The workarounds the operating system has in place to deal with Spectre
and Meltdown would ideally not be needed,
but unfortunately some of them sit around the system call interface,
making system calls not as cheap as before on affected hardware.
While newer hardware should not need these workarounds,
hardware with these vulnerabilities can be expected to be in the wild for a
long time.
With synchronous programming interfaces,
and even with the other asynchronous programming interfaces under Linux,
there is at least one system call involved in the submission of each
request.
In
.BR io_uring ,
on the other hand,
you can batch several requests in one go,
simply by queueing up multiple SQEs,
each describing an I/O operation you want,
and making a single call to
.BR io_uring_enter (2).
This is possible due to
.BR io_uring 's
shared buffers based design.
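.PP
As a rough sketch of such batching,
assuming a hypothetical helper queue_read_sqe() that fills in one SQE and
advances the SQ tail as shown in the submission example above,
several operations can be submitted with a single system call:
.PP
.in +4n
.EX
int i;

/* Queue up a batch of read requests without entering the kernel */
for (i = 0; i < BATCH_SIZE; i++)
    queue_read_sqe(fds[i], buffers[i]);

/* A single system call submits the whole batch */
io_uring_enter(ring_fd, BATCH_SIZE, 0, 0);
.EE
.in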
.PP
While this batching in itself can avoid the overhead associated with
potentially multiple and frequent system calls,
you can reduce even this overhead further with Submission Queue Polling,
by having the kernel poll and pick up your SQEs for processing as you add
them to the submission queue.
This avoids the
.BR io_uring_enter (2)
call you need to make to tell the kernel to pick SQEs up.
For high-performance applications,
this means even lower system call overhead.
.SH CONFORMING TO
.B io_uring
is Linux-specific.
.SH SEE ALSO
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)