.\" Copyright (C) 2020 Shuveb Hussain <shuveb@gmail.com>
.\" SPDX-License-Identifier: LGPL-2.0-or-later
.\"

.TH IO_URING 7 2020-07-26 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring \- Asynchronous I/O facility
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific API for asynchronous I/O.
It allows the user to submit one or more I/O requests,
which are processed asynchronously without blocking the calling process.
.B io_uring
gets its name from the ring buffers which are shared between user space and
kernel space.
This arrangement allows for efficient I/O,
while avoiding the overhead of copying buffers between them,
where possible.
This interface makes
.B io_uring
different from other UNIX I/O APIs:
rather than communicating between kernel and user space only through
system calls,
ring buffers are used as the main mode of communication.
This arrangement has various performance benefits which are discussed in a
separate section below.
This man page uses the terms shared buffers, shared ring buffers and
queues interchangeably.
.PP
The general programming model you need to follow for
.B io_uring
is outlined below.
.IP \(bu
Set up shared buffers with
.BR io_uring_setup (2)
and
.BR mmap (2),
mapping into user space shared buffers for the submission queue (SQ) and the
completion queue (CQ).
You place I/O requests you want to make on the SQ,
while the kernel places the results of those operations on the CQ.
.IP \(bu
For every I/O request you need to make (such as reading a file,
writing a file,
or accepting a socket connection),
you create a submission queue entry,
or SQE,
describe the I/O operation you need to get done and add it to the tail of
the submission queue (SQ).
Each I/O operation is,
in essence,
the equivalent of a system call you would have made otherwise,
if you were not using
.BR io_uring .
You can add more than one SQE to the queue depending on the number of
operations you want to request.
.IP \(bu
After you add one or more SQEs,
you need to call
.BR io_uring_enter (2)
to tell the kernel to dequeue your I/O requests off the SQ and begin
processing them.
.IP \(bu
For each SQE you submit,
once it is done processing the request,
the kernel places a completion queue event or CQE at the tail of the
completion queue or CQ.
The kernel places exactly one matching CQE in the CQ for every SQE you
submit on the SQ.
After you retrieve a CQE,
minimally,
you might be interested in checking the
.I res
field of the CQE structure,
which corresponds to the return value of the equivalent system call,
had you used it directly without
.BR io_uring .
For instance,
a read operation under
.BR io_uring ,
started with the
.BR IORING_OP_READ
operation,
issues the equivalent of the
.BR read (2)
system call and returns as part of
.I res
what
.BR read (2)
would have returned if called directly,
without using
.BR io_uring .
.IP \(bu
Optionally,
.BR io_uring_enter (2)
can also wait for a specified number of requests to be processed by the kernel
before it returns.
If you specified a certain number of completions to wait for,
the kernel would have placed at least that many CQEs on the CQ,
which you can then readily read,
right after the return from
.BR io_uring_enter (2).
.IP \(bu
It is important to remember that I/O requests submitted to the kernel can
complete in any order.
It is not necessary for the kernel to process one request after another,
in the order you placed them.
Given that the interface is a ring,
the requests are attempted in order;
however, that does not imply any sort of ordering on their completion.
When more than one request is in flight,
it is not possible to determine which one will complete first.
When you dequeue CQEs off the CQ,
you should always check which submitted request it corresponds to.
The most common method for doing so is utilizing the
.I user_data
field in the request, which is passed back on the completion side.
.PP
Adding to and reading from the queues:
.IP \(bu
You add SQEs to the tail of the SQ.
The kernel reads SQEs off the head of the queue.
.IP \(bu
The kernel adds CQEs to the tail of the CQ.
You read CQEs off the head of the queue.
.SS Submission queue polling
One of the goals of
.B io_uring
is to provide a means for efficient I/O.
To this end,
.B io_uring
supports a polling mode that lets you avoid the call to
.BR io_uring_enter (2),
which you use to inform the kernel that you have queued SQEs onto the SQ.
With SQ Polling,
.B io_uring
starts a kernel thread that polls the submission queue for any I/O
requests you submit by adding SQEs.
With SQ Polling enabled,
there is no need for you to call
.BR io_uring_enter (2),
letting you avoid the overhead of system calls.
A designated kernel thread dequeues SQEs off the SQ as you add them and
dispatches them for asynchronous processing.
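.PP
The following is a minimal sketch of how SQ Polling might be requested at
ring setup time.
It reuses the
.BR io_uring_setup ()
wrapper and the QUEUE_DEPTH constant from the example program at the end of
this page; the idle timeout chosen here is an illustrative assumption,
not a requirement.
.PP
.in +4n
.EX
struct io_uring_params p;

memset(&p, 0, sizeof(p));
/* Ask the kernel to start an SQ polling thread */
p.flags = IORING_SETUP_SQPOLL;
/* Illustrative value: let the thread idle for 2000 ms before sleeping */
p.sq_thread_idle = 2000;

int ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
if (ring_fd < 0)
    /* older kernels may require privileges for SQPOLL */
    perror("io_uring_setup");
.EE
.in
.PP
Note that on older kernels SQ Polling may place additional requirements on
how requests are submitted (for example, the use of registered files);
consult
.BR io_uring_setup (2)
for the details that apply to your kernel.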
.SS Setting up io_uring
.PP
The main steps in setting up
.B io_uring
consist of mapping in the shared buffers with
.BR mmap (2)
calls.
In the example program included in this man page,
the function
.BR app_setup_uring ()
sets up
.B io_uring
with a QUEUE_DEPTH deep submission queue.
Pay attention to the two
.BR mmap (2)
calls that set up the shared submission and completion queues.
If your kernel is older than version 5.4,
three
.BR mmap (2)
calls are required.
.PP
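A compressed sketch of those calls,
adapted from
.BR app_setup_uring ()
in the example program below
(error handling omitted,
and assuming the kernel advertises IORING_FEAT_SINGLE_MMAP),
looks like this:
.PP
.in +4n
.EX
/* ring_sz: illustrative name for the larger of the SQ and CQ ring sizes */

/* one mapping covers both ring headers on kernels >= 5.4 */
sq_ptr = mmap(0, ring_sz, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
cq_ptr = sq_ptr;

/* separate mapping for the array of submission queue entries */
sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
            PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
            ring_fd, IORING_OFF_SQES);
.EE
.in
.PP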
.SS Submitting I/O requests
The process of submitting a request consists of describing the I/O
operation you need to get done using an
.B io_uring_sqe
structure instance.
These details describe the equivalent system call and its parameters.
Because the range of I/O operations Linux supports is very varied and the
.B io_uring_sqe
structure needs to be able to describe them,
it has several fields,
some packed into unions for space efficiency.
Here is a simplified version of struct
.B io_uring_sqe
with some of the most often used fields:
.PP
.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;         /* type of operation for this sqe */
    __s32   fd;             /* file descriptor to do IO on */
    __u64   off;            /* offset into file */
    __u64   addr;           /* pointer to buffer or iovecs */
    __u32   len;            /* buffer size or number of iovecs */
    __u64   user_data;      /* data to be passed back at completion time */
    __u8    flags;          /* IOSQE_ flags */
    ...
};
.EE
.in

Here is struct
.B io_uring_sqe
in full:

.in +4n
.EX
struct io_uring_sqe {
    __u8    opcode;         /* type of operation for this sqe */
    __u8    flags;          /* IOSQE_ flags */
    __u16   ioprio;         /* ioprio for the request */
    __s32   fd;             /* file descriptor to do IO on */
    union {
        __u64   off;        /* offset into file */
        __u64   addr2;
    };
    union {
        __u64   addr;       /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;            /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;    /* compatibility */
        __u32           poll32_events;  /* word-reversed for BE */
        __u32           sync_range_flags;
        __u32           msg_flags;
        __u32           timeout_flags;
        __u32           accept_flags;
        __u32           cancel_flags;
        __u32           open_flags;
        __u32           statx_flags;
        __u32           fadvise_advice;
        __u32           splice_flags;
    };
    __u64   user_data;      /* data to be passed back at completion time */
    union {
        struct {
            /* pack this to avoid bogus arm OABI complaints */
            union {
                /* index into fixed buffers, if used */
                __u16   buf_index;
                /* for grouped buffer selection */
                __u16   buf_group;
            } __attribute__((packed));
            /* personality to use, if used */
            __u16   personality;
            __s32   splice_fd_in;
        };
        __u64   __pad2[3];
    };
};
.EE
.in
.PP
To submit an I/O request to
.BR io_uring ,
you need to acquire a submission queue entry (SQE) from the submission
queue (SQ),
fill it up with details of the operation you want to submit and call
.BR io_uring_enter (2).
If you want to avoid calling
.BR io_uring_enter (2),
you have the option of setting up Submission Queue Polling.
.PP
SQEs are added to the tail of the submission queue.
The kernel picks up SQEs off the head of the SQ.
The general algorithm to get the next available SQE and update the tail is
as follows.
.PP
.in +4n
.EX
struct io_uring_sqe *sqe;
unsigned tail, index;
tail = *sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* fill up details about this I/O request */
describe_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
atomic_store_release(sqring->tail, tail);
.EE
.in
.PP
To get the index of an entry,
the application must mask the current tail index with the size mask of the
ring.
This holds true for both SQs and CQs.
Once the SQE is acquired,
the necessary fields are filled in,
describing the request.
While the CQ ring directly indexes the shared array of CQEs,
the submission side has an indirection array between them.
The submission side ring buffer is an index into this array,
which in turn contains the index into the SQEs.
.PP
The following code snippet demonstrates how a read operation,
an equivalent of a
.BR preadv2 (2)
system call,
is described by filling up an SQE with the necessary parameters.
.PP
.in +4n
.EX
struct iovec iovecs[16];
\&...
sqe->opcode = IORING_OP_READV;
sqe->fd = fd;
sqe->addr = (unsigned long) iovecs;
sqe->len = 16;
sqe->off = offset;
sqe->flags = 0;
.EE
.in
.TP
.B Memory ordering
To optimize performance,
modern compilers and CPUs freely reorder reads and writes when doing so
does not affect the program's outcome.
Some aspects of this need to be kept in mind on SMP systems since
.B io_uring
involves buffers shared between kernel and user space.
These buffers are both visible and modifiable from kernel and user space.
As heads and tails belonging to these shared buffers are updated by kernel
and user space,
changes need to be coherently visible on either side,
irrespective of whether a CPU switch took place after the kernel-user mode
switch happened.
We use memory barriers to enforce this coherency.
Memory barriers are a significant subject in their own right,
and a detailed treatment is beyond the scope of this man page.
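.PP
In C,
the acquire and release semantics needed on the shared head and tail
indices can be expressed with C11 atomics.
The following macros,
also used by the example program at the end of this page,
are one way of doing so:
.PP
.in +4n
.EX
#include <stdatomic.h>

/* publish a new tail (or head) so the other side sees all prior writes */
#define io_uring_smp_store_release(p, v)                    \\
    atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \\
                          memory_order_release)

/* read the other side's index and all writes published before it */
#define io_uring_smp_load_acquire(p)                        \\
    atomic_load_explicit((_Atomic typeof(*(p)) *)(p),       \\
                         memory_order_acquire)
.EE
.in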
.TP
.B Letting the kernel know about I/O submissions
Once you place one or more SQEs onto the SQ,
you need to let the kernel know that you've done so.
You can do this by calling the
.BR io_uring_enter (2)
system call.
This system call is also capable of waiting for a specified count of
events to complete.
This way,
you can be sure to find completion events in the completion queue without
having to poll it for events later.
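.PP
As a sketch,
using the
.BR io_uring_enter ()
wrapper from the example program at the end of this page
(where
.I to_submit
stands for the number of SQEs you have just queued),
submitting those entries and waiting for at least one of them to complete
might look like this:
.PP
.in +4n
.EX
/* submit 'to_submit' SQEs, wait until at least 1 CQE is available */
int ret = io_uring_enter(ring_fd, to_submit, 1,
                         IORING_ENTER_GETEVENTS);
if (ret < 0)
    perror("io_uring_enter");
.EE
.in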
.SS Reading completion events
Similar to the submission queue (SQ),
the completion queue (CQ) is a shared buffer between the kernel and user
space.
Whereas you placed submission queue entries on the tail of the SQ and the
kernel read off the head,
when it comes to the CQ,
the kernel places completion queue events or CQEs on the tail of the CQ and
you read off its head.
.PP
Submission is flexible (and thus a bit more complicated) since it needs to
be able to encode different types of system calls that take various
parameters.
Completion,
on the other hand,
is simpler since we're looking only for a return value
back from the kernel.
This is easily understood by looking at the completion queue event
structure,
struct
.BR io_uring_cqe :
.PP
.in +4n
.EX
struct io_uring_cqe {
    __u64   user_data;  /* sqe->user_data value, passed back */
    __s32   res;        /* result code for this event */
    __u32   flags;
};
.EE
.in
.PP
Here,
.I user_data
is custom data that is passed unchanged from submission to completion.
That is,
from SQEs to CQEs.
This field can be used to set context,
uniquely identifying submissions that got completed.
Given that I/O requests can complete in any order,
this field can be used to correlate a submission with a completion.
.I res
is the result from the system call that was performed as part of the
submission;
that is, its return value.
The
.I flags
field could carry request-specific metadata in the future,
but is currently unused.
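.PP
For instance,
one minimal way of correlating completions with submissions is to store a
pointer to your own per-request state in
.I user_data
and cast it back when the CQE arrives.
The
.I struct my_request
type in this sketch is purely illustrative:
.PP
.in +4n
.EX
/* on the submission side */
struct my_request *req = malloc(sizeof(*req));
/* ... fill in req ... */
sqe->user_data = (__u64) (uintptr_t) req;

/* on the completion side */
struct my_request *done =
        (struct my_request *) (uintptr_t) cqe->user_data;
.EE
.in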
.PP
The general sequence to read completion events off the completion queue is
as follows:
.PP
.in +4n
.EX
unsigned head;
head = *cqring->head;
if (head != atomic_load_acquire(cqring->tail)) {
    struct io_uring_cqe *cqe;
    unsigned index;
    index = head & (cqring->mask);
    cqe = &cqring->cqes[index];
    /* process completed CQE */
    process_cqe(cqe);
    /* CQE consumption complete */
    head++;
}
atomic_store_release(cqring->head, head);
.EE
.in
.PP
It helps to remember that the kernel adds CQEs to the tail of the CQ,
while you dequeue them off the head.
To get the index of an entry at the head,
the application must mask the current head index with the size mask of the
ring.
Once the CQE has been consumed or processed,
the head needs to be updated to reflect the consumption of the CQE.
Pay attention to the read and write barriers to ensure
a successful read and update of the head.
.SS io_uring performance
Because of the shared ring buffers between kernel and user space,
.B io_uring
can be a zero-copy system.
Copying buffers to and fro becomes necessary when system calls that
transfer data between kernel and user space are involved.
But since the bulk of the communication in
.B io_uring
is via buffers shared between the kernel and user space,
this huge performance overhead is completely avoided.
.PP
While system calls may not seem like a significant overhead,
in high performance applications,
making a lot of them will begin to matter.
While workarounds the operating system has in place to deal with Spectre
and Meltdown are ideally best done away with,
unfortunately,
some of these workarounds involve the system call interface,
making system calls not as cheap as before on affected hardware.
While newer hardware should not need these workarounds,
hardware with these vulnerabilities can be expected to be in the wild for a
long time.
With synchronous programming interfaces,
or even with other asynchronous programming interfaces under Linux,
there is at least one system call involved in the submission of each
request.
In
.BR io_uring ,
on the other hand,
you can batch several requests in one go,
simply by queueing up multiple SQEs,
each describing an I/O operation you want,
and making a single call to
.BR io_uring_enter (2).
This is possible due to
.BR io_uring 's
shared buffers based design.
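.PP
A sketch of this batching pattern,
reusing the tail-update steps shown earlier
(the
.BR describe_io ()
helper and the
.I nr_requests
variable are illustrative),
might look like this:
.PP
.in +4n
.EX
unsigned tail = *sqring->tail;

for (int i = 0; i < nr_requests; i++) {
    unsigned index = tail & (*sqring->ring_mask);
    struct io_uring_sqe *sqe = &sqring->sqes[index];

    describe_io(sqe);               /* fill in one I/O request */
    sqring->array[index] = index;
    tail++;
}
/* publish all new SQEs at once */
atomic_store_release(sqring->tail, tail);

/* one system call submits the whole batch */
io_uring_enter(ring_fd, nr_requests, 0, 0);
.EE
.in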
.PP
While this batching in itself can avoid the overhead associated with
potentially multiple and frequent system calls,
you can reduce even this overhead further with Submission Queue Polling,
by having the kernel poll and pick up your SQEs for processing as you add
them to the submission queue.
This avoids the
.BR io_uring_enter (2)
call you need to make to tell the kernel to pick SQEs up.
For high-performance applications,
this means even lower system call overhead.
.SH CONFORMING TO
.B io_uring
is Linux-specific.
.SH EXAMPLES
The following example uses
.B io_uring
to copy stdin to stdout.
Using shell redirection,
you should be able to copy files with this example.
Because it uses a queue depth of only one,
this example processes I/O requests one after the other.
It is purposefully kept this way to aid understanding.
In real-world scenarios, however,
you'll want to have a larger queue depth to parallelize I/O request
processing so as to gain the kind of performance benefits
.B io_uring
provides with its asynchronous processing of requests.
.PP
.EX
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdatomic.h>

#include <linux/io_uring.h>

#define QUEUE_DEPTH 1
#define BLOCK_SZ    1024

/* Macros for barriers needed by io_uring */
#define io_uring_smp_store_release(p, v)                    \\
    atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \\
                          memory_order_release)
#define io_uring_smp_load_acquire(p)                        \\
    atomic_load_explicit((_Atomic typeof(*(p)) *)(p),       \\
                         memory_order_acquire)

int ring_fd;
unsigned *sring_tail, *sring_mask, *sring_array,
         *cring_head, *cring_tail, *cring_mask;
struct io_uring_sqe *sqes;
struct io_uring_cqe *cqes;
char buff[BLOCK_SZ];
off_t offset;

/*
 * System call wrappers provided since glibc does not yet
 * provide wrappers for io_uring system calls.
 */

int io_uring_setup(unsigned entries, struct io_uring_params *p)
{
    return (int) syscall(__NR_io_uring_setup, entries, p);
}

int io_uring_enter(int ring_fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags)
{
    return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                         min_complete, flags, NULL, 0);
}

int app_setup_uring(void) {
    struct io_uring_params p;
    void *sq_ptr, *cq_ptr;

    /* See io_uring_setup(2) for io_uring_params.flags you can set */
    memset(&p, 0, sizeof(p));
    ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
    if (ring_fd < 0) {
        perror("io_uring_setup");
        return 1;
    }

    /*
     * io_uring communication happens via 2 shared kernel-user space ring
     * buffers, which can be jointly mapped with a single mmap() call in
     * kernels >= 5.4.
     */

    int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

    /* Rather than check for kernel version, the recommended way is to
     * check the features field of the io_uring_params structure, which is a
     * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
     * second mmap() call to map in the completion ring separately.
     */
    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        if (cring_sz > sring_sz)
            sring_sz = cring_sz;
        cring_sz = sring_sz;
    }

    /* Map in the submission and completion queue ring buffers.
     * Kernels < 5.4 only map in the submission queue, though.
     */
    sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE,
                  ring_fd, IORING_OFF_SQ_RING);
    if (sq_ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (p.features & IORING_FEAT_SINGLE_MMAP) {
        cq_ptr = sq_ptr;
    } else {
        /* Map in the completion queue ring buffer in older kernels separately */
        cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring_fd, IORING_OFF_CQ_RING);
        if (cq_ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }
    /* Save useful fields for later easy reference */
    sring_tail = sq_ptr + p.sq_off.tail;
    sring_mask = sq_ptr + p.sq_off.ring_mask;
    sring_array = sq_ptr + p.sq_off.array;

    /* Map in the submission queue entries array */
    sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                ring_fd, IORING_OFF_SQES);
    if (sqes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Save useful fields for later easy reference */
    cring_head = cq_ptr + p.cq_off.head;
    cring_tail = cq_ptr + p.cq_off.tail;
    cring_mask = cq_ptr + p.cq_off.ring_mask;
    cqes = cq_ptr + p.cq_off.cqes;

    return 0;
}

/*
 * Read from completion queue.
 * In this function, we read completion events from the completion queue.
 * We dequeue the CQE, update the head and return the result of the operation.
 */

int read_from_cq(void) {
    struct io_uring_cqe *cqe;
    unsigned head;

    /* Read barrier */
    head = io_uring_smp_load_acquire(cring_head);
    /*
     * Remember, this is a ring buffer. If head == tail, it means that the
     * buffer is empty.
     */
    if (head == *cring_tail)
        return -1;

    /* Get the entry */
    cqe = &cqes[head & (*cring_mask)];
    if (cqe->res < 0)
        fprintf(stderr, "Error: %s\\n", strerror(abs(cqe->res)));

    head++;

    /* Write barrier so that updates to the head are made visible */
    io_uring_smp_store_release(cring_head, head);

    return cqe->res;
}

/*
 * Submit a read or a write request to the submission queue.
 */

int submit_to_sq(int fd, int op) {
    unsigned index, tail;

    /* Add our submission queue entry to the tail of the SQE ring buffer */
    tail = *sring_tail;
    index = tail & *sring_mask;
    struct io_uring_sqe *sqe = &sqes[index];
    /* Fill in the parameters required for the read or write operation */
    sqe->opcode = op;
    sqe->fd = fd;
    sqe->addr = (unsigned long) buff;
    if (op == IORING_OP_READ) {
        memset(buff, 0, sizeof(buff));
        sqe->len = BLOCK_SZ;
    } else {
        sqe->len = strlen(buff);
    }
    sqe->off = offset;

    sring_array[index] = index;
    tail++;

    /* Update the tail */
    io_uring_smp_store_release(sring_tail, tail);

    /*
     * Tell the kernel we have submitted events with the io_uring_enter()
     * system call. We also pass in the IORING_ENTER_GETEVENTS flag which
     * causes the io_uring_enter() call to wait until min_complete
     * (the 3rd param) events complete.
     */
    int ret = io_uring_enter(ring_fd, 1, 1,
                             IORING_ENTER_GETEVENTS);
    if (ret < 0) {
        perror("io_uring_enter");
        return -1;
    }

    return ret;
}

int main(int argc, char *argv[]) {
    int res;

    /* Set up io_uring for use */
    if (app_setup_uring()) {
        fprintf(stderr, "Unable to setup uring!\\n");
        return 1;
    }

    /*
     * A while loop that reads from stdin and writes to stdout.
     * Breaks on EOF.
     */
    while (1) {
        /* Initiate read from stdin and wait for it to complete */
        submit_to_sq(STDIN_FILENO, IORING_OP_READ);
        /* Read completion queue entry */
        res = read_from_cq();
        if (res > 0) {
            /* Read successful. Write to stdout. */
            submit_to_sq(STDOUT_FILENO, IORING_OP_WRITE);
            read_from_cq();
        } else if (res == 0) {
            /* reached EOF */
            break;
        } else if (res < 0) {
            /* Error reading file */
            fprintf(stderr, "Error: %s\\n", strerror(abs(res)));
            break;
        }
        offset += res;
    }

    return 0;
}
.EE
.SH SEE ALSO
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)