Oracle Data Analytics Accelerator (DAX)
---------------------------------------

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats. A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:
    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
-------------------

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses. The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates whether the request was submitted successfully or
whether there was an error. One of the addresses given in each CCB is
a pointer to a "completion area", which is a 128 byte memory block
that is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished. The
M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later. The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it. The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
-----------------

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical. The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs. When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. Instead, a CCB may
contain either the virtual or real addresses of the buffers, or a
combination of them. An "address type" field is available for each
address that may be given in the CCB. In all cases, the Hypervisor
will translate all the addresses to physical before dispatching to
hardware. Address translations are performed using the context of the
process initiating the request.


The Driver API
--------------

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use. This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests. When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.

CCB_DEQUEUE

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources. No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL

Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.

Submission of an array of CCBs for execution

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information. The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
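
The three outcomes described above can be told apart from the return
value alone. The following is a minimal sketch and is not part of the
driver or of libdax: the enum, classify_submit() and ccbs_accepted()
are hypothetical names, and CCB_SIZE is taken to be 64 bytes because a
CCB is eight 64-bit words.

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64    /* eight 64-bit words per CCB */

/* Hypothetical names, used only for this illustration. */
enum submit_outcome {
    SUBMIT_ERROR,      /* write() returned -1, consult errno */
    SUBMIT_PARTIAL,    /* short count, read() holds the status */
    SUBMIT_COMPLETE    /* whole array accepted, do not call read() */
};

/* ret is the value returned by write() or pwrite(); requested is the
 * byte length that was passed to that call. */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
    if (ret < 0)
        return SUBMIT_ERROR;
    if ((size_t)ret < requested)
        return SUBMIT_PARTIAL;
    return SUBMIT_COMPLETE;
}

/* Number of whole CCBs that were accepted. */
static size_t ccbs_accepted(ssize_t ret)
{
    return ret < 0 ? 0 : (size_t)ret / CCB_SIZE;
}
```

Only the SUBMIT_PARTIAL case calls for a subsequent read(); the CCBs
counted by ccbs_accepted() are already queued and still need to be
polled and dequeued as usual.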

MMAP

The mmap() function provides access to the completion area allocated
in the driver. Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.


Completion of a Request
-----------------------

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If
the block of data containing the monitored location is modified, then
the mwait terminates. This causes software to resume execution
immediately (without a context switch or kernel to user transition)
after a transaction completes. Thus the latency between transaction
completion and resumption of execution may be just a few nanoseconds.
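
The two instructions combine naturally into a polling loop. The sketch
below bounds the loop so that a request which never completes cannot
hang the caller. The function name dax_wait() and the max_polls
parameter are assumptions made for this illustration. The non-SPARC
branch, a plain volatile load with no wait, exists only so that the
control flow can be exercised on other machines; it provides none of
the latency benefits described above.

```c
/* Returns the first non-zero status byte observed, or 0 if max_polls
 * iterations pass while the command is still in progress. */
static unsigned char dax_wait(const volatile unsigned char *ca,
                              unsigned long max_polls)
{
    unsigned long i;
    unsigned long status;

    for (i = 0; i < max_polls; i++) {
#if defined(__sparc__) && defined(__arch64__)
        /* Monitored load: ASI 0x84 is ASI_MONITOR_PRIMARY */
        __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                             : "=r" (status)
                             : "r" (ca));
#else
        status = ca[0];    /* fallback: ordinary load */
#endif
        if (status)        /* 0 indicates command in progress */
            return (unsigned char)status;
#if defined(__sparc__) && defined(__arch64__)
        /* mwait: pause up to 1000 ns or until the line is modified */
        __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);
#endif
    }
    return 0;
}
```

A return value of 0 means the bound was reached; the caller can then
decide whether to keep waiting or to issue CCB_INFO or CCB_KILL.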


Application Life Cycle of a DAX Submission
------------------------------------------

- open dax device
- call mmap() to get the completion area address
- allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
- submit CCB via write() or pwrite()
- go into a loop executing monitored load + monitored wait and
  terminate when the command status indicates the request is complete
  (CCB_KILL or CCB_INFO may be used any time as necessary)
- perform a CCB_DEQUEUE
- call munmap() for completion area
- close the dax device


Memory Constraints
------------------

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.
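
An application can estimate the effect of this limit before building a
CCB. The helpers below are a minimal sketch with hypothetical names;
they assume the caller already knows the page size backing the buffer
and that it is a power of two. They only model the rule stated above
and do not query the Hypervisor.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes available from addr up to the end of the page containing it.
 * page_size must be a power of two. */
static size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
    return page_size - (size_t)(addr & (page_size - 1));
}

/* Length that can be processed for a buffer of len bytes starting at
 * addr without crossing the page boundary. */
static size_t dax_usable_len(uintptr_t addr, size_t len, size_t page_size)
{
    size_t room = bytes_to_page_end(addr, page_size);

    return len < room ? len : room;
}
```

If dax_usable_len() returns less than the buffer length, the request
would have to be split into several CCBs or the buffer moved to a
larger page.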

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page. A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
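
The "first half or second half" rule can be checked with simple
address arithmetic. This is a sketch only: fits_in_4mb_half() is a
hypothetical helper, and it assumes the 8MB huge page is 8MB aligned,
so that its two halves start on 4MB boundaries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4MB ((uintptr_t)4 << 20)    /* underlying hardware page */

/* True if [addr, addr + len) lies inside a single 4MB hardware page,
 * that is, entirely within one half of an 8MB huge page. */
static bool fits_in_4mb_half(uintptr_t addr, size_t len)
{
    if (len == 0 || len > HW_PAGE_4MB)
        return false;
    return (addr / HW_PAGE_4MB) == ((addr + len - 1) / HW_PAGE_4MB);
}
```

A 4MB buffer placed 2MB into the huge page fails this check, which is
the "chunk in the middle" case described above.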


CCB Structure
-------------

A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs:

    struct ccb {
        u64 control;
        u64 completion;
        u64 input0;
        u64 access;
        u64 input1;
        u64 op_data;
        u64 output;
        u64 table;
    };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
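
The u64 type is a kernel type, so user code that builds CCBs by hand
needs its own declaration. The following is a minimal sketch of a
user-space mirror of the layout above, written with the <stdint.h>
types and checked at compile time (C11 _Static_assert). It is an
illustration only; applications using libdax would rely on the
library's own definitions instead.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the CCB layout shown above. */
struct ccb {
    uint64_t control;       /* opcode, address types, formats */
    uint64_t completion;    /* completion area address */
    uint64_t input0;        /* primary input address */
    uint64_t access;        /* data access control */
    uint64_t input1;        /* secondary input address */
    uint64_t op_data;       /* operation specific data */
    uint64_t output;        /* output address */
    uint64_t table;         /* table address */
};

/* Submissions are made in units of this size. */
_Static_assert(sizeof(struct ccb) == 64, "a CCB is eight 64-bit words");
_Static_assert(offsetof(struct ccb, output) == 48, "output is word 6");
```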

The first word (control) is examined by the driver for the following:
- CCB version, which must be consistent with hardware version
- Opcode, which must be one of the documented allowable commands
- Address types, which must be set to "virtual" for all the addresses
  given by the user, thereby ensuring that the application can
  only access memory that it owns


Example Code
------------

The DAX is accessible to both user and kernel code. The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed:

    fd = open("/dev/oradax1", O_RDWR);
    if (fd < 0)
        fd = open("/dev/oradax2", O_RDWR);
    if (fd < 0) {
        /* No DAX found */
    }

Next, the completion area must be mapped:

    completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page, since as explained above, the DAX is strictly constrained by
virtual page boundaries. In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
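
One way to satisfy the alignment and size rule for the output buffer
is sketched below using the C11 aligned_alloc() function. The helper
names are hypothetical. Note that this only handles the 64-byte
requirement: a buffer obtained this way is not guaranteed to sit
inside a single page, so the page constraint must still be checked
separately (or a page-aligned mapping used instead).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DAX_OUT_ALIGN 64    /* cache line size stated above */

/* Round a byte count up to the next multiple of 64. */
static size_t round_up_64(size_t n)
{
    return (n + DAX_OUT_ALIGN - 1) & ~(size_t)(DAX_OUT_ALIGN - 1);
}

/* Allocate a zeroed, 64-byte aligned buffer big enough for a bitmap
 * of nbits bits. The allocated size is stored in *out_len. Returns
 * NULL on failure; release the buffer with free(). */
static void *alloc_output_bitmap(size_t nbits, size_t *out_len)
{
    size_t len = round_up_64((nbits + 7) / 8);
    void *buf;

    if (len == 0)
        len = DAX_OUT_ALIGN;
    buf = aligned_alloc(DAX_OUT_ALIGN, len);
    if (buf == NULL)
        return NULL;
    memset(buf, 0, len);
    *out_len = len;
    return buf;
}
```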

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, so the output bitmap is the inverse of
the input bitmap.
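
Because the expected result is simply the complement of the input, the
coprocessor's output for this example can be checked in software. The
function below is a hypothetical test aid, not part of the driver or
library. It assumes that input and output pack bits into bytes in the
same order, and it compares whole bytes only, leaving any trailing
partial byte unchecked because its padding is not described here.

```c
#include <stddef.h>

/* Returns 1 if every whole byte of output is the bitwise complement
 * of the corresponding input byte, 0 otherwise. */
static int scan_zero_output_ok(const unsigned char *input,
                               const unsigned char *output, size_t nbits)
{
    size_t nbytes = nbits / 8;    /* trailing partial byte is skipped */
    size_t i;

    for (i = 0; i < nbytes; i++) {
        if (output[i] != (unsigned char)~input[i])
            return 0;
    }
    return 1;
}
```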

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail.

    ccb->control =       /* Table 36.1, CCB Header Format */
          (2L << 48)     /* command = Scan Value */
        | (3L << 40)     /* output address type = primary virtual */
        | (3L << 34)     /* primary input address type = primary virtual */
                         /* Section 36.2.1, Query CCB Command Formats */
        | (1 << 28)      /* 36.2.1.1.1 primary input format = fixed width bit packed */
        | (0 << 23)      /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
        | (8 << 10)      /* 36.2.1.1.6 output format = bit vector */
        | (0 <<  5)      /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
        | (31 << 0);     /* 36.2.1.3 Disable second scan criteria */

    ccb->completion = 0;    /* Completion area address, to be filled in by driver */

    ccb->input0 = (unsigned long) input; /* primary input address */

    ccb->access =       /* Section 36.2.1.2, Data Access Control */
          (2 << 24)     /* Primary input length format = bits */
        | (nbits - 1);  /* number of bits in primary input stream, minus 1 */

    ccb->input1 = 0;       /* secondary input address, unused */

    ccb->op_data = 0;      /* scan criteria (value to be matched) */

    ccb->output = (unsigned long) output;   /* output address */

    ccb->table = 0;        /* table address, unused */
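
The literal shift counts used above are easier to audit when they are
given names. The sketch below does that for the control and access
words of this particular Scan Value request. All macro and function
names are hypothetical; only the bit positions and values are taken
from the example, and the field meanings are those given in its
comments.

```c
#include <stdint.h>

/* Hypothetical names for the bit positions used in the example. */
#define CTL_CMD_SHIFT          48    /* command */
#define CTL_OUT_TYPE_SHIFT     40    /* output address type */
#define CTL_IN0_TYPE_SHIFT     34    /* primary input address type */
#define CTL_IN0_FMT_SHIFT      28    /* primary input format */
#define CTL_IN0_ELEM_SHIFT     23    /* primary input element size */
#define CTL_OUT_FMT_SHIFT      10    /* output format */
#define CTL_CRIT1_SIZE_SHIFT    5    /* first scan criteria size */
#define CTL_CRIT2_SIZE_SHIFT    0    /* second scan criteria size */
#define ACC_IN0_LENFMT_SHIFT   24    /* primary input length format */

/* Control word for the 1-bit Scan Value request shown above. */
static uint64_t scan_value_control(void)
{
    return ((uint64_t)2 << CTL_CMD_SHIFT)           /* Scan Value */
         | ((uint64_t)3 << CTL_OUT_TYPE_SHIFT)      /* primary virtual */
         | ((uint64_t)3 << CTL_IN0_TYPE_SHIFT)      /* primary virtual */
         | ((uint64_t)1 << CTL_IN0_FMT_SHIFT)       /* fixed width bit packed */
         | ((uint64_t)0 << CTL_IN0_ELEM_SHIFT)      /* 1 bit elements */
         | ((uint64_t)8 << CTL_OUT_FMT_SHIFT)       /* bit vector */
         | ((uint64_t)0 << CTL_CRIT1_SIZE_SHIFT)    /* 1 byte criteria */
         | ((uint64_t)31 << CTL_CRIT2_SIZE_SHIFT);  /* second criteria off */
}

/* Access word: length format "bits", length given as nbits - 1.
 * nbits must be at least 1. */
static uint64_t scan_access(uint64_t nbits)
{
    return ((uint64_t)2 << ACC_IN0_LENFMT_SHIFT) | (nbits - 1);
}
```

Using uint64_t for every term also avoids relying on the width of int
or long when the shifted values are combined.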

The CCB submission is a write() or pwrite() system call to the
driver. If the CCB is not accepted in full, then a read() must be
used to retrieve the status:

    if (pwrite(fd, ccb, 64, 0) != 64) {
        struct ccb_exec_result status;
        read(fd, &status, sizeof(status));
        /* bail out */
    }

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document.

    while (1) {
        /* Monitored Load */
        __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                             : "=r" (status)
                             : "r"  (completion_area));

        if (status)     /* 0 indicates command in progress */
            break;

        /* MWAIT */
        __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
    }

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2.

    if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
        /* completion_area[0] contains the completion status */
        /* completion_area[1] contains an error code, see 36.2.2 */
    }

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation:

    struct dax_command cmd;
    cmd.command = CCB_DEQUEUE;
    if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
        /* bail out */
    }

Finally, normal program cleanup should be done, i.e., unmapping the
completion area, closing the dax device, freeing memory, etc.

[Kernel example]

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB:

    ccb->control |=      /* Table 36.1, CCB Header Format */
        (3L << 32);      /* completion area address type = primary virtual */

    ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.

    #include <asm/hypervisor.h>

    hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
                             HV_CCB_QUERY_CMD |
                             HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
                             HV_CCB_VA_PRIVILEGED,
                             0, &bytes_accepted, &status_data);

    if (hv_rv != HV_EOK) {
        /* hv_rv is an error code, status_data contains */
        /* potential additional status, see 36.3.1.1 */
    }

After the submission, the completion area polling code is identical to
that in user land:

    while (1) {
        /* Monitored Load */
        __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                             : "=r" (status)
                             : "r"  (completion_area));

        if (status)     /* 0 indicates command in progress */
            break;

        /* MWAIT */
        __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
    }

    if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
        /* completion_area[0] contains the completion status */
        /* completion_area[1] contains an error code, see 36.2.2 */
    }

The output bitmap is ready for consumption immediately after the
completion status indicates success.