DMAengine controller documentation
==================================

Hardware Introduction
+++++++++++++++++++++

Most slave DMA controllers follow the same general principles of
operation.

They have a given number of channels to use for the DMA transfers, and
a given number of request lines.

Requests and channels are pretty much orthogonal. A channel can be
used to serve several requests, or even any request. To simplify,
channels are the entities that do the copy, and requests identify
which endpoints are involved.

The request lines actually correspond to physical lines going from the
DMA-eligible devices to the controller itself. Whenever a device
wants to start a transfer, it asserts a DMA request (DRQ) by
asserting that request line.

A very simple DMA controller would only take into account a single
parameter: the transfer size. At each clock cycle, it would transfer a
byte of data from one buffer to another, until the transfer size has
been reached.

That wouldn't work well in the real world, since slave devices might
require a specific number of bits to be transferred in a single
cycle. For example, we may want to transfer as much data as the
physical bus allows to maximize performance when doing a simple
memory copy operation, but our audio device could have a narrower FIFO
that requires data to be written exactly 16 or 24 bits at a time. This
is why most if not all of the DMA controllers can adjust this, using a
parameter called the transfer width.

Moreover, some DMA controllers, whenever the RAM is used as a source
or destination, can group the reads or writes in memory into a buffer,
so instead of having a lot of small memory accesses, which is not
really efficient, you'll get several bigger transfers. This is done
using a parameter called the burst size, which defines how many single
reads/writes the controller is allowed to do without splitting the
transfer into smaller sub-transfers.

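To make the interaction between these parameters more concrete, here
is a small userspace sketch that counts the accesses and bursts needed
for a transfer. The structure and function names are hypothetical, and
the sketch assumes that the transfer size is a multiple of the
transfer width and that the burst size is expressed as a number of
accesses. Real controllers may differ on both points.

```c
#include <stddef.h>

/* Hypothetical description of a transfer; not a kernel structure. */
struct xfer_params {
	size_t size;   /* transfer size, in bytes */
	size_t width;  /* transfer width, in bytes per access (non-zero) */
	size_t burst;  /* burst size, in accesses per burst (non-zero) */
};

/* Number of single accesses: one per 'width' bytes. */
static size_t xfer_accesses(const struct xfer_params *p)
{
	return p->size / p->width;
}

/* Number of bursts: accesses grouped by 'burst', the last may be short. */
static size_t xfer_bursts(const struct xfer_params *p)
{
	size_t accesses = xfer_accesses(p);

	return (accesses + p->burst - 1) / p->burst;
}
```

With a 64 byte transfer, a 2 byte width and a burst size of 8, this
model gives 32 accesses grouped into 4 bursts.
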
Our theoretical DMA controller would then only be able to do transfers
that involve a single contiguous block of data. However, some of the
transfers we usually deal with are not contiguous: they copy data from
non-contiguous buffers to a contiguous buffer, which is called
scatter-gather.

DMAEngine, at least for mem2dev transfers, requires support for
scatter-gather. So we're left with two cases here: either we have a
quite simple DMA controller that doesn't support it, and we'll have to
implement it in software, or we have a more advanced DMA controller
that implements scatter-gather in hardware.

The latter are usually programmed using a collection of chunks to
transfer, and whenever the transfer is started, the controller will go
over that collection, doing whatever we programmed there.

This collection is usually either a table or a linked list. You will
then push either the address of the table and its number of elements,
or the first item of the list to one channel of the DMA controller,
and whenever a DRQ is asserted, it will go through the collection
to know where to fetch the data from.

Either way, the format of this collection is completely dependent on
your hardware. Each DMA controller will require a different structure,
but all of them will require, for every chunk, at least the source and
destination addresses, whether it should increment these addresses or
not, and the three parameters we saw earlier: the burst size, the
transfer width and the transfer size.

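As an illustration only, the per-chunk information listed above could
be modelled in userspace as follows. The layout of struct sw_desc and
the sw_desc_run helper are hypothetical: a real descriptor format is
dictated by the hardware, and this sketch simply walks a linked list
the way a software scatter-gather fallback might.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical linked-list descriptor; real layouts are hardware specific. */
struct sw_desc {
	const uint8_t *src;    /* source address */
	uint8_t *dst;          /* destination address */
	int src_inc;           /* non-zero: increment source after each access */
	int dst_inc;           /* non-zero: increment destination after each access */
	size_t width;          /* transfer width, in bytes */
	size_t burst;          /* burst size, in accesses (ignored by this model) */
	size_t size;           /* transfer size of this chunk, in bytes */
	struct sw_desc *next;  /* next chunk, NULL at the end of the list */
};

/* Walk the list one access at a time; return the number of bytes copied. */
static size_t sw_desc_run(const struct sw_desc *d)
{
	size_t total = 0;

	for (; d; d = d->next) {
		const uint8_t *src = d->src;
		uint8_t *dst = d->dst;
		size_t done;

		for (done = 0; done + d->width <= d->size; done += d->width) {
			memcpy(dst, src, d->width);
			if (d->src_inc)
				src += d->width;
			if (d->dst_inc)
				dst += d->width;
		}
		total += done;
	}
	return total;
}
```

Chaining two descriptors whose sources are separate buffers and whose
destinations are adjacent gathers both buffers into one contiguous
block. Clearing dst_inc instead models writes to a fixed device
register.
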
One last thing: slave devices usually won't issue DRQs by default, and
you have to enable this in your slave device driver first whenever you
want to use DMA.

These were just the general memory-to-memory (also called mem2mem) or
memory-to-device (mem2dev) kinds of transfers. Most devices support
other kinds of transfers or memory operations that dmaengine supports,
and these will be detailed later in this document.

DMA Support in Linux
++++++++++++++++++++

Historically, DMA controller drivers have been implemented using the
async TX API, to offload operations such as memory copy, XOR,
cryptography, etc., basically any memory to memory operation.

Over time, the need for memory to device transfers arose, and
dmaengine was extended. Nowadays, the async TX API is written as a
layer on top of dmaengine, and acts as a client. Still, dmaengine
accommodates that API in some cases, and made some design choices to
ensure that it stayed compatible.

For more information on the Async TX API, please look at the relevant
documentation file in Documentation/crypto/async-tx-api.txt.

DMAEngine Registration
++++++++++++++++++++++

struct dma_device Initialization
--------------------------------

Just like any other kernel framework, the whole DMAEngine registration
relies on the driver filling a structure and registering against the
framework. In our case, that structure is dma_device.

The first thing you need to do in your driver is to allocate this
structure. Any of the usual memory allocators will do, but you'll also
need to initialize a few fields in there:

  * channels: should be initialized as a list using the
              INIT_LIST_HEAD macro for example

  * src_addr_widths:
    - should contain a bitmask of the supported source transfer widths

  * dst_addr_widths:
    - should contain a bitmask of the supported destination transfer
      widths

  * directions:
    - should contain a bitmask of the supported slave directions
      (i.e. excluding mem2mem transfers)

  * residue_granularity:
    - Granularity of the transfer residue reported to dma_set_residue.
    - This can be either:
      + Descriptor
        -> Your device doesn't support any kind of residue
           reporting. The framework will only know that a particular
           transaction descriptor is done.
      + Segment
        -> Your device is able to report which chunks have been
           transferred
      + Burst
        -> Your device is able to report which bursts have been
           transferred

  * dev: should hold the pointer to the struct device associated
         to your current driver instance.

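To illustrate what the three residue granularities mean in practice,
the userspace sketch below computes the residue a driver could report
at each level. The enum and function names are stand-ins invented for
this example, the segments are assumed to be equally sized, and
returning 0 at descriptor granularity reflects the assumption that
such a driver reports no residue at all.

```c
#include <stddef.h>

/* Stand-ins for the three granularities described above. */
enum residue_granularity {
	GRANULARITY_DESCRIPTOR,
	GRANULARITY_SEGMENT,
	GRANULARITY_BURST,
};

/*
 * Residue that could be reported for a descriptor of 'total' bytes,
 * given how many bytes the hardware has moved so far.
 */
static size_t reported_residue(enum residue_granularity g, size_t total,
			       size_t seg_bytes, size_t burst_bytes,
			       size_t moved)
{
	size_t unit;

	switch (g) {
	case GRANULARITY_SEGMENT:
		unit = seg_bytes;
		break;
	case GRANULARITY_BURST:
		unit = burst_bytes;
		break;
	default:
		/* No residue reporting: only completion is known. */
		return 0;
	}

	/* Only whole units are known to have been transferred. */
	return total - (moved / unit) * unit;
}
```

For a 4096 byte descriptor made of 1024 byte segments and 64 byte
bursts, with 1500 bytes already moved, this model reports 3072 bytes
left at segment granularity and 2624 bytes left at burst granularity.
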
Supported transaction types
---------------------------

The next thing you need is to set which transaction types your device
(and driver) supports.

Our dma_device structure has a field called cap_mask that holds the
various types of transaction supported, and you need to modify this
mask using the dma_cap_set function, with various flags depending on
transaction types you support as an argument.

All those capabilities are defined in the dma_transaction_type enum,
in include/linux/dmaengine.h.

Currently, the types available are:
  * DMA_MEMCPY
    - The device is able to do memory to memory copies

  * DMA_XOR
    - The device is able to perform XOR operations on memory areas
    - Used to accelerate XOR intensive tasks, such as RAID5

  * DMA_XOR_VAL
    - The device is able to perform parity check using the XOR
      algorithm against a memory buffer.

  * DMA_PQ
    - The device is able to perform RAID6 P+Q computations, P being a
      simple XOR, and Q being a Reed-Solomon algorithm.

  * DMA_PQ_VAL
    - The device is able to perform parity check using RAID6 P+Q
      algorithm against a memory buffer.

  * DMA_INTERRUPT
    - The device is able to trigger a dummy transfer that will
      generate periodic interrupts
    - Used by the client drivers to register a callback that will be
      called on a regular basis through the DMA controller interrupt

  * DMA_SG
    - The device supports memory to memory scatter-gather
      transfers.
    - Even though a plain memcpy can look like a particular case of a
      scatter-gather transfer, with a single chunk to transfer, it's a
      distinct transaction type in the mem2mem transfers case

  * DMA_PRIVATE
    - The device only supports slave transfers, and as such isn't
      available for async transfers.

  * DMA_ASYNC_TX
    - Must not be set by the device, and will be set by the framework
      if needed
    - /* TODO: What is it about? */

  * DMA_SLAVE
    - The device can handle device to memory transfers, including
      scatter-gather transfers.
    - While in the mem2mem case we had two distinct types to
      deal with a single chunk to copy or a collection of them, here,
      we just have a single transaction type that is supposed to
      handle both.
    - If you want to transfer a single contiguous memory buffer,
      simply build a scatter list with only one item.

  * DMA_CYCLIC
    - The device can handle cyclic transfers.
    - A cyclic transfer is a transfer where the chunk collection will
      loop over itself, with the last item pointing to the first.
    - It's usually used for audio transfers, where you want to operate
      on a single ring buffer that you will fill with your audio data.

  * DMA_INTERLEAVE
    - The device supports interleaved transfers.
    - These transfers can transfer data from a non-contiguous buffer
      to a non-contiguous buffer, as opposed to DMA_SLAVE that can
      transfer data from a non-contiguous data set to a contiguous
      destination buffer.
    - It's usually used for 2d content transfers, in which case you
      want to transfer a portion of uncompressed data directly to the
      display to print it

These various types will also affect how the source and destination
addresses change over time.

Addresses pointing to RAM are typically incremented (or decremented)
after each transfer. In case of a ring buffer, they may loop
(DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO)
are typically fixed.

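The three address behaviours described above can be sketched in
userspace as a single helper that returns the address of the next
access. The enum and the next_addr function are hypothetical names
used for illustration, and the sketch only models incrementing
addresses, not decrementing ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address modes, named after the cases described above. */
enum addr_mode {
	ADDR_FIXED,      /* device register, e.g. a FIFO */
	ADDR_INCREMENT,  /* RAM buffer */
	ADDR_CYCLIC,     /* RAM ring buffer, wraps at the end */
};

/* Return the address to use for the access following the one at 'addr'. */
static uintptr_t next_addr(enum addr_mode mode, uintptr_t addr,
			   uintptr_t base, size_t buf_len, size_t width)
{
	switch (mode) {
	case ADDR_INCREMENT:
		return addr + width;
	case ADDR_CYCLIC:
		/* Wrap back to 'base' once the end of the ring is reached. */
		return base + (addr - base + width) % buf_len;
	default:
		return addr;
	}
}
```

In the cyclic case, base and buf_len describe the ring buffer. They
are ignored for the two other modes.
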
Device operations
-----------------

Our dma_device structure also requires a few function pointers in
order to implement the actual logic, now that we described what
operations we were able to perform.

The functions that we have to fill in there, and hence have to
implement, obviously depend on the transaction types you reported as
supported.

  * device_alloc_chan_resources
  * device_free_chan_resources
    - These functions will be called whenever a driver calls
      dma_request_channel or dma_release_channel for the first/last
      time on the channel associated to that driver.
    - They are in charge of allocating/freeing all the needed
      resources in order for that channel to be useful for your
      driver.
    - These functions can sleep.

  * device_prep_dma_*
    - These functions match the capabilities you registered
      previously.
    - These functions all take the buffer or the scatterlist relevant
      for the transfer being prepared, and should create a hardware
      descriptor or a list of hardware descriptors from it
    - These functions can be called from an interrupt context
    - Any allocation you might do should be using the GFP_NOWAIT
      flag, in order not to potentially sleep, but without depleting
      the emergency pool either.
    - Drivers should try to pre-allocate any memory they might need
      during the transfer setup at probe time to avoid putting too
      much pressure on the nowait allocator.

    - It should return a unique instance of the
      dma_async_tx_descriptor structure, that further represents this
      particular transfer.

    - This structure can be initialized using the function
      dma_async_tx_descriptor_init.
    - You'll also need to set two fields in this structure:
      + flags:
        TODO: Can it be modified by the driver itself, or
        should it be always the flags passed in the arguments

      + tx_submit: A pointer to a function you have to implement,
                   that is supposed to push the current
                   transaction descriptor to a pending queue, waiting
                   for issue_pending to be called.

  * device_issue_pending
    - Takes the first transaction descriptor in the pending queue,
      and starts the transfer. Whenever that transfer is done, it
      should move to the next transaction in the list.
    - This function can be called in an interrupt context

  * device_tx_status
    - Should report the bytes left to go over on the given channel
    - Should only care about the transaction descriptor passed as
      argument, not the currently active one on a given channel
    - The tx_state argument might be NULL
    - Should use dma_set_residue to report it
    - In the case of a cyclic transfer, it should only take into
      account the current period.
    - This function can be called in an interrupt context.

  * device_config
    - Reconfigures the channel with the configuration given as
      argument
    - This command should NOT perform synchronously, or on any
      currently queued transfers, but only on subsequent ones
    - In this case, the function will receive a dma_slave_config
      structure pointer as an argument, that will detail which
      configuration to use.
    - Even though that structure contains a direction field, this
      field is deprecated in favor of the direction argument given to
      the prep_* functions
    - This call is mandatory for slave operations only. This should NOT be
      set or expected to be set for memcpy operations.
      If a driver supports both, it should use this call for slave
      operations only and not for memcpy ones.

  * device_pause
    - Pauses a transfer on the channel
    - This command should operate synchronously on the channel,
      pausing right away the work of the given channel

  * device_resume
    - Resumes a transfer on the channel
    - This command should operate synchronously on the channel,
      resuming right away the work of the given channel

  * device_terminate_all
    - Aborts all the pending and ongoing transfers on the channel
    - This command should operate synchronously on the channel,
      terminating right away all the transfers on the channel

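The split between tx_submit and device_issue_pending described above
can be sketched with a small userspace model. Everything here is
hypothetical (struct model_chan, the fixed-size queue, the integer
identifiers): a real driver would rely on kernel lists, locking and
its own hardware programming. The model only shows the ordering:
submitting queues a descriptor, issuing starts the first one, and the
completion path moves on to the next.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8

/* Hypothetical channel state; not a kernel structure. */
struct model_chan {
	int submitted[QUEUE_DEPTH]; /* descriptors queued by tx_submit */
	size_t nr_submitted;
	int active;                 /* descriptor being transferred, 0 if idle */
	int next_id;                /* last identifier handed out */
};

/* tx_submit: queue a new descriptor, but do not start it. */
static int model_tx_submit(struct model_chan *c)
{
	if (c->nr_submitted == QUEUE_DEPTH)
		return -1;
	c->submitted[c->nr_submitted++] = ++c->next_id;
	return c->next_id;
}

/* Start the first pending descriptor if the channel is idle. */
static void model_start_next(struct model_chan *c)
{
	size_t i;

	if (c->active || !c->nr_submitted)
		return;
	c->active = c->submitted[0];
	for (i = 1; i < c->nr_submitted; i++)
		c->submitted[i - 1] = c->submitted[i];
	c->nr_submitted--;
}

/* device_issue_pending: kick the (modelled) hardware. */
static void model_issue_pending(struct model_chan *c)
{
	model_start_next(c);
}

/* Transfer-done interrupt: retire the active descriptor, start the next. */
static void model_xfer_done(struct model_chan *c)
{
	c->active = 0;
	model_start_next(c);
}
```

Note that nothing starts until model_issue_pending is called, even if
several descriptors have been submitted.
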
Misc notes (stuff that should be documented, but don't really know
where to put them)
------------------------------------------------------------------
  * dma_run_dependencies
    - Should be called at the end of an async TX transfer, and can be
      ignored in the slave transfers case.
    - Makes sure that dependent operations are run before marking it
      as complete.

  * dma_cookie_t
    - It's a DMA transaction ID that will increment over time.
    - Not really relevant any more since the introduction of virt-dma
      that abstracts it away.

  * DMA_CTRL_ACK
    - If set, the transfer can be reused after being completed.
    - There is a guarantee the transfer won't be freed until it is acked
      by async_tx_ack().
    - As a consequence, if a device driver wants to skip the dma_map_sg() and
      dma_unmap_sg() in between 2 transfers, because the DMA'd data wasn't used,
      it can resubmit the transfer right after its completion.

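As a rough userspace illustration of a transaction ID that increments
over time, the sketch below hands out increasing identifiers and
checks whether a given one has completed. The type and function names
are stand-ins, and the wrap-around to a minimum value as well as the
completion test are assumptions of this model, not a description of
what the framework or virt-dma actually do.

```c
#include <stdint.h>

typedef int32_t model_cookie_t;  /* stand-in for dma_cookie_t */

#define MODEL_MIN_COOKIE 1       /* assumed smallest valid identifier */

/* Hand out the identifier following 'last', wrapping before overflow. */
static model_cookie_t cookie_next(model_cookie_t last)
{
	return last == INT32_MAX ? MODEL_MIN_COOKIE : last + 1;
}

/*
 * Tell whether 'cookie' has completed, given the last completed and
 * the last handed out identifiers. The second branch covers the case
 * where the counter has wrapped between the two.
 */
static int cookie_is_complete(model_cookie_t cookie,
			      model_cookie_t last_complete,
			      model_cookie_t last_used)
{
	if (last_complete <= last_used)
		return cookie <= last_complete || cookie > last_used;
	return cookie <= last_complete && cookie > last_used;
}
```
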
General Design Notes
--------------------

Most of the DMAEngine drivers you'll see are based on a similar design
that handles the end of transfer interrupts in the handler, but defers
most work to a tasklet, including the start of a new transfer whenever
the previous transfer ended.

This is a rather inefficient design though, because the inter-transfer
latency will be not only the interrupt latency, but also the
scheduling latency of the tasklet, which will leave the channel idle
in between, which will slow down the global transfer rate.

You should avoid this kind of practice, and instead of electing a new
transfer in your tasklet, move that part to the interrupt handler in
order to have a shorter idle window (that we can't really avoid
anyway).

Glossary
--------

Burst: A number of consecutive read or write operations
       that can be queued to buffers before being flushed to
       memory.
Chunk: A contiguous collection of bursts
Transfer: A collection of chunks (be it contiguous or not)