DMAengine controller documentation
==================================

Hardware Introduction
+++++++++++++++++++++

Most slave DMA controllers follow the same general principles of
operation.

They have a given number of channels to use for the DMA transfers,
and a given number of request lines.

Requests and channels are pretty much orthogonal: a channel can be
used to serve any number of requests. To put it simply, channels are
the entities that will be doing the copy, and requests define which
endpoints are involved.

The request lines actually correspond to physical lines going from
the DMA-eligible devices to the controller itself. Whenever the
device wants to start a transfer, it asserts a DMA request (DRQ) by
raising that request line.

A very simple DMA controller would only take into account a single
parameter: the transfer size. At each clock cycle, it would transfer
a byte of data from one buffer to another, until the transfer size
has been reached.

That wouldn't work well in the real world, since slave devices might
require a specific number of bits to be transferred in a single
cycle. For example, we may want to transfer as much data as the
physical bus allows to maximize performance when doing a simple
memory copy operation, but our audio device could have a narrower
FIFO that requires data to be written exactly 16 or 24 bits at a
time. This is why most if not all DMA controllers can adjust this,
using a parameter called the transfer width.

Moreover, some DMA controllers, whenever RAM is used as a source or
destination, can group the reads or writes in memory into a buffer,
so instead of having a lot of small memory accesses, which is not
really efficient, you'll get several bigger transfers. This is done
using a parameter called the burst size, which defines how many
single reads/writes the controller is allowed to do in a row before
splitting the transfer into smaller sub-transfers. For example, with
a 4-byte transfer width and a burst size of 8, each burst moves 32
bytes.

Our theoretical DMA controller would then only be able to do
transfers that involve a single contiguous block of data. However,
some of the transfers we usually need are not contiguous: we may
want to copy data from a set of non-contiguous buffers to a
contiguous buffer, which is called scatter-gather.

DMAEngine, at least for mem2dev transfers, requires support for
scatter-gather. So we're left with two cases here: either we have a
quite simple DMA controller that doesn't support it, and we'll have
to implement it in software, or we have a more advanced DMA
controller that implements scatter-gather in hardware.

The latter are usually programmed using a collection of chunks to
transfer, and whenever the transfer is started, the controller will
go over that collection, doing whatever we programmed there.

This collection is usually either a table or a linked list. You will
then push either the address of the table and its number of
elements, or the first item of the list, to one channel of the DMA
controller, and whenever a DRQ is asserted, it will go through the
collection to know where to fetch the data from.

Either way, the format of this collection is completely dependent on
your hardware. Each DMA controller will require a different
structure, but all of them will require, for every chunk, at least
the source and destination addresses, whether these addresses should
be incremented or not, and the three parameters we saw earlier: the
burst size, the transfer width and the transfer size.

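As an illustration only, one item of such a collection could look
like the sketch below; struct foo_hw_desc is a made-up name, and
every real controller defines its own layout:

    /* Purely illustrative hardware scatter-gather item: each field
     * maps to one of the parameters described above. */
    struct foo_hw_desc {
            u32 src_addr;   /* source address of the chunk */
            u32 dst_addr;   /* destination address of the chunk */
            u32 cfg;        /* transfer width, burst size and
                             * address increment flags */
            u32 len;        /* transfer size, in bytes */
            u32 next;       /* bus address of the next item in the
                             * list, or 0 to stop */
    };
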
One last thing: slave devices usually won't assert a DRQ by default,
and you have to enable this in your slave device driver first
whenever you want to use DMA.

These were just the general memory-to-memory (also called mem2mem)
and memory-to-device (mem2dev) kinds of transfers. Most devices also
support other kinds of transfers or memory operations that dmaengine
supports, and those will be detailed later in this document.

DMA Support in Linux
++++++++++++++++++++

Historically, DMA controller drivers have been implemented using the
async TX API, to offload operations such as memory copy, XOR,
cryptography, etc., basically any memory to memory operation.

Over time, the need for memory to device transfers arose, and
dmaengine was extended. Nowadays, the async TX API is written as a
layer on top of dmaengine, and acts as a client. Still, dmaengine
accommodates that API in some cases, and made some design choices to
ensure that it stayed compatible.

For more information on the Async TX API, please refer to the
relevant documentation file, Documentation/crypto/async-tx-api.txt.

DMAEngine Registration
++++++++++++++++++++++

struct dma_device Initialization
--------------------------------

Just like any other kernel framework, the whole DMAEngine
registration relies on the driver filling a structure and
registering against the framework. In our case, that structure is
dma_device.

The first thing you need to do in your driver is to allocate this
structure. Any of the usual memory allocators will do, but you'll
also need to initialize a few fields in there:

  * channels:   should be initialized as a list using the
                INIT_LIST_HEAD macro for example

  * src_addr_widths:
    - should contain a bitmask of the supported source transfer
      widths

  * dst_addr_widths:
    - should contain a bitmask of the supported destination transfer
      widths

  * directions:
    - should contain a bitmask of the supported slave directions
      (i.e. excluding mem2mem transfers)

  * residue_granularity:
    - Granularity of the transfer residue reported to dma_set_residue.
    - This can be either:
      + Descriptor
        -> Your device doesn't support any kind of residue
           reporting. The framework will only know that a particular
           transaction descriptor is done.
      + Segment
        -> Your device is able to report which chunks have been
           transferred
      + Burst
        -> Your device is able to report which bursts have been
           transferred

  * dev:        should hold the pointer to the struct device
                associated to your current driver instance.

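Putting this together, a minimal initialization could look like the
sketch below. The platform driver probe context and the chosen
widths and directions are assumptions made for illustration; the
constants and functions come from the framework itself:

    struct dma_device *dd;

    /* A minimal sketch, assuming a platform driver probe */
    dd = devm_kzalloc(&pdev->dev, sizeof(*dd), GFP_KERNEL);
    if (!dd)
            return -ENOMEM;

    INIT_LIST_HEAD(&dd->channels);
    dd->src_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
                          BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
    dd->dst_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
    dd->directions = BIT(DMA_MEM_TO_DEV) | BIT(DMA_DEV_TO_MEM);
    dd->residue_granularity = DMA_RESIDUE_GRANULARITY_BURST;
    dd->dev = &pdev->dev;

    /* cap_mask and the callbacks described in the following
     * sections must be set up before registering */
    return dma_async_device_register(dd);
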
Supported transaction types
---------------------------

The next thing you need is to set which transaction types your
device (and driver) supports.

Our dma_device structure has a field called cap_mask that holds the
various types of transactions supported, and you need to modify this
mask using the dma_cap_set function, with the flags matching the
transaction types you support as arguments.

All those capabilities are defined in the dma_transaction_type enum,
in include/linux/dmaengine.h

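For instance, continuing the earlier sketch, a driver handling only
slave and cyclic transfers would do something like:

    dma_cap_zero(dd->cap_mask);
    dma_cap_set(DMA_SLAVE, dd->cap_mask);
    dma_cap_set(DMA_CYCLIC, dd->cap_mask);
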
Currently, the types available are:
  * DMA_MEMCPY
    - The device is able to do memory to memory copies

  * DMA_XOR
    - The device is able to perform XOR operations on memory areas
    - Used to accelerate XOR intensive tasks, such as RAID5

  * DMA_XOR_VAL
    - The device is able to perform parity check using the XOR
      algorithm against a memory buffer.

  * DMA_PQ
    - The device is able to perform RAID6 P+Q computations, P being a
      simple XOR, and Q being a Reed-Solomon algorithm.

  * DMA_PQ_VAL
    - The device is able to perform parity check using the RAID6 P+Q
      algorithm against a memory buffer.

  * DMA_INTERRUPT
    - The device is able to trigger a dummy transfer that will
      generate periodic interrupts
    - Used by the client drivers to register a callback that will be
      called on a regular basis through the DMA controller interrupt

  * DMA_SG
    - The device supports memory to memory scatter-gather transfers.
    - Even though a plain memcpy can look like a particular case of a
      scatter-gather transfer, with a single chunk to transfer, it's
      a distinct transaction type in the mem2mem transfers case

  * DMA_PRIVATE
    - The device only supports slave transfers, and as such isn't
      available for async transfers.

  * DMA_ASYNC_TX
    - Must not be set by the device, and will be set by the framework
      if needed
    - /* TODO: What is it about? */

  * DMA_SLAVE
    - The device can handle device to memory transfers, including
      scatter-gather transfers.
    - While in the mem2mem case we had two distinct types to deal
      with a single chunk to copy or a collection of them, here we
      just have a single transaction type that is supposed to handle
      both.
    - If you want to transfer a single contiguous memory buffer,
      simply build a scatter list with only one item.

  * DMA_CYCLIC
    - The device can handle cyclic transfers.
    - A cyclic transfer is a transfer where the chunk collection will
      loop over itself, with the last item pointing to the first.
    - It's usually used for audio transfers, where you want to
      operate on a single ring buffer that you will fill with your
      audio data.

  * DMA_INTERLEAVE
    - The device supports interleaved transfers.
    - These transfers can transfer data from a non-contiguous buffer
      to a non-contiguous buffer, as opposed to DMA_SLAVE that can
      transfer data from a non-contiguous data set to a contiguous
      destination buffer.
    - It's usually used for 2D content transfers, in which case you
      want to transfer a portion of uncompressed data directly to the
      display to display it.

These various types will also affect how the source and destination
addresses change over time.

Addresses pointing to RAM are typically incremented (or decremented)
after each transfer. In case of a ring buffer, they may loop
(DMA_CYCLIC). Addresses pointing to a device's register (e.g. a
FIFO) are typically fixed.

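From a client's point of view, the fixed device address and its
width are described through a struct dma_slave_config before
preparing the transfer; see the device_config operation below. In
this sketch, fifo_phys_addr and chan are assumptions standing for a
device FIFO address and a previously requested channel:

    struct dma_slave_config cfg = {
            /* the device FIFO sits at a fixed register address */
            .dst_addr = fifo_phys_addr,
            .dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
            .dst_maxburst = 8,
    };

    /* applies to subsequently prepared transfers only */
    dmaengine_slave_config(chan, &cfg);
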
Device operations
-----------------

Our dma_device structure also requires a few function pointers in
order to implement the actual logic, now that we have described the
operations we are able to perform.

The functions that we have to fill in there, and hence have to
implement, obviously depend on the transaction types you reported as
supported.

   * device_alloc_chan_resources
   * device_free_chan_resources
     - These functions will be called whenever a driver calls
       dma_request_channel or dma_release_channel for the first/last
       time on the channel associated with that driver.
     - They are in charge of allocating/freeing all the needed
       resources in order for that channel to be useful for your
       driver.
     - These functions can sleep.

   * device_prep_dma_*
     - These functions match the capabilities you registered
       previously.
     - These functions all take the buffer or the scatterlist
       relevant for the transfer being prepared, and should create a
       hardware descriptor or a list of hardware descriptors from it
     - These functions can be called from an interrupt context
     - Any allocation you might do should be using the GFP_NOWAIT
       flag, in order not to potentially sleep, but without depleting
       the emergency pool either.
     - Drivers should try to pre-allocate any memory they might need
       during the transfer setup at probe time to avoid putting too
       much pressure on the nowait allocator. (A condensed example is
       sketched after this list.)

     - Each of these functions should return a unique instance of the
       dma_async_tx_descriptor structure, which further represents
       this particular transfer.

     - This structure can be initialized using the function
       dma_async_tx_descriptor_init.
     - You'll also need to set two fields in this structure:
       + flags:
                TODO: Can it be modified by the driver itself, or
                should it be always the flags passed in the arguments

       + tx_submit: A pointer to a function you have to implement,
                that is supposed to push the current transaction
                descriptor to a pending queue, waiting for
                issue_pending to be called.
     - In this structure the function pointer callback_result can be
       initialized in order for the submitter to be notified that a
       transaction has completed. In earlier code the function
       pointer callback was used; however, it does not provide any
       status for the transaction and will be deprecated. The result
       structure passed to callback_result, defined as
       dmaengine_result, has two fields:
       + result: This provides the transfer result, defined by
                 dmaengine_tx_result: either success or some error
                 condition.
       + residue: Provides the residue bytes of the transfer for
                  those that support residue.

   * device_issue_pending
     - Takes the first transaction descriptor in the pending queue,
       and starts the transfer. Whenever that transfer is done, it
       should move to the next transaction in the list.
     - This function can be called in an interrupt context

   * device_tx_status
     - Should report the bytes left to go over on the given channel
     - Should only care about the transaction descriptor passed as
       argument, not the currently active one on a given channel
     - The tx_state argument might be NULL
     - Should use dma_set_residue to report it
     - In the case of a cyclic transfer, it should only take into
       account the current period.
     - This function can be called in an interrupt context.

   * device_config
     - Reconfigures the channel with the configuration given as
       argument
     - This command should NOT be applied synchronously, nor on any
       currently queued transfers, but only on subsequent ones
     - In this case, the function will receive a dma_slave_config
       structure pointer as an argument, that will detail which
       configuration to use.
     - Even though that structure contains a direction field, this
       field is deprecated in favor of the direction argument given
       to the prep_* functions
     - This call is mandatory for slave operations only. This should
       NOT be set or expected to be set for memcpy operations.
       If a driver supports both, it should use this call for slave
       operations only and not for memcpy ones.

   * device_pause
     - Pauses a transfer on the channel
     - This command should operate synchronously on the channel,
       pausing right away the work of the given channel

   * device_resume
     - Resumes a transfer on the channel
     - This command should operate synchronously on the channel,
       resuming right away the work of the given channel

   * device_terminate_all
     - Aborts all the pending and ongoing transfers on the channel
     - For aborted transfers the complete callback should not be
       called
     - Can be called from atomic context or from within a complete
       callback of a descriptor. Must not sleep. Drivers must be able
       to handle this correctly.
     - Termination may be asynchronous. The driver does not have to
       wait until the currently active transfer has completely
       stopped. See device_synchronize.

   * device_synchronize
     - Must synchronize the termination of a channel to the current
       context.
     - Must make sure that memory for previously submitted
       descriptors is no longer accessed by the DMA controller.
     - Must make sure that all complete callbacks for previously
       submitted descriptors have finished running and none are
       scheduled to run.
     - May sleep.

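To make the prep/submit split more concrete, here is the condensed,
hypothetical example referred to above: a slave scatter-gather prep
callback and its tx_submit counterpart. The foo_* names, struct
foo_desc and the simple pending list are assumptions, not a
reference implementation:

    /* foo_chan, foo_desc, to_foo_chan(), to_foo_desc() and
     * foo_build_hw_desc() are all made-up driver internals. */
    static dma_cookie_t foo_tx_submit(struct dma_async_tx_descriptor *tx)
    {
            struct foo_chan *fc = to_foo_chan(tx->chan);
            struct foo_desc *d = to_foo_desc(tx);
            dma_cookie_t cookie;
            unsigned long flags;

            spin_lock_irqsave(&fc->lock, flags);
            cookie = dma_cookie_assign(tx);
            /* Only queue it up: the transfer must not start before
             * device_issue_pending() gets called. */
            list_add_tail(&d->node, &fc->pending);
            spin_unlock_irqrestore(&fc->lock, flags);

            return cookie;
    }

    static struct dma_async_tx_descriptor *
    foo_prep_slave_sg(struct dma_chan *chan, struct scatterlist *sgl,
                      unsigned int sg_len,
                      enum dma_transfer_direction dir,
                      unsigned long flags, void *context)
    {
            struct foo_desc *d;

            /* May run in interrupt context: GFP_NOWAIT, never
             * GFP_KERNEL */
            d = kzalloc(sizeof(*d), GFP_NOWAIT);
            if (!d)
                    return NULL;

            /* Build the hardware descriptor list from the
             * scatterlist */
            if (foo_build_hw_desc(d, sgl, sg_len, dir)) {
                    kfree(d);
                    return NULL;
            }

            dma_async_tx_descriptor_init(&d->tx, chan);
            d->tx.flags = flags;
            d->tx.tx_submit = foo_tx_submit;

            return &d->tx;
    }
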
Misc notes (stuff that should be documented, but we don't really
know where to put it)
------------------------------------------------------------------
  * dma_run_dependencies
    - Should be called at the end of an async TX transfer, and can be
      ignored in the slave transfers case.
    - Makes sure that dependent operations are run before marking it
      as complete.

  * dma_cookie_t
    - It's a DMA transaction ID that will increment over time.
    - Not really relevant any more since the introduction of virt-dma
      that abstracts it away.

  * DMA_CTRL_ACK
    - If clear, the descriptor cannot be reused by the provider until
      the client acknowledges receipt, i.e. has had a chance to
      establish any dependency chains
    - This can be acked by invoking async_tx_ack()
    - If set, it does not mean the descriptor can be reused

  * DMA_CTRL_REUSE
    - If set, the descriptor can be reused after being completed. It
      should not be freed by the provider if this flag is set.
    - The descriptor should be prepared for reuse by invoking
      dmaengine_desc_set_reuse() which will set DMA_CTRL_REUSE.
    - dmaengine_desc_set_reuse() will succeed only when the channel
      supports reusable descriptors, as exhibited by its
      capabilities.
    - As a consequence, if a device driver wants to skip the
      dma_map_sg() and dma_unmap_sg() in between 2 transfers, because
      the DMA'd data wasn't used, it can resubmit the transfer right
      after its completion.
    - A descriptor can be freed in a few ways:
      - Clearing DMA_CTRL_REUSE by invoking
        dmaengine_desc_clear_reuse() and submitting it for the last
        transaction
      - Explicitly invoking dmaengine_desc_free(); this can succeed
        only when DMA_CTRL_REUSE is already set
      - Terminating the channel

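From the client side, reuse could then look like the following
sketch, where chan and desc are assumed to come from earlier
dma_request_channel() and dmaengine_prep_*() calls:

    /* Mark the descriptor as reusable; this fails if the channel
     * doesn't advertise the capability. */
    if (dmaengine_desc_set_reuse(desc))
            return -EINVAL;

    /* Resubmit the same descriptor after its completion, without
     * unmapping and remapping the buffers in between. */
    dmaengine_submit(desc);
    dma_async_issue_pending(chan);

    /* Once done with it; this only succeeds while DMA_CTRL_REUSE
     * is set. */
    dmaengine_desc_free(desc);
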
General Design Notes
--------------------

Most of the DMAEngine drivers you'll see are based on a similar
design that handles the end-of-transfer interrupts in the handler,
but defers most work to a tasklet, including the start of a new
transfer whenever the previous transfer ended.

This is a rather inefficient design though, because the
inter-transfer latency will be not only the interrupt latency, but
also the scheduling latency of the tasklet, which will leave the
channel idle in between and slow down the global transfer rate.

You should avoid this kind of practice, and instead of electing a
new transfer in your tasklet, move that part to the interrupt
handler in order to have a shorter idle window (that we can't really
avoid anyway).

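As an illustration, and assuming a driver with hypothetical foo_*
helpers and per-channel lists, the interrupt handler could elect the
next transfer itself and leave only the completion callbacks to the
tasklet:

    static irqreturn_t foo_dma_irq(int irq, void *data)
    {
            struct foo_chan *fc = data;

            spin_lock(&fc->lock);
            /* Retire the transfer that just finished */
            list_move_tail(&fc->active->node, &fc->completed);
            fc->active = NULL;

            /* Start the next transfer right here, keeping the
             * channel idle window as short as possible */
            if (!list_empty(&fc->pending))
                    foo_start_transfer(fc);
            spin_unlock(&fc->lock);

            /* The completion callbacks can wait for the tasklet */
            tasklet_schedule(&fc->task);

            return IRQ_HANDLED;
    }
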
Glossary
--------

Burst:          A number of consecutive read or write operations
                that can be queued to buffers before being flushed
                to memory.
Chunk:          A contiguous collection of bursts
Transfer:       A collection of chunks (be it contiguous or not)