Netanel Belgazal | 1738cd3 | 2016-08-10 14:03:22 +0300 | 1 | Linux kernel driver for Elastic Network Adapter (ENA) family: |
| 2 | ============================================================= |
| 3 | |
| 4 | Overview: |
| 5 | ========= |
| 6 | ENA is a networking interface designed to make good use of modern CPU |
| 7 | features and system architectures. |
| 8 | |
| 9 | The ENA device exposes a lightweight management interface with a |
| 10 | minimal set of memory-mapped registers and an extendable command set |
| 11 | through an Admin Queue. |
| 12 | |
| 13 | The driver supports a range of ENA devices, is link-speed independent |
| 14 | (i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has |
| 15 | a negotiated and extendable feature set. |
| 16 | |
| 17 | Some ENA devices support SR-IOV. This driver is used for both the |
| 18 | SR-IOV Physical Function (PF) and Virtual Function (VF) devices. |
| 19 | |
| 20 | ENA devices enable high speed and low overhead network traffic |
| 21 | processing by providing multiple Tx/Rx queue pairs (the maximum number |
| 22 | is advertised by the device via the Admin Queue), a dedicated MSI-X |
| 23 | interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, |
| 24 | and CPU cacheline optimized data placement. |
| 25 | |
| 26 | The ENA driver supports industry standard TCP/IP offload features such |
| 27 | as checksum offload and TCP transmit segmentation offload (TSO). |
| 28 | Receive-side scaling (RSS) is supported for multi-core scaling. |
| 29 | |
| 30 | The ENA driver and its corresponding devices implement health |
| 31 | monitoring mechanisms such as watchdog, enabling the device and driver |
| 32 | to recover in a manner transparent to the application, as well as |
| 33 | debug logs. |
| 34 | |
| 35 | Some of the ENA devices support a working mode called Low-latency |
| 36 | Queue (LLQ), which reduces latency by several microseconds. |
| 37 | |
| 38 | Supported PCI vendor ID/device IDs: |
| 39 | =================================== |
| 40 | 1d0f:0ec2 - ENA PF |
| 41 | 1d0f:1ec2 - ENA PF with LLQ support |
| 42 | 1d0f:ec20 - ENA VF |
| 43 | 1d0f:ec21 - ENA VF with LLQ support |
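
As an illustration, the ID list above can be written as a small lookup
table. This is a minimal user-space sketch, not the driver's actual
table: struct ena_id_entry and ena_lookup_id() are hypothetical names,
and the real driver is expected to use the kernel's PCI device ID
machinery instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENA_PCI_VENDOR_ID 0x1d0f

/* Hypothetical table entry, for illustration only. */
struct ena_id_entry {
	uint16_t device;  /* PCI device ID */
	bool is_vf;       /* Virtual Function (true) or Physical Function */
	bool has_llq;     /* device advertises LLQ support */
};

static const struct ena_id_entry ena_id_table[] = {
	{ 0x0ec2, false, false }, /* ENA PF */
	{ 0x1ec2, false, true  }, /* ENA PF with LLQ support */
	{ 0xec20, true,  false }, /* ENA VF */
	{ 0xec21, true,  true  }, /* ENA VF with LLQ support */
};

/* Returns the matching entry, or NULL if the device is not an ENA device. */
static const struct ena_id_entry *ena_lookup_id(uint16_t vendor,
						uint16_t device)
{
	size_t i;

	if (vendor != ENA_PCI_VENDOR_ID)
		return NULL;

	for (i = 0; i < sizeof(ena_id_table) / sizeof(ena_id_table[0]); i++)
		if (ena_id_table[i].device == device)
			return &ena_id_table[i];

	return NULL;
}
```

For example, ena_lookup_id(0x1d0f, 0xec21) returns the VF entry with
LLQ support, while an unknown vendor or device ID returns NULL.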
| 44 | |
| 45 | ENA Source Code Directory Structure: |
| 46 | ==================================== |
| 47 | ena_com.[ch] - Management communication layer. This layer is |
| 48 | responsible for handling all the management |
| 49 | (admin) communication between the device and the |
| 50 | driver. |
| 51 | ena_eth_com.[ch] - Tx/Rx data path. |
| 52 | ena_admin_defs.h - Definition of ENA management interface. |
| 53 | ena_eth_io_defs.h - Definition of ENA data path interface. |
| 54 | ena_common_defs.h - Common definitions for ena_com layer. |
| 55 | ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. |
| 56 | ena_netdev.[ch] - Main Linux kernel driver. |
| 57 | ena_sysfs.[ch] - Sysfs files. |
| 58 | ena_ethtool.c - ethtool callbacks. |
| 59 | ena_pci_id_tbl.h - Supported device IDs. |
| 60 | |
| 61 | Management Interface: |
| 62 | ===================== |
| 63 | The ENA management interface is exposed by means of: |
| 64 | - PCIe Configuration Space |
| 65 | - Device Registers |
| 66 | - Admin Queue (AQ) and Admin Completion Queue (ACQ) |
| 67 | - Asynchronous Event Notification Queue (AENQ) |
| 68 | |
| 69 | ENA device MMIO Registers are accessed only during driver |
| 70 | initialization and are not involved in further normal device |
| 71 | operation. |
| 72 | |
| 73 | AQ is used for submitting management commands, and the |
| 74 | results/responses are reported asynchronously through ACQ. |
| 75 | |
| 76 | ENA introduces a very small set of management commands with room for |
| 77 | vendor-specific extensions. Most of the management operations are |
| 78 | framed in a generic Get/Set feature command. |
| 79 | |
| 80 | The following admin queue commands are supported: |
| 81 | - Create I/O submission queue |
| 82 | - Create I/O completion queue |
| 83 | - Destroy I/O submission queue |
| 84 | - Destroy I/O completion queue |
| 85 | - Get feature |
| 86 | - Set feature |
| 87 | - Configure AENQ |
| 88 | - Get statistics |
| 89 | |
| 90 | Refer to ena_admin_defs.h for the list of supported Get/Set Feature |
| 91 | properties. |
| 92 | |
| 93 | The Asynchronous Event Notification Queue (AENQ) is a uni-directional |
| 94 | queue used by the ENA device to send to the driver events that cannot |
| 95 | be reported using ACQ. AENQ events are subdivided into groups. Each |
| 96 | group may have multiple syndromes, as shown below: |
| 97 | |
| 98 | The events are: |
| 99 |     Group                Syndrome |
| 100 |     Link state change    - X - |
| 101 |     Fatal error          - X - |
| 102 |     Notification         Suspend traffic |
| 103 |     Notification         Resume traffic |
| 104 |     Keep-Alive           - X - |
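
The group/syndrome pairs above lend themselves to a two-level dispatch.
The sketch below is only an illustration: every enum name and value is
hypothetical (the real definitions are in ena_admin_defs.h), and the
actions are placeholders, since this document does not specify how the
driver reacts to each event beyond Keep-Alive.

```c
/* Hypothetical identifiers; the real values live in ena_admin_defs.h. */
enum aenq_group {
	AENQ_GROUP_LINK_CHANGE,
	AENQ_GROUP_FATAL_ERROR,
	AENQ_GROUP_NOTIFICATION,
	AENQ_GROUP_KEEP_ALIVE,
};

enum aenq_syndrome {
	AENQ_SYNDROME_NONE,     /* groups marked "- X -" carry no syndrome */
	AENQ_SYNDROME_SUSPEND,  /* Notification: suspend traffic */
	AENQ_SYNDROME_RESUME,   /* Notification: resume traffic */
};

/* Placeholder actions, named after the event that triggers them. */
enum aenq_action {
	ACT_IGNORE,
	ACT_LINK_STATE_CHANGED,
	ACT_REPORT_FATAL,
	ACT_SUSPEND_TRAFFIC,
	ACT_RESUME_TRAFFIC,
	ACT_REARM_WATCHDOG,
};

/* Select on the group first, then on the syndrome where the group has one. */
static enum aenq_action aenq_dispatch(enum aenq_group group,
				      enum aenq_syndrome syndrome)
{
	switch (group) {
	case AENQ_GROUP_LINK_CHANGE:
		return ACT_LINK_STATE_CHANGED;
	case AENQ_GROUP_FATAL_ERROR:
		return ACT_REPORT_FATAL;
	case AENQ_GROUP_KEEP_ALIVE:
		return ACT_REARM_WATCHDOG;
	case AENQ_GROUP_NOTIFICATION:
		if (syndrome == AENQ_SYNDROME_SUSPEND)
			return ACT_SUSPEND_TRAFFIC;
		if (syndrome == AENQ_SYNDROME_RESUME)
			return ACT_RESUME_TRAFFIC;
		return ACT_IGNORE;
	}
	return ACT_IGNORE;
}
```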
| 105 | |
| 106 | ACQ and AENQ share the same MSI-X vector. |
| 107 | |
| 108 | Keep-Alive is a special mechanism that allows monitoring of the |
| 109 | device's health. The driver maintains a watchdog (WD) handler which, |
| 110 | if fired, logs the current state and statistics, then resets and |
| 111 | restarts the ENA device and driver. A Keep-Alive event is delivered by |
| 112 | the device every second. The driver re-arms the WD upon reception of a |
| 113 | Keep-Alive event. A missed Keep-Alive event causes the WD handler to |
| 114 | fire. |
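
The Keep-Alive/watchdog interaction can be modelled in a few lines. The
following is a user-space sketch under stated assumptions: struct
ena_wd and the wd_*() helpers are hypothetical, time is passed in
explicitly as milliseconds, and the timeout is left to the caller
because this document does not state how long the driver waits before
treating a Keep-Alive as missed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state, for illustration only. */
struct ena_wd {
	uint64_t last_keep_alive_ms; /* time of the last Keep-Alive event */
	uint64_t timeout_ms;         /* assumed value, chosen by the caller */
	unsigned int resets;         /* number of times the watchdog fired */
};

static void wd_init(struct ena_wd *wd, uint64_t now_ms, uint64_t timeout_ms)
{
	wd->last_keep_alive_ms = now_ms;
	wd->timeout_ms = timeout_ms;
	wd->resets = 0;
}

/* Called when a Keep-Alive AENQ event arrives: re-arm the watchdog. */
static void wd_keep_alive(struct ena_wd *wd, uint64_t now_ms)
{
	wd->last_keep_alive_ms = now_ms;
}

/*
 * Called periodically. Returns true if the watchdog fired, which is
 * where a driver would log state and statistics and reset the device.
 */
static bool wd_check(struct ena_wd *wd, uint64_t now_ms)
{
	if (now_ms - wd->last_keep_alive_ms <= wd->timeout_ms)
		return false;

	wd->resets++;
	/* Restart monitoring from the moment of the (simulated) reset. */
	wd->last_keep_alive_ms = now_ms;
	return true;
}
```

With a 1500 ms timeout and a Keep-Alive at t=1000 ms, wd_check()
returns false at t=2400 ms and true at t=2600 ms.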
| 115 | |
| 116 | Data Path Interface: |
| 117 | ==================== |
| 118 | I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx |
| 119 | SQ respectively). Each SQ has a completion queue (CQ) associated |
| 120 | with it. |
| 121 | |
| 122 | The SQs and CQs are implemented as descriptor rings in contiguous |
| 123 | physical memory. |
| 124 | |
| 125 | The ENA driver supports two Queue Operation modes for Tx SQs: |
| 126 | - Regular mode |
| 127 | * In this mode the Tx SQs reside in the host's memory. The ENA |
| 128 | device fetches the ENA Tx descriptors and packet data from host |
| 129 | memory. |
| 130 | - Low Latency Queue (LLQ) mode or "push-mode". |
| 131 | * In this mode the driver pushes the transmit descriptors and the |
| 132 | first 128 bytes of the packet directly to the ENA device memory |
| 133 | space. The rest of the packet payload is fetched by the |
| 134 | device. For this operation mode, the driver uses a dedicated PCI |
| 135 | device memory BAR, which is mapped with write-combine capability. |
| 136 | |
| 137 | The Rx SQs support only the regular mode. |
| 138 | |
| 139 | Note: Not all ENA devices support LLQ, and this feature is negotiated |
| 140 | with the device upon initialization. If the ENA device does not |
| 141 | support LLQ mode, the driver falls back to the regular mode. |
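
To make the difference between the two modes concrete, the sketch below
computes how many bytes of a packet are pushed by the driver and how
many are left for the device to fetch. The function and struct names
are hypothetical; only the 128-byte figure and the fallback to regular
mode come from the text above.

```c
#include <stdbool.h>
#include <stddef.h>

#define ENA_LLQ_PUSH_BYTES 128 /* "first 128 bytes of the packet" */

/* Hypothetical result type, for illustration only. */
struct tx_split {
	size_t pushed;  /* bytes written by the driver to device memory */
	size_t fetched; /* bytes the device fetches from host memory */
};

static struct tx_split ena_tx_split(size_t pkt_len, bool llq_negotiated)
{
	/* Regular mode: the device fetches the whole packet. */
	struct tx_split s = { 0, pkt_len };

	if (llq_negotiated) {
		s.pushed = pkt_len < ENA_LLQ_PUSH_BYTES ?
			   pkt_len : ENA_LLQ_PUSH_BYTES;
		s.fetched = pkt_len - s.pushed;
	}
	return s;
}
```

A 1500-byte packet in LLQ mode is split into 128 pushed bytes and 1372
fetched bytes; without LLQ all 1500 bytes are fetched by the device.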
| 142 | |
| 143 | The driver supports multi-queue for both Tx and Rx. This has various |
| 144 | benefits: |
| 145 | - Reduced CPU/thread/process contention on a given Ethernet interface. |
| 146 | - Cache miss rate on completion is reduced, particularly for data |
| 147 | cache lines that hold the sk_buff structures. |
| 148 | - Increased process-level parallelism when handling received packets. |
| 149 | - Increased data cache hit rate, by steering kernel processing of |
| 150 | packets to the CPU where the application thread consuming the |
| 151 | packet is running. |
| 152 | - In-hardware interrupt redirection. |
| 153 | |
| 154 | Interrupt Modes: |
| 155 | ================ |
| 156 | The driver assigns a single MSI-X vector per queue pair (for both Tx |
| 157 | and Rx directions). The driver assigns an additional dedicated MSI-X vector |
| 158 | for management (for ACQ and AENQ). |
| 159 | |
| 160 | Management interrupt registration is performed when the Linux kernel |
| 161 | probes the adapter, and it is de-registered when the adapter is |
| 162 | removed. I/O queue interrupt registration is performed when the Linux |
| 163 | interface of the adapter is opened, and it is de-registered when the |
| 164 | interface is closed. |
| 165 | |
| 166 | The management interrupt is named: |
| 167 | ena-mgmnt@pci:<PCI domain:bus:slot.function> |
| 168 | and for each queue pair, an interrupt is named: |
| 169 | <interface name>-Tx-Rx-<queue index> |
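
The two naming patterns above can be produced with snprintf(). This is
an illustrative sketch: the helper names are hypothetical, and the
zero-padded hexadecimal layout of the PCI address is an assumption
based on the usual Linux "dddd:bb:ss.f" notation.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats "ena-mgmnt@pci:<PCI domain:bus:slot.function>". */
static int ena_mgmnt_irq_name(char *buf, size_t len, unsigned int domain,
			      unsigned int bus, unsigned int slot,
			      unsigned int func)
{
	return snprintf(buf, len, "ena-mgmnt@pci:%04x:%02x:%02x.%x",
			domain, bus, slot, func);
}

/* Formats "<interface name>-Tx-Rx-<queue index>". */
static int ena_io_irq_name(char *buf, size_t len, const char *ifname,
			   unsigned int qid)
{
	return snprintf(buf, len, "%s-Tx-Rx-%u", ifname, qid);
}
```

On a running system these names are what one would expect to find in
/proc/interrupts for the management vector and for each queue pair.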
| 170 | |
| 171 | The ENA device operates in auto-mask and auto-clear interrupt |
| 172 | modes. That is, once an MSI-X interrupt is delivered to the host, its Cause bit is |
| 173 | automatically cleared and the interrupt is masked. The interrupt is |
| 174 | unmasked by the driver after NAPI processing is complete. |
| 175 | |
| 176 | Interrupt Moderation: |
| 177 | ===================== |
| 178 | The ENA driver and device can operate in conventional or adaptive interrupt |
| 179 | moderation mode. |
| 180 | |
| 181 | In conventional mode, the driver instructs the device to postpone interrupt |
| 182 | posting according to a static interrupt delay value. The interrupt delay |
| 183 | value can be configured through ethtool(8). The following ethtool |
| 184 | parameters are supported by the driver: tx-usecs, rx-usecs. |
| 185 | |
| 186 | In adaptive interrupt moderation mode, the interrupt delay value is |
| 187 | updated by the driver dynamically and adjusted every NAPI cycle |
| 188 | according to the nature of the traffic. |
| 189 | |
| 190 | By default, the ENA driver applies adaptive coalescing on Rx traffic and |
| 191 | conventional coalescing on Tx traffic. |
| 192 | |
| 193 | Adaptive coalescing can be switched on/off through the ethtool(8) |
| 194 | adaptive-rx on|off parameter. |
| 195 | |
| 196 | The driver chooses the interrupt delay value according to the number of |
| 197 | bytes and packets received between interrupt unmasking and interrupt |
| 198 | posting. The driver uses an interrupt delay table that subdivides the |
| 199 | range of received bytes/packets into 5 levels and assigns an interrupt |
| 200 | delay value to each level. |
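
A five-level delay table of this kind might be looked up as sketched
below. All thresholds and delay values are made up for illustration and
are not the driver's defaults; struct intr_level and ena_pick_delay()
are hypothetical names, and the rule that both the packet and the byte
count must fit within a level is likewise an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define ENA_INTR_LEVELS 5

/* Hypothetical table row: upper bounds for a level and its delay. */
struct intr_level {
	uint32_t max_pkts;
	uint32_t max_bytes;
	uint32_t delay_usecs;
};

/* Illustrative values only, NOT the driver's default table. */
static const struct intr_level intr_table[ENA_INTR_LEVELS] = {
	{ 4,          6144,       0   }, /* lowest traffic */
	{ 16,         24576,      32  },
	{ 64,         98304,      64  },
	{ 128,        196608,     96  },
	{ UINT32_MAX, UINT32_MAX, 128 }, /* highest traffic */
};

/*
 * Picks the first level whose packet and byte limits both cover the
 * traffic seen since the interrupt was unmasked; the last level
 * catches everything else.
 */
static uint32_t ena_pick_delay(uint32_t pkts, uint32_t bytes)
{
	size_t i;

	for (i = 0; i < ENA_INTR_LEVELS - 1; i++)
		if (pkts <= intr_table[i].max_pkts &&
		    bytes <= intr_table[i].max_bytes)
			break;

	return intr_table[i].delay_usecs;
}
```

With this sample table, a quiet interval (1 packet, 100 bytes) selects
no delay, while a burst of 1000 packets selects the largest delay.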
| 201 | |
| 202 | The user can enable/disable adaptive moderation, modify the interrupt |
| 203 | delay table and restore its default values through sysfs. |
| 204 | |
| 205 | The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK |
| 206 | and can be configured by the ETHTOOL_STUNABLE command of the |
| 207 | SIOCETHTOOL ioctl. |
| 208 | |
| 209 | SKB: |
| 210 | The driver allocates an SKB for each frame received by the Rx handler in |
| 211 | NAPI context. The allocation method depends on the size of the packet. |
| 212 | If the frame length is larger than rx_copybreak, napi_get_frags() |
| 213 | is used; otherwise netdev_alloc_skb_ip_align() is used, the buffer |
| 214 | content is copied (by the CPU) to the SKB, and the buffer is recycled. |
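
The size-based choice described above reduces to a single comparison.
The sketch below models only that decision in user space; the enum and
function names are hypothetical, and no kernel allocation function is
actually called.

```c
#include <stddef.h>

/* Hypothetical labels for the two allocation paths described above. */
enum ena_rx_alloc {
	ENA_RX_COPY_SMALL,   /* netdev_alloc_skb_ip_align() + copy, buffer recycled */
	ENA_RX_ATTACH_FRAGS, /* napi_get_frags(), Rx buffer handed to the SKB */
};

static enum ena_rx_alloc ena_rx_alloc_method(size_t frame_len,
					     size_t rx_copybreak)
{
	/* Frames larger than rx_copybreak are attached, not copied. */
	return frame_len > rx_copybreak ? ENA_RX_ATTACH_FRAGS
					: ENA_RX_COPY_SMALL;
}
```

The trade-off is the usual copybreak one: copying a small frame costs a
few CPU cycles but lets the driver reuse the Rx buffer without
unmapping it and allocating a replacement.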
| 215 | |
| 216 | Statistics: |
| 217 | =========== |
| 218 | The user can obtain ENA device and driver statistics using ethtool. |
| 219 | The driver can collect regular or extended statistics (including |
| 220 | per-queue stats) from the device. |
| 221 | |
| 222 | In addition, the driver logs the stats to syslog upon device reset. |
| 223 | |
| 224 | MTU: |
| 225 | ==== |
| 226 | The driver supports an arbitrarily large MTU with a maximum that is |
| 227 | negotiated with the device. The driver configures MTU using the |
| 228 | SetFeature command (ENA_ADMIN_MTU property). The user can change MTU |
| 229 | via ip(8) or legacy tools such as ifconfig(8). |
| 230 | |
| 231 | Stateless Offloads: |
| 232 | =================== |
| 233 | The ENA driver supports: |
| 234 | - TSO over IPv4/IPv6 |
| 235 | - TSO with ECN |
| 236 | - IPv4 header checksum offload |
| 237 | - TCP/UDP over IPv4/IPv6 checksum offloads |
| 238 | |
| 239 | RSS: |
| 240 | ==== |
| 241 | - The ENA device supports RSS that allows flexible Rx traffic |
| 242 | steering. |
| 243 | - Toeplitz and CRC32 hash functions are supported. |
| 244 | - Different combinations of L2/L3/L4 fields can be configured as |
| 245 | inputs for hash functions. |
| 246 | - The driver configures RSS settings using the AQ SetFeature command |
| 247 | (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and |
| 248 | ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). |
| 249 | - If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash |
| 250 | function delivered in the Rx CQ descriptor is set in the received |
| 251 | SKB. |
| 252 | - The user can provide a hash key, hash function, and configure the |
| 253 | indirection table through ethtool(8). |
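
For readers unfamiliar with RSS, the sketch below shows a generic
software Toeplitz hash and an indirection table lookup. It illustrates
the general technique only, not the ENA device's implementation: the
function names are hypothetical, and reducing the hash with a modulo is
an assumption (hardware commonly masks the low bits of the hash
instead).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Generic Toeplitz hash: for every set bit of the input (MSB first),
 * XOR in the 32-bit window of the key that starts at that bit position.
 * Key bytes beyond key_len are treated as zero.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	if (key_len < 4)
		return 0;

	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < data_len; i++) {
		/* Key byte whose bits slide into the window for this input byte. */
		uint8_t next = (i + 4 < key_len) ? key[i + 4] : 0;

		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			window = (window << 1) | ((uint32_t)(next >> bit) & 1u);
		}
	}
	return hash;
}

/* Maps a hash value to an Rx queue through the indirection table. */
static unsigned int rss_pick_queue(uint32_t hash, const uint8_t *table,
				   size_t table_size)
{
	return table[hash % table_size];
}
```

The hash input would be the concatenation of whichever header fields
are configured (for example source and destination addresses and
ports), and the indirection table is what ethtool(8) lets the user
rewrite to rebalance flows across queues.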
| 254 | |
| 255 | DATA PATH: |
| 256 | ========== |
| 257 | Tx: |
| 258 | --- |
| 259 | ena_start_xmit() is called by the stack. This function does the following: |
| 260 | - Maps data buffers (skb->data and frags). |
| 261 | - Populates ena_buf for the push buffer (if the driver and device are |
| 262 | in push mode). |
| 263 | - Prepares ENA bufs for the remaining frags. |
| 264 | - Allocates a new request ID from the empty req_id ring. The request |
| 265 | ID is the index of the packet in the Tx info. This is used for |
| 266 | out-of-order TX completions. |
| 267 | - Adds the packet to the proper place in the Tx ring. |
| 268 | - Calls ena_com_prepare_tx(), an ENA communication layer function that converts |
| 269 | the ena_bufs to ENA descriptors (and adds meta ENA descriptors as |
| 270 | needed). |
| 271 | * This function also copies the ENA descriptors and the push buffer |
| 272 | to the device memory space (if in push mode). |
| 273 | - Writes doorbell to the ENA device. |
| 274 | - When the ENA device finishes sending the packet, a completion |
| 275 | interrupt is raised. |
| 276 | - The interrupt handler schedules NAPI. |
| 277 | - The ena_clean_tx_irq() function is called. This function handles the |
| 278 | completion descriptors generated by the ENA, with a single |
| 279 | completion descriptor per completed packet. |
| 280 | * req_id is retrieved from the completion descriptor. The tx_info of |
| 281 | the packet is retrieved via the req_id. The data buffers are |
| 282 | unmapped and req_id is returned to the empty req_id ring. |
| 283 | * The function stops when all completion descriptors have been processed or |
| 284 | the budget is reached. |
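
The request ID bookkeeping in the steps above can be sketched as a ring
of free IDs. This is a simplified user-space model, not the driver's
code: struct tx_ring, the helper names, and the ring size are all
hypothetical. It shows why completions may arrive out of order: a
returned ID is simply stored in the next clean slot, wherever that ID
originally came from.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE 8 /* hypothetical size; must be a power of two */

/* Hypothetical ring of free request IDs, for illustration only. */
struct tx_ring {
	uint16_t free_ids[TX_RING_SIZE];
	uint16_t next_to_use;   /* slot the next req_id is taken from */
	uint16_t next_to_clean; /* slot a completed req_id is returned to */
	uint16_t in_flight;     /* packets submitted but not yet completed */
};

static void tx_ring_init(struct tx_ring *r)
{
	uint16_t i;

	for (i = 0; i < TX_RING_SIZE; i++)
		r->free_ids[i] = i;
	r->next_to_use = 0;
	r->next_to_clean = 0;
	r->in_flight = 0;
}

/* Transmit side: take a free req_id; fails when the ring is full. */
static bool tx_alloc_req_id(struct tx_ring *r, uint16_t *req_id)
{
	if (r->in_flight == TX_RING_SIZE)
		return false;

	*req_id = r->free_ids[r->next_to_use];
	r->next_to_use = (r->next_to_use + 1) & (TX_RING_SIZE - 1);
	r->in_flight++;
	return true;
}

/* Completion side: return the req_id read from a completion descriptor. */
static void tx_complete_req_id(struct tx_ring *r, uint16_t req_id)
{
	r->free_ids[r->next_to_clean] = req_id;
	r->next_to_clean = (r->next_to_clean + 1) & (TX_RING_SIZE - 1);
	r->in_flight--;
}
```

In this model the req_id would also index a per-packet info array
holding the SKB and DMA mappings, so the completion handler can unmap
the buffers without scanning the ring.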
| 285 | |
| 286 | Rx: |
| 287 | --- |
| 288 | - A packet is received from the ENA device and an interrupt is raised. |
| 289 | - The interrupt handler schedules NAPI. |
| 290 | - The ena_clean_rx_irq() function is called. This function calls |
| 291 | ena_com_rx_pkt(), an ENA communication layer function, which returns the |
| 292 | number of descriptors used for a new unhandled packet, and zero if |
| 293 | no new packet is found. |
| 294 | - Then it calls the ena_rx_skb() function. |
| 295 | - ena_rx_skb() checks packet length: |
| 296 | * If the packet is small (len <= rx_copybreak), the driver allocates |
| 297 | an SKB for the new packet, and copies the packet payload into the |
| 298 | SKB data buffer. |
| 299 | - In this way the original data buffer is not passed to the stack |
| 300 | and is reused for future Rx packets. |
| 301 | * Otherwise the function unmaps the Rx buffer, then allocates the |
| 302 | new SKB structure and hooks the Rx buffer to the SKB frags. |
| 303 | - The new SKB is updated with the necessary information (protocol, |
| 304 | checksum hw verify result, etc.), and then passed to the network |
| 305 | stack, using the NAPI interface function napi_gro_receive(). |