HISTORY:
February 16/2002 -- revision 0.2.1:
  COR typo corrected
February 10/2002 -- revision 0.2:
  some spell checking ;->
January 12/2002 -- revision 0.1
  This is still work in progress so may change.
  To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362    17       27622    7        6823
  128    758150     464364    21        9301    10       7738
  256    445632     774646    42       15507    21      12906
  512    232666     994445   241292    19147   241192    1062
 1024    119061    1000003   872519    19258   872511       0
 1440     85193    1000003   946576    19505   946569       0


Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
	packets out of the rx ring. Note from this that the lower the
	load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
	the load, the more often we could not clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts, or at least the interrupt events
that cause packets to be sent up the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also, MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
	I) What is known as Clear-on-read (COR):
	when you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move everything to dev->poll().

	II) Clear-on-write (COW):
	i) You clear the status by writing a 1 in the bit location you
		want cleared.
		These are the majority of the NICs and they work best with
		NAPI. Put only receive events in dev->poll(); leave the rest
		in the old interrupt handler.
	ii) Whatever you write in the status register clears everything ;->
		We can't seem to find any chip supported by Linux which does
		this. If someone knows of such a chip, please email us.
		Move everything to dev->poll().

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition,
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion of it.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
  only one CPU can pick up the initial interrupt and hence the initial
  netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
  This implies receive is totally lockless because of the guarantee that only
  one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
  ring. This happens only in close() and suspend() (when these methods
  try to clean the rx ring);
  ****guarantee: driver authors need not worry about this; synchronization
  is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
  For example, link/MII and txcomplete interrupts continue to function the
  same old way. This improves the latency of processing these events. It is
  also assumed that the receive interrupt is the largest cause of noise.
  Note this might not always be true.
  [According to Manfred Spraul, the winbond insists on sending one
  txmitcomplete interrupt for each packet (although this can be mitigated)].
  For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
	Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
	Puts the device in a state which allows it to be added to the
	CPU polling list if it is up and running. You can look at this as
	the first half of netif_rx_schedule(dev) above; the second half
	being c) below.

c) __netif_rx_schedule(dev)
	Adds the device to the poll list for this CPU, assuming that _prep
	above has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
	Called to reschedule polling for the device, specifically for some
	deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
	Removes the interface from the CPU poll list: it must be on the poll
	list of the current CPU. This primitive is called by dev->poll() when
	it completes its work. The device cannot be off the poll list at this
	call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.
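
As a quick illustration of how a), b) and c) relate, here is a minimal
sketch of the receive-interrupt path (the mask_rx_ints_somehow() helper is
a hypothetical placeholder, not part of the API):

---------------------------------------------------------------------
	/* two-step form: lets us mask rx interrupts between the steps */
	if (netif_rx_schedule_prep(dev)) {
		/* device is up and was not already on the poll list */
		mask_rx_ints_somehow();		/* hypothetical helper */
		__netif_rx_schedule(dev);	/* add to this CPU's poll list */
	}

	/* ...which, without the masking step in between, is equivalent to
	 * the one-shot form: */
	netif_rx_schedule(dev);
---------------------------------------------------------------------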

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets to the stack on the current CPU before yielding to the
network subsystem (so other devices also get an opportunity to send to
the stack).

The dev->poll() prototype looks as follows:
	int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver simply sends
up to the requested number of packets to the stack, if it has them.

More on dev->poll() below, after the interrupt changes are explained.
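
As a rough sketch of the bookkeeping this contract implies (the complete
worked example is developed in section 4; do_rx_work() and
more_rx_work_pending() are illustrative placeholders, not real helpers):

---------------------------------------------------------------------
static int my_poll(struct net_device *dev, int *budget)
{
	int limit = (dev->quota < *budget) ? dev->quota : *budget;
	int received = do_rx_work(dev, limit);	/* packets pushed up the stack */

	dev->quota -= received;
	*budget -= received;

	if (more_rx_work_pending(dev))
		return 1;			/* not done: stay on the poll list */

	netif_rx_complete(dev);			/* done: remove us from the poll list */
	enable_rx_and_rxnobuff_ints();		/* re-enable packet interrupts */
	return 0;
}
---------------------------------------------------------------------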

2) registering dev->poll() method
=================================

dev->poll should be set in the dev->probe() method.
e.g.:
	dev->open = my_open;
	.
	.
	/* two new additions */
	/* first register my poll method */
	dev->poll = my_poll;
	/* next register my weight/quanta; can be overridden in /proc */
	dev->weight = 16;
	.
	.
	dev->stop = my_close;


3) scheduling dev->poll()
=========================
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical Donald Becker
interrupt handler:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{

	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;
	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  /* don't ack rx and rxnobufs here */
/************************ end note ***********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell the system we have work to be done */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note ***********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note ***********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from the above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) and b) a packet arriving and finding no DMA buffers
available (rxnobufs).
This means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI, namely in poll() and refill_rx_ring(),
both discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in a running state and
was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (rather than not running;
the logic being that you shouldn't get an interrupt if the device is not
running). We rectify this by disabling rx and rxnobufs interrupts.
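
As an illustration of what acknowledge_ints_ASAP() and
disable_rx_and_rxnobuff_ints() might look like on a write-1-to-clear
(COW type i) chip, here is a minimal sketch; the register offsets, bit
names and the tp->ioaddr field are hypothetical, not taken from any real
driver:

---------------------------------------------------------------------
#define INTR_STATUS	0x10	/* hypothetical: write 1 to a bit to clear it */
#define INTR_MASK	0x14	/* hypothetical: 1 == interrupt source enabled */
#define RX_INT		0x0001
#define RX_NOBUFS	0x0002

static void acknowledge_ints_ASAP(struct my_private *tp, u32 status)
{
	/* ack everything EXCEPT the rx sources; those bits stay set
	 * until poll()/refill_rx_ring() have actually done the work */
	writel(status & ~(RX_INT | RX_NOBUFS), tp->ioaddr + INTR_STATUS);
}

static void disable_rx_and_rxnobuff_ints(struct my_private *tp)
{
	/* mask the packet-arrival sources; other sources stay enabled */
	writel(readl(tp->ioaddr + INTR_MASK) & ~(RX_INT | RX_NOBUFS),
	       tp->ioaddr + INTR_MASK);
}
---------------------------------------------------------------------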

II) receive_packets(dev) and make_rx_buffs_avail() may seem to have
disappeared. These functionalities are actually still around:
in fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
==============================================

We need to convert the classical Donald Becker receive_packets(dev)
to my_poll().

First, the typical receive_packets() is shown below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets (struct net_device *dev)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err (rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb (pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx (skb);
			.
			.
		} else {  /* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.

}
-------------------------------------------------------------------
We change it to the new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	/* maximum packets to send to the stack */
/************************ note *********************************/
	int rx_work_limit = dev->quota;

/************************ end note *********************************/
	do {	/* outer loop starts here */

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;
			/* read size+status of next frame from DMA ring buffer */
			/* the numbers 16 and 4 are just examples */
			rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err (rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note **********************************/

			/* grab a skb */
			skb = dev_alloc_skb (pkt_size + 2);
			if (skb) {
				.
				.
/************************ note *********************************/
				netif_receive_skb (skb);
/********************** end note **********************************/
				.
				.
			} else {  /* OOM */
				/* seems very driver specific ... common is to just
				   pass whatever is on the ring already. */
			}

			/* move to the next skb on the ring */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++;

		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked */
		status = read_interrupt_status_reg();
		if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on the status
		   bits, they'll be caught by the while check and we go back
		   and clear them, since we haven't exceeded our quota */
	} while (rx_status_is_set);

done:

/************************ note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If the RX ring is not full, we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuff_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which are
	 *    done to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq is raised after the beginning of the outer loop
	 *    (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0; /* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1; /* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0; /* we'll take it from here, so tell the core we are "done" */

/************************ End note *********************************/
}
-------------------------------------------------------------------

From the above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobufs when
   it does the work.
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do { } while loop has been added. This serves the purpose of
   ensuring that, if a new packet came in after we thought we were all done
   and we have not exceeded our quota, we continue sending packets up.

-----------------------------------------------------------
The poll timer code will need to do the following:

a)
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is still not full, we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */
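
Pulling the fragment above together, the oom timer callback might look
roughly like the sketch below. This is an assumption-level sketch only:
the tp->oom_timer field, the HZ/10 retry period and the idea that
start_poll_timer() armed a struct timer_list with the net_device as its
data argument are all illustrative, not taken from a real driver.

---------------------------------------------------------------------
static void my_oom_timer(unsigned long data)
{
	struct net_device *dev = (struct net_device *)data;
	struct my_private *tp = (struct my_private *)dev->priv;

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		mod_timer(&tp->oom_timer, jiffies + HZ/10); /* still OOM; retry later */
	else
		netif_rx_schedule(dev);	/* memory is back; resume polling */
}
---------------------------------------------------------------------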

5) dev->close() and dev->suspend() issues
=========================================
The driver writer needn't worry about these; the top net layer takes
care of them.

6) Adding new stats to /proc
============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet storm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution, but we have found it to not be
much of a commodity feature (in both NICs and switches) and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that with FC it is much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI, but
is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here.

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobufs is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally, the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be turned off
(but CSR5 will continue to be turned on with new packet arrivals, even if
we clear it the first time).
Very importantly, if we turn the interrupt-enable bit on while the status
bit is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we enable
interrupts there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> an irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
Look at the last if statement: you have just finished grabbing all the
packets from the rx ring, and you check whether the status bit says there
are more packets just in; it says none. You then re-enable rx interrupts.
If a new packet just came in during this check, we are counting on the fact
that CSR5 will be set in that small window of opportunity and that, by
re-enabling interrupts, we would actually trigger an interrupt to register
the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and to re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts();
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs();
		goto restart_poll;
	}
---------

Basically, netif_rx_complete() removes us from the poll list, but because a
new packet might come in (one which would otherwise never be noticed, due to
the possibility of this race), we attempt to re-add ourselves to the poll list.



APPENDIX 3: Scheduling issues
=============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It has the further effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
together with networking, in order to get the proper softirq/user balance.
Increasing the ksoftirqd priority to 0 (and eventually more) has been reported
to cure problems with low network performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

Relevant sites:
===============
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
	Jamal Hadi Salim <hadi@cyberus.ca>
	Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

	Lennert Buytenhek <buytenh@gnu.org>
	Andrew Morton <akpm@zip.com.au>
	Manfred Spraul <manfred@colorfullife.com>
	Donald Becker <becker@scyld.com>
	Jeff Garzik <jgarzik@pobox.com>