| HISTORY: |
| February 16/2002 -- revision 0.2.1: |
| COR typo corrected |
| February 10/2002 -- revision 0.2: |
| some spell checking ;-> |
| January 12/2002 -- revision 0.1 |
This is still a work in progress and may change.
| To keep up to date please watch this space. |
| |
| Introduction to NAPI |
| ==================== |
| |
| NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique |
| to improve network performance on Linux. For more details please |
| read that paper. |
NAPI provides an "inherent mitigation" which is bound by system capacity,
| as can be seen from the following data collected by Robert on Gigabit |
| ethernet (e1000): |
| |
| Psize Ipps Tput Rxint Txint Done Ndone |
| --------------------------------------------------------------- |
| 60 890000 409362 17 27622 7 6823 |
| 128 758150 464364 21 9301 10 7738 |
| 256 445632 774646 42 15507 21 12906 |
| 512 232666 994445 241292 19147 241192 1062 |
| 1024 119061 1000003 872519 19258 872511 0 |
| 1440 85193 1000003 946576 19505 946569 0 |
| |
| |
Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
	packets out of the rx ring. Note from this that the lower the
	load the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
	the load the more times we could not clean up the rx ring.
| |
Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system cannot handle the processing at 1 interrupt/packet at that load
level. At lower rates, on the other hand, rx interrupts go up and therefore
the interrupt/packet ratio goes up (as can be seen from the table). So there
is a possibility that under a low enough input rate you get one poll call for
each input packet, each caused by a single interrupt. And if the system
cannot handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....
| |
| |
| 0) Prerequisites: |
| ================== |
| A driver MAY continue using the old 2.4 technique for interfacing |
| to the network stack and not benefit from the NAPI changes. |
| NAPI additions to the kernel do not break backward compatibility. |
| NAPI, however, requires the following features to be available: |
| |
| A) DMA ring or enough RAM to store packets in software devices. |
| |
B) Ability to turn off interrupts, or at least the events that cause packets
to be sent up the stack.
| |
NAPI processes packet events in what is known as the dev->poll() method.
| Typically, only packet receive events are processed in dev->poll(). |
| The rest of the events MAY be processed by the regular interrupt handler |
| to reduce processing latency (justified also because there are not that |
| many of them). |
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
| Tests with the tulip driver indicated slightly increased latency if |
| all of the interrupt handler is moved to dev->poll(). Also MII handling |
| gets a little trickier. |
| The example used in this document is to move the receive processing only |
| to dev->poll(); this is shown with the patch for the tulip driver. |
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.
| |
| There are caveats that might force you to go with moving everything to |
| dev->poll(). Different NICs work differently depending on their status/event |
| acknowledgement setup. |
There are two types of event register ACK mechanisms:
I) What is known as Clear-on-read (COR):
when you read the status/event register, it clears everything!
The natsemi and sunbmac NICs are known to do this.
In this case your only choice is to move everything to dev->poll().

II) Clear-on-write (COW):
i) You clear the status by writing a 1 to the bit location you want
cleared. These are the majority of the NICs, and they work best with
NAPI. Put only receive events in dev->poll(); leave the rest in
the old interrupt handler.
ii) Whatever you write in the status register clears everything ;->
We can't seem to find any chip supported by Linux which does this. If
someone knows of such a chip, please email us.
Move everything to dev->poll().
| |
| C) Ability to detect new work correctly. |
| NAPI works by shutting down event interrupts when there's work and |
| turning them on when there's none. |
A packet might sneak in during the small window in which we are re-enabling
interrupts (refer to appendix 2); we only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition,
which for clarity we'll refer to as the "rotting packet" race.

This is a very important topic, and appendix 2 is dedicated to further
discussion of it.
| |
| Locking rules and environmental guarantees |
| ========================================== |
| |
- Guarantee: only one CPU at any time can call dev->poll(); this is because
only one CPU can pick up the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
This implies receive processing is totally lockless because of the guarantee
that only one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
For example, link/MII and txcomplete interrupts continue functioning just
the same old way. This improves the latency of processing these events. It is
also assumed that the receive interrupt is the largest cause of noise. Note
this might not always be true.
[According to Manfred Spraul, the winbond chip insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For such broken NICs, move everything to dev->poll().
| |
| For the rest of this text, we'll assume that dev->poll() only |
| processes receive events. |
| |
New methods introduced by NAPI
==============================
| |
| a) netif_rx_schedule(dev) |
	Called by an IRQ handler to schedule a poll for the device
| |
| b) netif_rx_schedule_prep(dev) |
| puts the device in a state which allows for it to be added to the |
| CPU polling list if it is up and running. You can look at this as |
| the first half of netif_rx_schedule(dev) above; the second half |
| being c) below. |
| |
| c) __netif_rx_schedule(dev) |
| Add device to the poll list for this CPU; assuming that _prep above |
| has already been called and returned 1. |
| |
| d) netif_rx_reschedule(dev, undo) |
| Called to reschedule polling for device specifically for some |
| deficient hardware. Read Appendix 2 for more details. |
| |
e) netif_rx_complete(dev)

	Remove the interface from the CPU poll list: it must be on the poll
	list of the current CPU. This primitive is called by dev->poll() when
	it completes its work. The device cannot be off the poll list at this
	call; if it is, then clearly it is a BUG(). You'll know ;->
| |
All of the above methods are used below, so keep reading for clarity. A
minimal usage sketch of the scheduling side follows.
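
For instance, netif_rx_schedule(dev) is literally b) followed by c). Here is
a minimal sketch of the IRQ-handler side; the mydev_irq() and
disable_my_rx_ints() names are hypothetical, and all error paths are omitted:

-------------------------------------------------------------------
static irqreturn_t mydev_irq(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;

        /* netif_rx_schedule(dev) == _prep() plus __netif_rx_schedule() */
        if (netif_rx_schedule_prep(dev)) {
                disable_my_rx_ints(dev);        /* hypothetical */
                __netif_rx_schedule(dev);       /* on this CPU's poll list */
        }
        return IRQ_HANDLED;
}
-------------------------------------------------------------------

netif_rx_complete(dev) is then called from the dev->poll() side once the
rx ring has been drained, as shown in the sections below.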
| |
| Device driver changes to be made when porting NAPI |
| ================================================== |
| |
| Below we describe what kind of changes are required for NAPI to work. |
| |
| 1) introduction of dev->poll() method |
| ===================================== |
| |
This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets up the stack on the current CPU before yielding to the
network subsystem (so that other devices can also get an opportunity to
send packets to the stack).
| |
| dev->poll() prototype looks as follows: |
| int my_poll(struct net_device *dev, int *budget) |
| |
budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
Each driver is responsible for decrementing both *budget and dev->quota by
the total number of packets sent.
The total number of packets sent cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver simply sends
up the stack, if it can, the packet quantity requested.
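
A minimal sketch of this accounting contract; process_rx_ring() is a
hypothetical helper that returns the number of packets it pushed up, and
enable_rx_ints() stands in for the driver-specific unmasking:

-------------------------------------------------------------------
static int my_poll(struct net_device *dev, int *budget)
{
        /* never send more than min(*budget, dev->quota) packets up */
        int limit = (*budget < dev->quota) ? *budget : dev->quota;
        int done = process_rx_ring(dev, limit); /* hypothetical helper */

        dev->quota -= done;
        *budget -= done;

        if (done < limit) {             /* ring drained before the limit */
                netif_rx_complete(dev);
                enable_rx_ints(dev);    /* hypothetical */
                return 0;               /* done */
        }
        return 1;                       /* more work; poll us again */
}
-------------------------------------------------------------------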
| |
More on dev->poll() below, after the interrupt changes are explained.
| |
| 2) registering dev->poll() method |
| =================================== |
| |
| dev->poll should be set in the dev->probe() method. |
| e.g: |
| dev->open = my_open; |
| . |
| . |
| /* two new additions */ |
| /* first register my poll method */ |
| dev->poll = my_poll; |
| /* next register my weight/quanta; can be overridden in /proc */ |
| dev->weight = 16; |
| . |
| . |
| dev->stop = my_close; |
| |
| |
| |
| 3) scheduling dev->poll() |
| ============================= |
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical Donald Becker
interrupt handler:
| |
| ------------------ |
| static irqreturn_t |
| netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) |
| { |
| |
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;
	int work_count = my_work_count;
	u32 status;
| status = read_interrupt_status_reg(); |
| if (status == 0) |
| return IRQ_NONE; /* Shared IRQ: not us */ |
| if (status == 0xffff) |
| return IRQ_HANDLED; /* Hot unplug */ |
| if (status & error) |
		do_some_error_handling();
| |
| do { |
| acknowledge_ints_ASAP(); |
| |
| if (status & link_interrupt) { |
| spin_lock(&tp->link_lock); |
| do_some_link_stat_stuff(); |
			spin_unlock(&tp->link_lock);
| } |
| |
| if (status & rx_interrupt) { |
| receive_packets(dev); |
| } |
| |
| if (status & rx_nobufs) { |
| make_rx_buffs_avail(); |
| } |
| |
| if (status & tx_related) { |
| spin_lock(&tp->lock); |
| tx_ring_free(dev); |
| if (tx_died) |
| restart_tx(); |
| spin_unlock(&tp->lock); |
| } |
| |
| status = read_interrupt_status_reg(); |
| |
	} while (!(status & error) && more_work_to_be_done);
| return IRQ_HANDLED; |
| } |
| |
| ---------------------------------------------------------------------- |
| |
| We now change this to what is shown below to NAPI-enable it: |
| |
| ---------------------------------------------------------------------- |
| static irqreturn_t |
| netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) |
| { |
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;
	u32 status;
| |
| status = read_interrupt_status_reg(); |
| if (status == 0) |
| return IRQ_NONE; /* Shared IRQ: not us */ |
| if (status == 0xffff) |
| return IRQ_HANDLED; /* Hot unplug */ |
| if (status & error) |
| do_some_error_handling(); |
| |
| do { |
| /************************ start note *********************************/ |
		acknowledge_ints_ASAP(); /* don't ack rx and rxnobuff here */
| /************************ end note *********************************/ |
| |
| if (status & link_interrupt) { |
| spin_lock(&tp->link_lock); |
| do_some_link_stat_stuff(); |
| spin_unlock(&tp->link_lock); |
| } |
| /************************ start note *********************************/ |
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
| if (netif_rx_schedule_prep(dev)) { |
| |
| /* disable interrupts caused |
| * by arriving packets */ |
| disable_rx_and_rxnobuff_ints(); |
| /* tell system we have work to be done. */ |
| __netif_rx_schedule(dev); |
| } else { |
| printk("driver bug! interrupt while in poll\n"); |
| /* FIX by disabling interrupts */ |
| disable_rx_and_rxnobuff_ints(); |
| } |
| } |
/************************ end note *********************************/
| |
| if (status & tx_related) { |
| spin_lock(&tp->lock); |
| tx_ring_free(dev); |
| |
| if (tx_died) |
| restart_tx(); |
| spin_unlock(&tp->lock); |
| } |
| |
| status = read_interrupt_status_reg(); |
| |
| /************************ start note *********************************/ |
	} while (!(status & error) && more_work_to_be_done(status));
/************************ end note *********************************/
| return IRQ_HANDLED; |
| } |
| |
| --------------------------------------------------------------------- |
| |
| |
| We note several things from above: |
| |
I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint), and b) a packet arriving and finding no DMA buffers
available (rxnobuff).
This also means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
the proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below (a sketch of this follows note II).
netif_rx_schedule_prep() returns 1 if the device is in a running state and
was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (rather than not running;
the logic is based on the fact that you shouldn't get an interrupt if the
device is not running). We rectify this by disabling rx and rxnobuff
interrupts.
| |
II) receive_packets(dev) and make_rx_buffs_avail() may appear to have
disappeared. These functionalities are actually still around; in fact,
receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().
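
To make note I concrete, here is a minimal COW-style sketch of an
acknowledge_ints_ASAP() helper (written here to take dev explicitly); the
INTR_STATUS offset is an assumption, while rx_interrupt and rx_nobufs are
the pseudo bit masks used above:

-------------------------------------------------------------------
static void acknowledge_ints_ASAP(struct net_device *dev)
{
        long ioaddr = dev->base_addr;
        u32 status = readl(ioaddr + INTR_STATUS);

        /* COW: writing 1s clears the corresponding bits. Ack everything
         * EXCEPT the two rx-related sources; those are cleared later by
         * dev->poll() and refill_rx_ring() once the work has actually
         * been done. */
        writel(status & ~(rx_interrupt | rx_nobufs), ioaddr + INTR_STATUS);
}
-------------------------------------------------------------------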
| |
| 4) converting receive_packets() to dev->poll() |
| =============================================== |
| |
We need to convert the classical Donald Becker receive_packets(dev) to
my_poll().

First, the typical receive_packets() is shown below:
| ------------------------------------------------------------------- |
| |
| /* this is called by interrupt handler */ |
| static void receive_packets (struct net_device *dev) |
| { |
| |
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;
| |
| while (rx_ring_not_empty) { |
| u32 rx_status; |
| unsigned int rx_size; |
| unsigned int pkt_size; |
| struct sk_buff *skb; |
| /* read size+status of next frame from DMA ring buffer */ |
		/* the numbers 16 and 4 are just examples */
| rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); |
| rx_size = rx_status >> 16; |
| pkt_size = rx_size - 4; |
| |
| /* process errors */ |
| if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || |
| (!(rx_status & RxStatusOK))) { |
| netdrv_rx_err (rx_status, dev, tp, ioaddr); |
| return; |
| } |
| |
| if (--rx_work_limit < 0) |
| break; |
| |
| /* grab a skb */ |
| skb = dev_alloc_skb (pkt_size + 2); |
| if (skb) { |
| . |
| . |
| netif_rx (skb); |
| . |
| . |
| } else { /* OOM */ |
| /*seems very driver specific ... some just pass |
| whatever is on the ring already. */ |
| } |
| |
		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;
| |
| } |
| |
| /* store current ring pointer state */ |
| tp->cur_rx = cur_rx; |
| |
| /* Refill the Rx ring buffers if they are needed */ |
	refill_rx_ring(dev);
| . |
| . |
| |
| } |
| ------------------------------------------------------------------- |
| We change it to a new one below; note the additional parameter in |
| the call. |
| |
| ------------------------------------------------------------------- |
| |
| /* this is called by the network core */ |
| static int my_poll (struct net_device *dev, int *budget) |
| { |
| |
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	u32 status;
	/* maximum packets to send to the stack */
	/************************ start note *********************************/
	int rx_work_limit = dev->quota;
	/************************ end note ***********************************/
| do { // outer beginning loop starts here |
| |
| clear_rx_status_register_bit(); |
| |
| while (rx_ring_not_empty) { |
| u32 rx_status; |
| unsigned int rx_size; |
| unsigned int pkt_size; |
| struct sk_buff *skb; |
| /* read size+status of next frame from DMA ring buffer */ |
		/* the numbers 16 and 4 are just examples */
| rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); |
| rx_size = rx_status >> 16; |
| pkt_size = rx_size - 4; |
| |
| /* process errors */ |
| if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || |
| (!(rx_status & RxStatusOK))) { |
| netdrv_rx_err (rx_status, dev, tp, ioaddr); |
| return 1; |
| } |
| |
/************************ start note *********************************/
| if (--rx_work_limit < 0) { /* we got packets, but no quota */ |
| /* store current ring pointer state */ |
| tp->cur_rx = cur_rx; |
| |
| /* Refill the Rx ring buffers if they are needed */ |
| refill_rx_ring(dev); |
| goto not_done; |
| } |
| /********************** end note **********************************/ |
| |
| /* grab a skb */ |
| skb = dev_alloc_skb (pkt_size + 2); |
| if (skb) { |
| . |
| . |
/************************ start note *********************************/
| netif_receive_skb (skb); |
| /********************** end note **********************************/ |
| . |
| . |
| } else { /* OOM */ |
| /*seems very driver specific ... common is just pass |
| whatever is on the ring already. */ |
| } |
| |
		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;
| |
| } |
| |
| /* store current ring pointer state */ |
| tp->cur_rx = cur_rx; |
| |
| /* Refill the Rx ring buffers if they are needed */ |
| refill_rx_ring(dev); |
| |
	/* no packets on ring; but new ones can arrive since we last
	   checked */
	status = read_interrupt_status_reg();
	if (!(status & (rx_interrupt | rx_nobufs))) {
		/* If something arrives in this narrow window,
		   an interrupt will be generated */
		goto done;
	}
	/* done! at least that's what it looks like ;->
	   if new packets came in after our last check on the status bits,
	   they'll be caught by the while check, and we go back and clear them
	   since we haven't exceeded our quota */
	} while (status & (rx_interrupt | rx_nobufs));
| |
| done: |
| |
/************************ start note *********************************/
| dev->quota -= received; |
| *budget -= received; |
| |
| /* If RX ring is not full we are out of memory. */ |
| if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) |
| goto oom; |
| |
| /* we are happy/done, no more packets on ring; put us back |
| to where we can start processing interrupts again */ |
| netif_rx_complete(dev); |
	enable_rx_and_rxnobuff_ints();
| |
| /* The last op happens after poll completion. Which means the following: |
| * 1. it can race with disabling irqs in irq handler (which are done to |
| * schedule polls) |
| * 2. it can race with dis/enabling irqs in other poll threads |
| * 3. if an irq raised after the beginning of the outer beginning |
| * loop (marked in the code above), it will be immediately |
| * triggered here. |
| * |
| * Summarizing: the logic may result in some redundant irqs both |
| * due to races in masking and due to too late acking of already |
| * processed irqs. The good news: no events are ever lost. |
| */ |
| |
| return 0; /* done */ |
| |
| not_done: |
| if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || |
| tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) |
| refill_rx_ring(dev); |
| |
| if (!received) { |
| printk("received==0\n"); |
| received = 1; |
| } |
| dev->quota -= received; |
| *budget -= received; |
| return 1; /* not_done */ |
| |
| oom: |
| /* Start timer, stop polling, but do not enable rx interrupts. */ |
| start_poll_timer(dev); |
| return 0; /* we'll take it from here so tell core "done"*/ |
| |
/************************ end note ***********************************/
| } |
| ------------------------------------------------------------------- |
| |
From the above we note that:
0) rx_work_limit = dev->quota.
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work (see the sketch after this list).
2) we have a done and a not_done state.
3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) we have a new way of handling the oom condition.
5) a new outer do {} while loop has been added. This serves the purpose of
ensuring that, if a new packet comes in after we are all set and done and
we have not exceeded our quota, we continue sending packets up.
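
A minimal sketch of refill_rx_ring() under these rules; PKT_BUF_SZ, the
rx_buffers[] bookkeeping and clear_rxnobuff_status_bit() are assumptions
borrowed from the fragments above:

-------------------------------------------------------------------
static void refill_rx_ring(struct net_device *dev)
{
        struct my_private *tp = (struct my_private *)dev->priv;
        int refilled = 0;

        while (tp->cur_rx - tp->dirty_rx > 0) {
                unsigned int entry = tp->dirty_rx % RX_RING_SIZE;

                if (tp->rx_buffers[entry].skb == NULL) {
                        struct sk_buff *skb = dev_alloc_skb(PKT_BUF_SZ);

                        if (skb == NULL)
                                break;  /* OOM: try again later */
                        tp->rx_buffers[entry].skb = skb;
                        /* ... map skb and hand the descriptor back
                         * to the NIC here ... */
                        refilled++;
                }
                tp->dirty_rx++;
        }

        /* only now is it safe to ack the rxnobuff condition */
        if (refilled)
                clear_rxnobuff_status_bit();    /* hypothetical */
}
-------------------------------------------------------------------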
| |
| |
| ----------------------------------------------------------- |
The poll timer code will need to do the following:

static void my_poll_timer(unsigned long data)
{
	struct net_device *dev = (struct net_device *)data;
	struct my_private *tp = (struct my_private *)dev->priv;

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is still not full, we are still out of memory;
	   restart the timer. Else, re-add ourselves to the master
	   poll list. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */
}
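
For completeness, a hedged sketch of start_poll_timer() using the kernel
timer API; the tp->oom_timer field and the HZ/10 interval are assumptions
(in a real driver the timer would be initialized once, in open()):

static void start_poll_timer(struct net_device *dev)
{
        struct my_private *tp = (struct my_private *)dev->priv;

        init_timer(&tp->oom_timer);
        tp->oom_timer.function = my_poll_timer;
        tp->oom_timer.data = (unsigned long)dev;
        mod_timer(&tp->oom_timer, jiffies + HZ/10);
}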
| |
| 5) dev->close() and dev->suspend() issues |
| ========================================== |
| The driver writer needn't worry about this; the top net layer takes |
| care of it. |
| |
| 6) Adding new Stats to /proc |
| ============================= |
| In order to debug some of the new features, we introduce new stats |
| that need to be collected. |
| TODO: Fill this later. |
| |
| APPENDIX 1: discussion on using ethernet HW FC |
| ============================================== |
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet
storm. Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution, but we have found it to not be
much of a commodity feature (both in NICs and switches), and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI, but
is not necessary.
| |
| |
| APPENDIX 2: the "rotting packet" race-window avoidance scheme |
| ============================================================= |
| |
There are two types of associations seen here:

1) status/int which honors level-triggered IRQs

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated (assuming the status bit was not turned off).
Generally, the concept of level-triggered IRQs in association with a status
and interrupt-enable CSR register set is used to avoid the race.
| |
If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be turned
off (but CSR5 will continue to be set again by new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt-enable bit on while
the status bit is set, an immediate interrupt is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we enable
interrupts there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:
| |
| -------------------------- |
| do { |
| ACK; |
| while (ring_is_not_empty()) { |
| work-work-work |
| if quota is exceeded: exit, no touching irq status/mask |
| } |
| /* No packets, but new can arrive while we are doing this*/ |
| CSR5 := read |
| if (CSR5 is not set) { |
| /* If something arrives in this narrow window here, |
| * where the comments are ;-> irq will be generated */ |
| unmask irqs; |
| exit poll; |
| } |
| } while (rx_status_is_set); |
| ------------------------ |
| |
The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you have just finished grabbing all the packets from the rx ring; you check
whether the status bit says more packets just came in; it says none. You
then enable rx interrupts again. If a new packet came in during this check,
we are counting on the fact that CSR5 will be set in that small window of
opportunity, and that by re-enabling interrupts we will actually trigger an
interrupt to register the new packet for processing.
| |
[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]
| |
| 2) non-capable hardware |
| |
These do not generally respect level-triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is to
do a double check for new input after netif_rx_complete() is invoked,
re-enabling polling if new input is seen.
| |
| Sample code: |
| |
| --------- |
| . |
| . |
| restart_poll: |
| while (ring_is_not_empty()) { |
| work-work-work |
| if quota is exceeded: exit, not touching irq status/mask |
| } |
| . |
| . |
| . |
	enable_rx_interrupts();
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobuff_ints();
		goto restart_poll;
	}
| --------- |
| |
Basically, netif_rx_complete() removes us from the poll list, but a new
packet may have arrived in the race window and would never be caught, so
we attempt to re-add ourselves to the poll list.
| |
| |
| |
| |
| APPENDIX 3: Scheduling issues. |
| ============================== |
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, and
by putting them under scheduler control. This also prevents consecutive
softirqs from monopolizing the CPU, and it means the priority of ksoftirqd
needs to be considered when running very CPU-intensive applications together
with networking, in order to get the proper softirq/user balance. Increasing
ksoftirqd priority to 0 (or possibly higher) is reported to cure problems
with low network performance at high CPU load.
| |
The most active processes in a GigE router:
| USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND |
| root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0) |
| root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated |
| |
| -------------------------------------------------------------------- |
| |
| relevant sites: |
| ================== |
| ftp://robur.slu.se/pub/Linux/net-development/NAPI/ |
| |
| |
| -------------------------------------------------------------------- |
| TODO: Write net-skeleton.c driver. |
| ------------------------------------------------------------- |
| |
| Authors: |
| ======== |
| Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> |
| Jamal Hadi Salim <hadi@cyberus.ca> |
| Robert Olsson <Robert.Olsson@data.slu.se> |
| |
| Acknowledgements: |
| ================ |
| People who made this document better: |
| |
| Lennert Buytenhek <buytenh@gnu.org> |
| Andrew Morton <akpm@zip.com.au> |
| Manfred Spraul <manfred@colorfullife.com> |
| Donald Becker <becker@scyld.com> |
| Jeff Garzik <jgarzik@pobox.com> |