HISTORY:
February 16/2002 -- revision 0.2.1:
  COR typo corrected
February 10/2002 -- revision 0.2:
  some spell checking ;->
January 12/2002 -- revision 0.1
  This is still work in progress so may change.
  To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362    17       27622    7        6823
  128    758150     464364    21        9301    10       7738
  256    445632     774646    42       15507    21      12906
  512    232666     994445   241292    19147   241192    1062
 1024    119061    1000003   872519    19258   872511       0
 1440     85193    1000003   946576    19505   946569       0


Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
	packets out of the rx ring. Note from this that the lower the
	load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
	the load, the more often we could not clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts, or at least the interrupt events
that cause packets to be sent up the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also, MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
	I) What is known as Clear-on-read (COR):
	when you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move everything to dev->poll().

	II) Clear-on-write (COW):
	i) You clear the status by writing a 1 in the bit location you
		want cleared.
		These are the majority of the NICs and they work best with
		NAPI. Put only receive events in dev->poll(); leave the rest
		in the old interrupt handler.
	ii) Whatever you write in the status register clears everything ;->
		We can't seem to find any chip supported by Linux which does
		this. If someone knows of such a chip, please email us.
		Move everything to dev->poll().

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition,
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion of it.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
  only one CPU can pick up the initial interrupt and hence the initial
  netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
  This implies receive is totally lockless because of the guarantee that only
  one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
  ring. This happens only in close() and suspend() (when these methods
  try to clean the rx ring);
  ****guarantee: driver authors need not worry about this; synchronization
  is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
  For example, link/MII and txcomplete interrupts continue to function the
  same old way. This improves the latency of processing these events. It is
  also assumed that the receive interrupt is the largest cause of noise.
  Note this might not always be true.
  [According to Manfred Spraul, the winbond insists on sending one
  txmitcomplete interrupt for each packet (although this can be mitigated)].
  For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
	Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
	Puts the device in a state which allows it to be added to the
	CPU polling list if it is up and running. You can look at this as
	the first half of netif_rx_schedule(dev) above; the second half
	being c) below.

c) __netif_rx_schedule(dev)
	Adds the device to the poll list for this CPU, assuming that _prep
	above has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
	Called to reschedule polling for the device, specifically for some
	deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
	Removes the interface from the CPU poll list: it must be on the poll
	list of the current CPU. This primitive is called by dev->poll() when
	it completes its work. The device cannot be off the poll list at this
	call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.
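
As a quick illustration of how a), b) and c) relate, here is a minimal
sketch of the receive-interrupt path (the mask_rx_ints_somehow() helper is
a hypothetical placeholder, not part of the API):

---------------------------------------------------------------------
	/* two-step form: lets us mask rx interrupts between the steps */
	if (netif_rx_schedule_prep(dev)) {
		/* device is up and was not already on the poll list */
		mask_rx_ints_somehow();		/* hypothetical helper */
		__netif_rx_schedule(dev);	/* add to this CPU's poll list */
	}

	/* ...which, without the masking step in between, is equivalent to
	 * the one-shot form: */
	netif_rx_schedule(dev);
---------------------------------------------------------------------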

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets to the stack on the current CPU before yielding to the
network subsystem (so other devices also get an opportunity to send to
the stack).

The dev->poll() prototype looks as follows:
	int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver simply sends
up to the requested number of packets to the stack, if it has them.

More on dev->poll() below, after the interrupt changes are explained.
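
As a rough sketch of the bookkeeping this contract implies (the complete
worked example is developed in section 4; do_rx_work() and
more_rx_work_pending() are illustrative placeholders, not real helpers):

---------------------------------------------------------------------
static int my_poll(struct net_device *dev, int *budget)
{
	int limit = (dev->quota < *budget) ? dev->quota : *budget;
	int received = do_rx_work(dev, limit);	/* packets pushed up the stack */

	dev->quota -= received;
	*budget -= received;

	if (more_rx_work_pending(dev))
		return 1;			/* not done: stay on the poll list */

	netif_rx_complete(dev);			/* done: remove us from the poll list */
	enable_rx_and_rxnobuff_ints();		/* re-enable packet interrupts */
	return 0;
}
---------------------------------------------------------------------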

2) registering dev->poll() method
=================================

dev->poll should be set in the dev->probe() method.
e.g.:
	dev->open = my_open;
	.
	.
	/* two new additions */
	/* first register my poll method */
	dev->poll = my_poll;
	/* next register my weight/quanta; can be overridden in /proc */
	dev->weight = 16;
	.
	.
	dev->stop = my_close;


3) scheduling dev->poll()
=========================
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical Donald Becker
interrupt handler:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{

	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;
	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  /* don't ack rx and rxnobufs here */
/************************ end note ***********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell the system we have work to be done */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note ***********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note ***********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from the above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) and b) a packet arriving and finding no DMA buffers
available (rxnobufs).
This means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI, namely in poll() and refill_rx_ring(),
both discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in a running state and
was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (rather than not running;
the logic being that you shouldn't get an interrupt if the device is not
running). We rectify this by disabling rx and rxnobufs interrupts.
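
As an illustration of what acknowledge_ints_ASAP() and
disable_rx_and_rxnobuff_ints() might look like on a write-1-to-clear
(COW type i) chip, here is a minimal sketch; the register offsets, bit
names and the tp->ioaddr field are hypothetical, not taken from any real
driver:

---------------------------------------------------------------------
#define INTR_STATUS	0x10	/* hypothetical: write 1 to a bit to clear it */
#define INTR_MASK	0x14	/* hypothetical: 1 == interrupt source enabled */
#define RX_INT		0x0001
#define RX_NOBUFS	0x0002

static void acknowledge_ints_ASAP(struct my_private *tp, u32 status)
{
	/* ack everything EXCEPT the rx sources; those bits stay set
	 * until poll()/refill_rx_ring() have actually done the work */
	writel(status & ~(RX_INT | RX_NOBUFS), tp->ioaddr + INTR_STATUS);
}

static void disable_rx_and_rxnobuff_ints(struct my_private *tp)
{
	/* mask the packet-arrival sources; other sources stay enabled */
	writel(readl(tp->ioaddr + INTR_MASK) & ~(RX_INT | RX_NOBUFS),
	       tp->ioaddr + INTR_MASK);
}
---------------------------------------------------------------------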

II) receive_packets(dev) and make_rx_buffs_avail() may seem to have
disappeared. These functionalities are actually still around:
in fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
==============================================

We need to convert the classical Donald Becker receive_packets(dev)
to my_poll().

First, the typical receive_packets() is shown below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets (struct net_device *dev)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err (rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb (pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx (skb);
			.
			.
		} else {  /* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.

}
-------------------------------------------------------------------
We change it to the new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	/* maximum packets to send to the stack */
/************************ note *********************************/
	int rx_work_limit = dev->quota;

/************************ end note *********************************/
	do {	/* outer loop starts here */

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;
			/* read size+status of next frame from DMA ring buffer */
			/* the numbers 16 and 4 are just examples */
			rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err (rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note **********************************/

			/* grab a skb */
			skb = dev_alloc_skb (pkt_size + 2);
			if (skb) {
				.
				.
/************************ note *********************************/
				netif_receive_skb (skb);
/********************** end note **********************************/
				.
				.
			} else {  /* OOM */
				/* seems very driver specific ... common is to just
				   pass whatever is on the ring already. */
			}

			/* move to the next skb on the ring */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++;

		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked */
		status = read_interrupt_status_reg();
		if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on the status
		   bits, they'll be caught by the while check and we go back
		   and clear them, since we haven't exceeded our quota */
	} while (rx_status_is_set);

done:

/************************ note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If the RX ring is not full, we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuff_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which are
	 *    done to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq is raised after the beginning of the outer loop
	 *    (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0; /* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1; /* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0; /* we'll take it from here, so tell the core we are "done" */

/************************ End note *********************************/
}
-------------------------------------------------------------------

From the above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobufs when
   it does the work.
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do { } while loop has been added. This serves the purpose of
   ensuring that, if a new packet came in after we thought we were all done
   and we have not exceeded our quota, we continue sending packets up.

-----------------------------------------------------------
The poll timer code will need to do the following:

a)
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is still not full, we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */
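
Pulling the fragment above together, the oom timer callback might look
roughly like the sketch below. This is an assumption-level sketch only:
the tp->oom_timer field, the HZ/10 retry period and the idea that
start_poll_timer() armed a struct timer_list with the net_device as its
data argument are all illustrative, not taken from a real driver.

---------------------------------------------------------------------
static void my_oom_timer(unsigned long data)
{
	struct net_device *dev = (struct net_device *)data;
	struct my_private *tp = (struct my_private *)dev->priv;

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		mod_timer(&tp->oom_timer, jiffies + HZ/10); /* still OOM; retry later */
	else
		netif_rx_schedule(dev);	/* memory is back; resume polling */
}
---------------------------------------------------------------------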

5) dev->close() and dev->suspend() issues
=========================================
The driver writer needn't worry about these; the top net layer takes
care of them.

6) Adding new stats to /proc
============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet storm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution, but we have found it to not be
much of a commodity feature (in both NICs and switches) and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that with FC it is much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI, but
is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here.

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobufs is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally, the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be turned off
(but CSR5 will continue to be turned on with new packet arrivals, even if
we clear it the first time).
Very importantly, if we turn the interrupt-enable bit on while the status
bit is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we enable
interrupts there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> an irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
Look at the last if statement: you have just finished grabbing all the
packets from the rx ring, and you check whether the status bit says there
are more packets just in; it says none. You then re-enable rx interrupts.
If a new packet just came in during this check, we are counting on the fact
that CSR5 will be set in that small window of opportunity and that, by
re-enabling interrupts, we would actually trigger an interrupt to register
the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and to re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts();
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs();
		goto restart_poll;
	}
---------

Basically, netif_rx_complete() removes us from the poll list, but because a
new packet might come in (one which would otherwise never be noticed, due to
the possibility of this race), we attempt to re-add ourselves to the poll list.



APPENDIX 3: Scheduling issues
=============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It has the further effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
together with networking, in order to get the proper softirq/user balance.
Increasing the ksoftirqd priority to 0 (and eventually more) has been reported
to cure problems with low network performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

Relevant sites:
===============
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
	Jamal Hadi Salim <hadi@cyberus.ca>
	Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

	Lennert Buytenhek <buytenh@gnu.org>
	Andrew Morton <akpm@zip.com.au>
	Manfred Spraul <manfred@colorfullife.com>
	Donald Becker <becker@scyld.com>
	Jeff Garzik <jgarzik@pobox.com>