bnx2: bnx2_tx_int() optimizations

When using bnx2 in a high transmit load, bnx2_tx_int() cost is pretty high.

There are two reasons.

One is an expensive call to bnx2_get_hw_tx_cons(bnapi) for each freed skb

One is cpu stalls when accessing skb_is_gso(skb) / skb_shinfo(skb)->nr_frags
because of two cache line misses.
(One to get skb->end/head to compute skb_shinfo(skb),
 one to get is_gso/nr_frags)

This patch :

1) avoids calling bnx2_get_hw_tx_cons(bnapi) too many times.

2) makes bnx2_start_xmit() cache is_gso & nr_frags into sw_tx_bd descriptor.
   This uses a litle bit more ram (256 longs per device on x86), but helps a lot.

3) uses a prefetch(&skb->end) to speedup dev_kfree_skb(), bringing
  cache line that will be needed in skb_release_data()

result is 5 % bandwidth increase in benchmarks, involving UDP or TCP receive
 & transmits, when a cpu is dedicated to ksoftirqd for bnx2.

bnx2_tx_int going from 3.33 % cpu to 0.5 % cpu in oprofile

Note : skb_dma_unmap() still very expensive but this is for another patch,
not related to bnx2 (2.9 % of cpu, while it does nothing on x86_32)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 5b570e1..026ed1c 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6552,6 +6552,8 @@
 
 struct sw_tx_bd {
 	struct sk_buff		*skb;
+	unsigned short		is_gso;
+	unsigned short		nr_frags;
 };
 
 #define SW_RXBD_RING_SIZE (sizeof(struct sw_bd) * RX_DESC_CNT)