net: percpu net_device refcount
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.
There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)
We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.
On x86, dev_hold(dev) code :
before
lock incl 0x280(%ebx)
after:
movl 0x260(%ebx),%eax
incl fs:(%eax)
Stress bench :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)
Before:
real 1m1.662s
user 0m14.373s
sys 12m55.960s
After:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 61e0efd..6220d9d 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -2701,7 +2701,7 @@
nesibdev = nesvnic->nesibdev;
nes_debug(NES_DBG_CM, "netdev refcnt = %u.\n",
- atomic_read(&nesvnic->netdev->refcnt));
+ netdev_refcnt_read(nesvnic->netdev));
if (nesqp->active_conn) {
@@ -2791,7 +2791,7 @@
atomic_inc(&cm_accepts);
nes_debug(NES_DBG_CM, "netdev refcnt = %u.\n",
- atomic_read(&nesvnic->netdev->refcnt));
+ netdev_refcnt_read(nesvnic->netdev));
/* allocate the ietf frame and space for private data */
nesqp->ietf_frame = pci_alloc_consistent(nesdev->pcidev,