doc: packet_mmap: update doc to implementation status

This improves the packet_mmap.txt document in the following ways:

 * Add initial information about different TPACKET versions
 * Add initial information about packet fanout
 * Add pointer to BPF document (since this also could be of interest)
 * 'Fix' minor, rather cosmetic things

Information partially taken from related commit messages.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Cc: Ulisses Alonso Camaró <uaca@alumni.uv.es>
Cc: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 7cd879e..94444b1 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -3,9 +3,9 @@
 --------------------------------------------------------------------------------
 
 This file documents the mmap() facility available with the PACKET
-socket interface on 2.4 and 2.6 kernels. This type of sockets is used for 
-capture network traffic with utilities like tcpdump or any other that needs
-raw access to network interface.
+socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or iii) do anything else that needs raw access to a network interface.
 
 You can find the latest version of this document at:
     http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
@@ -21,19 +21,18 @@
 + Why use PACKET_MMAP
 --------------------------------------------------------------------------------
 
-In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call
-to capture each packet, it requires two if you want to get packet's 
-timestamp (like libpcap always does).
+In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is
+very inefficient. It uses very limited buffers and requires one system call to
+capture each packet, or two if you also want the packet's timestamp (like
+libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 
 configurable circular buffer mapped in user space that can be used to either
 send or receive packets. This way reading packets just needs to wait for them,
 most of the time there is no need to issue a single system call. Concerning
 transmission, multiple packets can be sent through one system call to get the
-highest bandwidth.
-By using a shared buffer between the kernel and the user also has the benefit
-of minimizing packet copies.
+highest bandwidth. Using a shared buffer between the kernel and user space
+also has the benefit of minimizing packet copies.
 
 It's fine to use PACKET_MMAP to improve the performance of the capture and
 transmission process, but it isn't everything. At least, if you are capturing
@@ -41,7 +40,8 @@
 device driver of your network interface card supports some sort of interrupt
 load mitigation or (even better) if it supports NAPI, also make sure it is
 enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network.
+supported by devices of your network. Pinning the IRQs of your network
+interface card to specific CPUs can also be an advantage.
 
 --------------------------------------------------------------------------------
 + How to use mmap() to improve capture process
@@ -87,9 +87,7 @@
 socket creation and destruction is straight forward, and is done 
 the same way with or without PACKET_MMAP:
 
-int fd;
-
-fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
 
 where mode is SOCK_RAW for the raw interface were link level
 information can be captured or SOCK_DGRAM for the cooked
@@ -180,7 +178,6 @@
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
-
 To setup PACKET_MMAP from user level code is done with a call like
 
  - Capture process
@@ -214,7 +211,6 @@
 
     frames_per_block * tp_block_nr == tp_frame_nr
 
-
 Lets see an example, with the following values:
 
      tp_block_size= 4096
@@ -240,7 +236,6 @@
 account when choosing the frame_size. See "Mapping and use of the circular 
 buffer (ring)".
 
-
 --------------------------------------------------------------------------------
 + PACKET_MMAP setting constraints
 --------------------------------------------------------------------------------
@@ -277,7 +272,6 @@
 The pagesize can also be determined dynamically with the getpagesize (2) 
 system call. 
 
-
  Block number limit
 --------------------
 
@@ -297,7 +291,6 @@
       v  block #2
      block #1
 
-
 kmalloc allocates any number of bytes of physically contiguous memory from 
 a pool of pre-determined sizes. This pool of memory is maintained by the slab 
 allocator which is at the end the responsible for doing the allocation and 
@@ -312,7 +305,6 @@
 
      131072/4 = 32768 blocks
 
-
  PACKET_MMAP buffer size calculator
 ------------------------------------
 
@@ -353,7 +345,6 @@
 and hence the buffer will have a 262144 MiB size. So it can hold 
 262144 MiB / 2048 bytes = 134217728 frames
 
-
 Actually, this buffer size is not possible with an i386 architecture. 
 Remember that the memory is allocated in kernel space, in the case of 
 an i386 kernel's memory size is limited to 1GiB.
@@ -385,7 +376,6 @@
    - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
    - Pad to align to TPACKET_ALIGNMENT=16
  */
-           
  
  The following are conditions that are checked in packet_set_ring
 
@@ -426,7 +416,6 @@
      #define TP_STATUS_LOSING        4 
      #define TP_STATUS_CSUMNOTREADY  8 
 
-
 TP_STATUS_COPY        : This flag indicates that the frame (and associated
                         meta information) has been truncated because it's 
                         larger than tp_frame_size. This packet can be 
@@ -475,7 +464,6 @@
 It doesn't incur in a race condition to first check the status value and 
 then poll for frames.
 
-
 ++ Transmission process
 Those defines are also used for transmission:
 
@@ -507,6 +495,196 @@
     retval = poll(&pfd, 1, timeout);
 
 -------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version; socklen_t len = sizeof(val);
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2 or TPACKET_V3.
+
+TPACKET_V1:
+	- Default if not otherwise specified by setsockopt(2)
+	- RX_RING, TX_RING available
+	- VLAN metadata information available for packets
+	  (TP_STATUS_VLAN_VALID)
+
+TPACKET_V1 --> TPACKET_V2:
+	- Made 64 bit clean due to unsigned long usage in TPACKET_V1
+	  structures, thus this also works on 64 bit kernel with 32 bit
+	  userspace and the like
+	- Timestamp resolution in nanoseconds instead of microseconds
+	- RX_RING, TX_RING available
+	- How to switch to TPACKET_V2:
+		1. Replace struct tpacket_hdr by struct tpacket2_hdr
+		2. Query header len via getsockopt(PACKET_HDRLEN) and save
+		3. Set protocol version to 2, set up ring as usual
+		4. For getting the sockaddr_ll,
+		   use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+		   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+	- Flexible buffer implementation:
+		1. Blocks can be configured with non-static frame-size
+		2. Read/poll is at a block-level (as opposed to packet-level)
+		3. Added poll timeout to avoid indefinite user-space wait
+		   on idle links
+		4. Added user-configurable knobs:
+			4.1 block::timeout
+			4.2 tpkt_hdr::sk_rxhash
+	- RX Hash data available in user space
+	- Currently only RX_RING available
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT			18
+# define PACKET_FANOUT_HASH		0
+# define PACKET_FANOUT_LB		1
+#endif
+
+static int setup_socket(void)
+{
+	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+	struct sockaddr_ll ll;
+	struct ifreq ifr;
+	int fanout_arg;
+
+	if (fd < 0) {
+		perror("socket");
+		return -1;	/* callers test for fd < 0 */
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+	strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
+	err = ioctl(fd, SIOCGIFINDEX, &ifr);
+	if (err < 0) {
+		perror("SIOCGIFINDEX");
+		return -1;
+	}
+
+	memset(&ll, 0, sizeof(ll));
+	ll.sll_family = AF_PACKET;
+	ll.sll_ifindex = ifr.ifr_ifindex;
+	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+	if (err < 0) {
+		perror("bind");
+		return -1;
+	}
+
+	fanout_arg = (fanout_id | (fanout_type << 16));
+	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+			 &fanout_arg, sizeof(fanout_arg));
+	if (err) {
+		perror("setsockopt");
+		return -1;
+	}
+
+	return fd;
+}
+
+static void fanout_thread(void)
+{
+	int fd = setup_socket();
+	int limit = 10000;
+
+	if (fd < 0)
+		exit(EXIT_FAILURE);
+
+	while (limit-- > 0) {
+		char buf[1600];
+		int err;
+
+		err = read(fd, buf, sizeof(buf));
+		if (err < 0) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		if ((limit % 10) == 0)
+			fprintf(stdout, "(%d) \n", getpid());
+	}
+
+	fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+	close(fd);
+	exit(0);
+}
+
+int main(int argc, char **argp)
+{
+	int fd, err;
+	int i;
+
+	if (argc != 3) {
+		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+		return EXIT_FAILURE;
+	}
+
+	if (!strcmp(argp[2], "hash"))
+		fanout_type = PACKET_FANOUT_HASH;
+	else if (!strcmp(argp[2], "lb"))
+		fanout_type = PACKET_FANOUT_LB;
+	else {
+		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+		exit(EXIT_FAILURE);
+	}
+
+	device_name = argp[1];
+	fanout_id = getpid() & 0xffff;
+
+	for (i = 0; i < 4; i++) {
+		pid_t pid = fork();
+
+		switch (pid) {
+		case 0:
+			fanout_thread();
+
+		case -1:
+			perror("fork");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < 4; i++) {
+		int status;
+
+		wait(&status);
+	}
+
+	return 0;
+}
+
+-------------------------------------------------------------------------------
 + PACKET_TIMESTAMP
 -------------------------------------------------------------------------------
 
@@ -532,6 +710,13 @@
 See include/linux/net_tstamp.h and Documentation/networking/timestamping
 for more information on hardware timestamps.
 
+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, so you might
+  also want to have a look at Documentation/networking/filter.txt
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------