IP OVER INFINIBAND

  The ib_ipoib driver is an implementation of the IP over InfiniBand
  protocol as specified by RFCs 4391 and 4392, issued by the IETF ipoib
  working group. It is a "native" implementation in the sense of
  setting the interface type to ARPHRD_INFINIBAND and the hardware
  address length to 20 (earlier proprietary implementations
  masqueraded to the kernel as Ethernet interfaces).

Partitions and P_Keys

  When the IPoIB driver is loaded, it creates one interface for each
  port using the P_Key at index 0. To create an interface with a
  different P_Key, write the desired P_Key into the main interface's
  /sys/class/net/<intf name>/create_child file. For example:

    echo 0x8001 > /sys/class/net/ib0/create_child

  This will create an interface named ib0.8001 with P_Key 0x8001. To
  remove a subinterface, use the "delete_child" file:

    echo 0x8001 > /sys/class/net/ib0/delete_child

  The P_Key for any interface is given by the "pkey" file, and the
  main interface for a subinterface is given by the "parent" file.
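  The sysfs steps above can be wrapped in small shell helpers. This is
  a minimal sketch, not part of the driver: the function names and the
  SYSFS_NET variable are hypothetical, and SYSFS_NET exists only so the
  helpers can be tried against a scratch directory instead of the real
  /sys/class/net. The name pattern is inferred from the ib0.8001
  example.

```shell
# Hypothetical helpers around the create_child, pkey and parent files.
# SYSFS_NET is an assumption of this sketch; it defaults to the real path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the expected child name: <parent>.<P_Key as 4 hex digits>,
# following the ib0.8001 example above.
ipoib_child_name() {
    printf '%s.%04x\n' "$1" "$(($2))"
}

# Ask the driver to create a child of interface $1 with P_Key $2.
# Writing to the real sysfs file requires root.
ipoib_create_child() {
    echo "$2" > "$SYSFS_NET/$1/create_child"
}

# Show the P_Key and parent of interface $1. The parent file is
# read quietly in case the interface has none.
ipoib_show() {
    printf 'pkey=%s parent=%s\n' \
        "$(cat "$SYSFS_NET/$1/pkey")" \
        "$(cat "$SYSFS_NET/$1/parent" 2>/dev/null)"
}
```

  For example, "ipoib_create_child ib0 0x8001" followed by
  "ipoib_show ib0.8001" creates the subinterface and then reports its
  P_Key and parent.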

  Child interfaces can also be created and deleted through IPoIB's
  rtnl_link_ops; children created either way behave the same.
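  From user space, the rtnl_link_ops path is normally reached through
  the "ip link" command with the "ipoib" link type. The following is a
  sketch under the assumption that the installed iproute2 supports that
  link type; the wrapper function names are hypothetical, and the
  commands need root and a real IPoIB parent interface.

```shell
# Create a child of parent $1 with P_Key $2 over netlink.
# The child is named <parent>.<pkey> to match the sysfs convention.
ipoib_link_add() {
    ip link add link "$1" \
        name "$(printf '%s.%04x' "$1" "$(($2))")" \
        type ipoib pkey "$2"
}

# Delete a child interface by name, e.g. ib0.8001.
ipoib_link_del() {
    ip link delete dev "$1"
}
```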

Datagram vs Connected modes

  The IPoIB driver supports two modes of operation: datagram and
  connected. The mode is set and read through an interface's
  /sys/class/net/<intf name>/mode file.
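  Reading and setting the mode can be sketched as follows. The helper
  names and the SYSFS_NET override are hypothetical; the strings
  written are the two mode names given above.

```shell
# SYSFS_NET is an assumption of this sketch so it can be tried
# against a scratch directory; it defaults to the real sysfs path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the current mode of interface $1.
ipoib_get_mode() {
    cat "$SYSFS_NET/$1/mode"
}

# Set the mode of interface $1 to $2, rejecting anything other
# than the two documented mode names.
ipoib_set_mode() {
    case "$2" in
        datagram|connected)
            echo "$2" > "$SYSFS_NET/$1/mode"
            ;;
        *)
            echo "unknown IPoIB mode: $2" >&2
            return 1
            ;;
    esac
}
```

  For example, "ipoib_set_mode ib0 connected" followed by
  "ipoib_get_mode ib0" switches ib0 to connected mode and reads the
  setting back.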

  In datagram mode, the IB UD (Unreliable Datagram) transport is used,
  so the interface MTU is equal to the IB L2 MTU minus the
  IPoIB encapsulation header (4 bytes). For example, in a typical IB
  fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
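  The same arithmetic applied to the standard IB MTU sizes, as a small
  illustrative sketch (the function and variable names are
  hypothetical):

```shell
# IPoIB encapsulation header size in bytes, as stated above.
IPOIB_ENCAP_LEN=4

# Print the datagram-mode IPoIB MTU for IB L2 MTU $1.
ipoib_ud_mtu() {
    echo "$(($1 - IPOIB_ENCAP_LEN))"
}

# Tabulate the result for each standard IB MTU size.
for ib_mtu in 256 512 1024 2048 4096; do
    printf 'IB MTU %4d -> IPoIB MTU %4d\n' \
        "$ib_mtu" "$(ipoib_ud_mtu "$ib_mtu")"
done
```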

  In connected mode, the IB RC (Reliable Connected) transport is used.
  Connected mode takes advantage of the connected nature of the IB
  transport and allows an MTU up to the maximal IP packet size of 64K,
  which reduces the number of IP packets needed for handling large UDP
  datagrams, TCP segments, etc., and increases the performance for
  large messages.

  In connected mode, the interface's UD QP is still used for multicast
  and for communication with peers that don't support connected mode.
  In this case, RX emulation of ICMP PMTU packets is used to cause the
  networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads

  If the IB HW supports IPoIB stateless offloads, IPoIB advertises
  TCP/IP checksum and/or Large Send (LSO) offloading capability to the
  network stack.

  Large Receive (LRO) offloading is also implemented and may be turned
  on/off using ethtool calls. Currently LRO is supported only for
  checksum offload capable devices.

  Stateless offloads are supported only in datagram mode.
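  The ethtool calls mentioned above might look like the following
  sketch. The wrapper names are hypothetical; "-k" and "-K" are the
  standard ethtool options for showing and changing offload features.
  Whether a given feature can actually be toggled depends on the
  device and kernel.

```shell
# Show the offload features currently reported for interface $1.
ipoib_show_offloads() {
    ethtool -k "$1"
}

# Turn LRO on or off for interface $1; $2 must be "on" or "off".
ipoib_set_lro() {
    case "$2" in
        on|off) ethtool -K "$1" lro "$2" ;;
        *) echo "usage: ipoib_set_lro <intf> on|off" >&2; return 1 ;;
    esac
}
```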

Interrupt moderation

  If the underlying IB device supports CQ event moderation, one can
  use ethtool to set interrupt mitigation parameters and thus reduce
  the overhead incurred by handling interrupts. The main code path of
  IPoIB doesn't use events for TX completion signaling so only RX
  moderation is supported.
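  A sketch of setting RX moderation with ethtool's coalescing options.
  The wrapper names are hypothetical, the values in the usage note are
  arbitrary examples rather than recommendations, and the device may
  reject settings it does not support.

```shell
# Show the current coalescing (interrupt moderation) settings of $1.
ipoib_show_moderation() {
    ethtool -c "$1"
}

# Set RX moderation on interface $1:
#   $2 = microseconds to wait before raising an RX interrupt
#   $3 = number of received frames that triggers an RX interrupt
ipoib_set_rx_moderation() {
    ethtool -C "$1" rx-usecs "$2" rx-frames "$3"
}
```

  For example, "ipoib_set_rx_moderation ib0 16 32" requests an
  interrupt after 16 microseconds or 32 frames.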

Debugging Information

  Compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
  to 'y' builds tracing messages into the driver. They are
  turned on by setting the module parameters debug_level and
  mcast_debug_level to 1. These parameters can be controlled at
  runtime through files in /sys/module/ib_ipoib/.
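  A sketch of toggling both parameters at runtime. It assumes the
  parameter files live in the "parameters" subdirectory of
  /sys/module/ib_ipoib/, as is usual for module parameters; the
  function name and the MODPARAM_DIR override are hypothetical.

```shell
# Assumed location of the parameter files; overridable for dry runs.
MODPARAM_DIR="${MODPARAM_DIR:-/sys/module/ib_ipoib/parameters}"

# Set debug_level and mcast_debug_level to $1 (0 or 1).
# Parameters that are not present or not writable are skipped.
ipoib_set_debug() {
    for param in debug_level mcast_debug_level; do
        if [ -w "$MODPARAM_DIR/$param" ]; then
            echo "$1" > "$MODPARAM_DIR/$param"
        fi
    done
}
```

  The same parameters can also be given at load time, for example
  "modprobe ib_ipoib debug_level=1 mcast_debug_level=1".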

  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
  virtual filesystem. By mounting this filesystem, for example with

    mount -t debugfs none /sys/kernel/debug

  it is possible to get statistics about multicast groups from the
  files /sys/kernel/debug/ipoib/ib0_mcg and so on.
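  To look at every multicast statistics file at once, a loop such as
  the following can be used. This is a sketch: the function name and
  the IPOIB_DEBUGFS override are hypothetical, and the contents of the
  files are printed as-is without interpretation.

```shell
# Directory holding the IPoIB debugfs files; overridable for dry runs.
IPOIB_DEBUGFS="${IPOIB_DEBUGFS:-/sys/kernel/debug/ipoib}"

# Print each <intf>_mcg file, preceded by a header with its name.
ipoib_dump_mcg() {
    for f in "$IPOIB_DEBUGFS"/*_mcg; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        printf '== %s ==\n' "${f##*/}"
        cat "$f"
    done
}
```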

  The performance impact of this option is negligible, so it
  is safe to enable this option with debug_level set to 0 for normal
  operation.

  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
  the data path when data_debug_level is set to 1. However, even with
  the output disabled, enabling this configuration option will affect
  performance, because it adds tests to the fast path.

References

  Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
    http://ietf.org/rfc/rfc4391.txt
  IP over InfiniBand (IPoIB) Architecture (RFC 4392)
    http://ietf.org/rfc/rfc4392.txt
  IP over InfiniBand: Connected Mode (RFC 4755)
    http://ietf.org/rfc/rfc4755.txt