Ethernet switch device driver model (switchdev)
===============================================
Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip.  Other setups
with SR-IOV or soft switches, such as OVS, are possible.


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+-+--+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+---+------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

	#include <linux/netdevice.h>
	#include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev.  A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for higher
level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers.  Using
standard netdev tools (iproute2, ethtool, etc.), the port netdev can also give
the user access to the physical properties of the switch port, such as PHY
link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs.  All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data
plane and is handled by a different driver, such as a NIC driver, bound to the
management port device.

Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration.  For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name.  The udev rule would be:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="<driver>", ATTR{phys_port_name}!="", \
	NAME="$attr{phys_port_name}"

The suggested naming convention is "swXpYsZ", where X is the switch name or ID,
Y is the port name or ID, and Z is the sub-port name or ID.  For example,
sw1p1s0 would be sub-port 0 on port 1 on switch 1.
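
As an illustration of the naming convention only, the following user-space
sketch builds a "swXpYsZ" string.  The helper name format_phys_port_name() and
its "negative sub-port means unsplit port" rule are assumptions made for this
example; they are not part of the switchdev API.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): format a port name following
 * the suggested "swXpYsZ" convention.  A negative subport means the port
 * is not split, so the "sZ" suffix is omitted.
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int format_phys_port_name(char *buf, size_t len,
				 unsigned int sw, unsigned int port,
				 int subport)
{
	int n;

	if (subport < 0)
		n = snprintf(buf, len, "sw%up%u", sw, port);
	else
		n = snprintf(buf, len, "sw%up%us%d", sw, port, subport);

	/* snprintf reports the untruncated length; treat truncation as error */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A driver would produce such a string from its ndo_get_phys_port_name
implementation, and the udev rule above would then copy it into the final
netdev name.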

Switch ID
^^^^^^^^^

The switchdev driver must implement the switchdev op switchdev_port_attr_get
for SWITCHDEV_ATTR_PORT_PARENT_ID for each port netdev, returning the same
physical ID for each port of a switch.  The ID must be unique between switches
on the same system.  The ID does not need to be unique between switches on
different systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.
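
The "same switch" test can be pictured as a byte-wise comparison of two IDs.
The sketch below is a user-space model: struct phys_id, PHYS_ID_MAX_LEN and
same_switch() are names invented for this example and only approximate how an
ID of variable length might be stored and compared.

```c
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32	/* assumed upper bound for this sketch */

/* Simplified stand-in for the ID a port reports as its parent (switch) ID. */
struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports belong to the same switch when their parent IDs are identical. */
static bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}
```

Comparing the length first keeps a short ID from matching the prefix of a
longer one.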

Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns.  A netns-aware
driver/device would not set this flag and would be responsible for
partitioning hardware to preserve netns containment.  This means hardware
cannot forward traffic from a port in one namespace to another port in another
namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs.  The default construct is a standalone
router port, used to offload L3 forwarding.  Two or more ports can be bonded
together to form a LAG.  Two or more ports (or LAGs) can be bridged to bridge
L2 networks.  VLANs can be applied to sub-divide L2 networks.  L2-over-L3
tunnels can be built on ports.  These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
bond will see its upper master change.  If that bond is moved into a bridge,
the bond's upper master will change.  And so on.  The driver will track such
movements to know what position a port occupies in the overall topology by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.
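
The bookkeeping described above can be modelled in user space as a chain of
"upper" pointers that is updated on each change event and walked to find the
enclosing bridge.  Everything in this sketch (struct toy_dev,
toy_changeupper(), toy_find_bridge()) is hypothetical and stands in for the
driver's own state; a real driver would react to NETDEV_CHANGEUPPER from a
netdevice notifier instead.

```c
#include <stddef.h>
#include <stdbool.h>

enum dev_kind { DEV_PORT, DEV_BOND, DEV_BRIDGE };

/* Toy stand-in for a net_device: just a kind and a link to its upper master. */
struct toy_dev {
	enum dev_kind kind;
	struct toy_dev *upper;	/* NULL when the device has no master */
};

/* Model of handling a change-upper event: record or clear the new master. */
static void toy_changeupper(struct toy_dev *dev, struct toy_dev *upper,
			    bool linking)
{
	dev->upper = linking ? upper : NULL;
}

/* Walk the chain of masters and return the bridge above dev, if any. */
static struct toy_dev *toy_find_bridge(struct toy_dev *dev)
{
	for (dev = dev->upper; dev; dev = dev->upper)
		if (dev->kind == DEV_BRIDGE)
			return dev;
	return NULL;
}
```

With this model, moving sw1p1 into a bond and then the bond into a bridge
makes toy_find_bridge() on sw1p1 return that bridge, mirroring the two-step
example in the text.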

L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device.  An
FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
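
To make the {port, MAC, VLAN} tuple concrete, here is a deliberately tiny
user-space FDB: entries are keyed by MAC and VLAN and yield a destination
port.  The toy_fdb_* names and the fixed table size are assumptions of this
sketch; real devices use hardware tables, and the bridge keeps its own FDB.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE 4	/* tiny on purpose, to show the table-full case */

struct toy_fdb_entry {
	bool used;
	uint8_t mac[6];
	uint16_t vid;
	int port;	/* forwarding destination */
};

struct toy_fdb {
	struct toy_fdb_entry e[TOY_FDB_SIZE];
};

static struct toy_fdb_entry *toy_fdb_find(struct toy_fdb *fdb,
					  const uint8_t mac[6], uint16_t vid)
{
	int i;

	for (i = 0; i < TOY_FDB_SIZE; i++)
		if (fdb->e[i].used && fdb->e[i].vid == vid &&
		    memcmp(fdb->e[i].mac, mac, 6) == 0)
			return &fdb->e[i];
	return NULL;
}

/* Add or update an entry; returns 0 on success, -1 when the table is full. */
static int toy_fdb_add(struct toy_fdb *fdb, const uint8_t mac[6],
		       uint16_t vid, int port)
{
	struct toy_fdb_entry *ent = toy_fdb_find(fdb, mac, vid);
	int i;

	/* No existing entry for this key: take the first free slot. */
	for (i = 0; !ent && i < TOY_FDB_SIZE; i++)
		if (!fdb->e[i].used)
			ent = &fdb->e[i];
	if (!ent)
		return -1;

	ent->used = true;
	memcpy(ent->mac, mac, 6);
	ent->vid = vid;
	ent->port = port;
	return 0;
}

/* Returns the destination port for {mac, vid}, or -1 if unknown. */
static int toy_fdb_lookup(struct toy_fdb *fdb, const uint8_t mac[6],
			  uint16_t vid)
{
	struct toy_fdb_entry *ent = toy_fdb_find(fdb, mac, vid);

	return ent ? ent->port : -1;
}
```

A lookup miss (-1) corresponds to an unknown unicast destination, which is
flooded within the VLAN rather than forwarded to a single port.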

To offload L2 bridging, the switchdev driver/device should support:

	- Static FDB entries installed on a bridge port
	- Notification of learned/forgotten src mac/vlans from device
	- STP state changes on the port
	- VLAN flooding of multicast/broadcast and unknown unicast packets

Static FDB Entries
^^^^^^^^^^^^^^^^^^

The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
to support static FDB entries installed to the device.  Static bridge FDB
entries are installed, for example, using the iproute2 bridge command:

	bridge fdb add ADDR dev DEV [vlan VID] [self]

The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
ops, and handle add/delete/dump of SWITCHDEV_OBJ_PORT_FDB object using
switchdev_port_obj_xxx ops.

XXX: what should be done if offloading this rule to hardware fails (for
example, due to full capacity in hardware tables)?

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic.  To enable VLAN support, turn on VLAN filtering:

	echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples.  The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call:

	err = call_switchdev_notifiers(val, dev, info);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info.  On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED.  The iproute2 bridge
command will label these entries "offload":

	$ bridge fdb
	52:54:00:12:35:01 dev sw1p1 master br0 permanent
	00:02:00:00:02:00 dev sw1p1 master br0 offload
	00:02:00:00:02:00 dev sw1p1 self
	52:54:00:12:35:02 dev sw1p2 master br0 permanent
	00:02:00:00:03:00 dev sw1p2 master br0 offload
	00:02:00:00:03:00 dev sw1p2 self
	33:33:00:00:00:01 dev eth0 self permanent
	01:00:5e:00:00:01 dev eth0 self permanent
	33:33:ff:00:00:00 dev eth0 self permanent
	01:80:c2:00:00:0e dev eth0 self permanent
	33:33:00:00:00:01 dev br0 self permanent
	01:00:5e:00:00:01 dev br0 self permanent
	33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command:

	bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync:

	bridge link set dev DEV learning on self
	bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB.  It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning and learning_sync port attributes, the driver implements
switchdev op switchdev_port_attr_get/set for SWITCHDEV_ATTR_PORT_BRIDGE_FLAGS.
The driver should initialize the attributes to the hardware defaults.

FDB Ageing
^^^^^^^^^^

There are two FDB ageing models supported: 1) ageing by the device, and 2)
ageing by the kernel.  Ageing by the device is preferred if many FDB entries
are supported.  The driver calls call_switchdev_notifiers(SWITCHDEV_FDB_DEL,
...) to age out the FDB entry.  In this model, ageing by the kernel should be
turned off.  XXX: how to turn off ageing in kernel on a per-port basis or
otherwise prevent the kernel from ageing out the FDB entry?

In the kernel ageing model, the standard bridge ageing mechanism is used to age
out stale FDB entries.  To keep an FDB entry "alive", the driver should refresh
the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
notification will reset the FDB entry's last-used time to now.  The driver
should rate limit refresh notifications, for example, to no more than once a
second.  If the FDB entry expires, fdb_delete is called to remove the entry
from the device.
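
One simple way to rate limit refreshes is to remember, per FDB entry, when the
last notification was sent.  The sketch below is a user-space model under that
assumption; struct toy_fdb_refresh, toy_fdb_should_refresh() and the
millisecond clock are hypothetical, and the one-second interval is just the
example figure from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define REFRESH_INTERVAL_MS 1000	/* "no more than once a second" */

struct toy_fdb_refresh {
	bool notified;		/* has any refresh been sent yet? */
	uint64_t last_ms;	/* time of the last refresh notification */
};

/*
 * Called when the device reports a hit on the entry at time now_ms.
 * Returns true when the driver should send a refresh notification
 * (SWITCHDEV_FDB_ADD in the text above), false when it should stay quiet.
 */
static bool toy_fdb_should_refresh(struct toy_fdb_refresh *r, uint64_t now_ms)
{
	if (r->notified && now_ms - r->last_ms < REFRESH_INTERVAL_MS)
		return false;

	r->notified = true;
	r->last_ms = now_ms;
	return true;
}
```

The first hit always notifies; later hits within the interval are suppressed,
so a busy entry costs at most one notification per second.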

STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g. mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_STP_UPDATE.

State is one of BR_STATE_*.  The switch driver can use STP state updates to
update the ingress packet filter list for the port.  For example, if the port
is DISABLED, no packets should pass, but if the port moves to BLOCKING, then
STP BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can
pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the
port, so packet filters should be applied consistently across untagged and
tagged VLANs on the port.
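
The filter decision for the two states named above can be sketched as a small
predicate.  This is a user-space model: toy_is_ieee_link_local() and
toy_stp_may_pass() are invented names, and treating LISTENING and LEARNING
like BLOCKING is a simplification assumed here (a device in LEARNING would
normally still learn source addresses from data frames without forwarding
them).

```c
#include <stdbool.h>
#include <stdint.h>

/* Same names and values as BR_STATE_* in <linux/if_bridge.h>. */
enum {
	BR_STATE_DISABLED = 0,
	BR_STATE_LISTENING = 1,
	BR_STATE_LEARNING = 2,
	BR_STATE_FORWARDING = 3,
	BR_STATE_BLOCKING = 4,
};

/* True for the IEEE 01:80:c2:xx:xx:xx range mentioned above. */
static bool toy_is_ieee_link_local(const uint8_t dmac[6])
{
	return dmac[0] == 0x01 && dmac[1] == 0x80 && dmac[2] == 0xc2;
}

/* May a packet with destination dmac pass the port in this STP state? */
static bool toy_stp_may_pass(int state, const uint8_t dmac[6])
{
	switch (state) {
	case BR_STATE_FORWARDING:
		return true;		/* all traffic passes */
	case BR_STATE_BLOCKING:
	case BR_STATE_LISTENING:	/* assumption: same as BLOCKING */
	case BR_STATE_LEARNING:		/* assumption: same as BLOCKING */
		return toy_is_ieee_link_local(dmac);
	case BR_STATE_DISABLED:
	default:
		return false;		/* nothing passes */
	}
}
```

Because the predicate looks only at the STP state and the destination MAC, it
gives the same answer for tagged and untagged packets, in line with the note
above.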

Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by the
port's current STP state.  The switch driver, knowing which ports are within
which VLAN L2 domain, can program the switch device for flooding.  The packet
should also be sent to the port netdev for processing by the bridge driver.
The bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the device/driver should mark a packet as already
forwarded using skb->offload_fwd_mark.  The same mark is set on the device
ports in the domain using dev->offload_fwd_mark.  If the skb->offload_fwd_mark
is non-zero and matches the forwarding egress port's dev->offload_fwd_mark, the
kernel will drop the skb right before transmit on the egress port, with the
understanding that the device already forwarded the packet on the same egress
port.  The driver can use switchdev_port_fwd_mark_set() to set a globally
unique mark for the port's dev->offload_fwd_mark, based on the port's parent ID
(switch ID) and a group ifindex.
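
The drop rule in the previous paragraph reduces to a two-part comparison.  The
function below is a user-space restatement of that rule only;
toy_drop_on_egress() is a name made up for this example and the two arguments
stand for skb->offload_fwd_mark and the egress port's dev->offload_fwd_mark.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Drop only when the packet carries a mark (non-zero) and that mark is
 * the one assigned to the egress port, meaning the device has already
 * forwarded the packet out of that port.
 */
static bool toy_drop_on_egress(uint32_t skb_mark, uint32_t dev_mark)
{
	return skb_mark != 0 && skb_mark == dev_mark;
}
```

The non-zero test matters: unmarked packets (mark 0) must never be dropped,
even on ports whose own mark is also still 0.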

It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding.  This is not ideal as the number
of ports in the L2 domain scales, because the device is much more efficient at
flooding packets than software.

IGMP Snooping
^^^^^^^^^^^^^

XXX: complete this section


L3 Routing Offload
------------------

Offloading L3 routing requires that device be programmed with FIB entries from
the kernel, with the device doing the FIB lookup and forwarding.  The device
does a longest prefix match (LPM) on FIB entries matching route prefix and
forwards the packet to the matching FIB entry's nexthop(s) egress ports.
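
For readers unfamiliar with LPM, the following user-space sketch does a linear
longest prefix match over a small route table.  It only illustrates the
selection rule; struct toy_route, toy_prefix_mask() and toy_lpm() are
hypothetical, addresses are assumed to be in host byte order, and real devices
use dedicated lookup structures rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_route {
	uint32_t dst;		/* prefix, host byte order */
	int dst_len;		/* prefix length, 0..32 */
	int egress_port;	/* stands in for the resolved nexthop */
};

/* Network mask for a prefix length; length 0 must not shift by 32. */
static uint32_t toy_prefix_mask(int len)
{
	return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Return the most specific route covering addr, or NULL if none matches. */
static const struct toy_route *toy_lpm(const struct toy_route *tbl, size_t n,
				       uint32_t addr)
{
	const struct toy_route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t mask = toy_prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

With a default route (length 0) and two /30 routes in the table, an address
inside one of the /30 prefixes selects that route, and any other address falls
back to the default.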

To program the device, the driver implements support for
SWITCHDEV_OBJ_IPV[4|6]_FIB object using switchdev_port_obj_xxx ops.
switchdev_port_obj_add is used both for adding a new FIB entry to the device
and for modifying an existing entry on the device.

XXX: Currently, only SWITCHDEV_OBJ_IPV4_FIB objects are supported.

SWITCHDEV_OBJ_IPV4_FIB object passes:

	struct switchdev_obj_ipv4_fib {         /* IPV4_FIB */
		u32 dst;
		int dst_len;
		struct fib_info *fi;
		u8 tos;
		u8 type;
		u32 nlflags;
		u32 tb_id;
	} ipv4_fib;

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id.  The *fi
structure holds details on the route and route's nexthops.  *dev is one of the
port netdevs mentioned in the route's nexthop list.  If the output port netdevs
referenced in the route's nexthop list don't all have the same switch ID, the
driver is not called to add/modify/delete the FIB entry.
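
The "all nexthops on the same switch" condition can be modelled as follows.
This is a user-space sketch: struct toy_switch_id, struct toy_fib_nh and
toy_fib_offloadable() are invented for the example, and treating a nexthop
whose output device is not a switch port (for instance a management NIC) as
having no matching ID is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_ID_MAX 32	/* assumed size limit for this sketch */

struct toy_switch_id {
	unsigned char id[TOY_ID_MAX];
	unsigned char len;
};

/* One nexthop of a route, reduced to what this check needs. */
struct toy_fib_nh {
	bool on_switch;			/* output dev is a switch port netdev */
	struct toy_switch_id parent;	/* valid only when on_switch is true */
};

static bool toy_same_id(const struct toy_switch_id *a,
			const struct toy_switch_id *b)
{
	return a->len == b->len && memcmp(a->id, b->id, a->len) == 0;
}

/* True when every nexthop egresses through a port of one and the same switch. */
static bool toy_fib_offloadable(const struct toy_fib_nh *nh, size_t n)
{
	size_t i;

	if (n == 0 || !nh[0].on_switch)
		return false;
	for (i = 1; i < n; i++)
		if (!nh[i].on_switch ||
		    !toy_same_id(&nh[0].parent, &nh[i].parent))
			return false;
	return true;
}
```

A multipath route whose nexthops sit on sw1p1 and sw1p2 passes this check,
while one that mixes a switch port with a port of another switch does not.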

Routes offloaded to the device are labeled with "offload" in the ip route
listing:

	$ ip route show
	default via 192.168.0.2 dev eth0
	11.0.0.0/30 dev sw1p1  proto kernel  scope link  src 11.0.0.2 offload
	11.0.0.4/30 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
	11.0.0.8/30 dev sw1p2  proto kernel  scope link  src 11.0.0.10 offload
	11.0.0.12/30 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
	12.0.0.2  proto zebra  metric 30 offload
		nexthop via 11.0.0.1  dev sw1p1 weight 1
		nexthop via 11.0.0.9  dev sw1p2 weight 1
	12.0.0.3 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
	12.0.0.4 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
	192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.15

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct dst mac address, the
nexthop gateways must be resolved to the neighbor's mac address.  Neighbor mac
address discovery comes via the ARP (or ND) process and is available via the
arp_tbl neighbor table.  To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process.  See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
for the routes as arp_tbl updates.  The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.
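
As a rough picture of the driver-side bookkeeping, the user-space sketch below
keeps a list of nexthop gateways and updates their resolved MAC when a
neighbor changes or is purged.  struct toy_resolved_nh and toy_neigh_update()
are hypothetical; in a driver, the update case would be driven by
NETEVENT_NEIGH_UPDATE and the purge case by ndo_neigh_destroy, and a changed
entry would then be written to the device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_resolved_nh {
	uint32_t gateway;	/* nexthop IPv4 address, host byte order */
	bool resolved;		/* true once the neighbor MAC is known */
	uint8_t mac[6];
};

/*
 * Apply a neighbor event for address ip to every nexthop using it as
 * gateway.  A non-NULL mac means the neighbor resolved (or changed);
 * mac == NULL means the neighbor entry was purged.
 * Returns the number of nexthops whose state changed, i.e. how many
 * would need to be reprogrammed in the device.
 */
static int toy_neigh_update(struct toy_resolved_nh *nh, size_t n,
			    uint32_t ip, const uint8_t *mac)
{
	int changed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (nh[i].gateway != ip)
			continue;
		if (mac) {
			if (nh[i].resolved && memcmp(nh[i].mac, mac, 6) == 0)
				continue;	/* nothing new to program */
			memcpy(nh[i].mac, mac, 6);
			nh[i].resolved = true;
		} else {
			if (!nh[i].resolved)
				continue;	/* already unresolved */
			nh[i].resolved = false;
		}
		changed++;
	}
	return changed;
}
```

Returning a change count lets the caller skip touching the hardware when a
repeated update carries the MAC address that is already programmed.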