Jesse Gross | ccb1352 | 2011-10-25 19:26:31 -0700 | [diff] [blame] | 1 | Open vSwitch datapath developer documentation |
| 2 | ============================================= |
| 3 | |
| 4 | The Open vSwitch kernel module allows flexible userspace control over |
| 5 | flow-level packet processing on selected network devices. It can be |
| 6 | used to implement a plain Ethernet switch, network device bonding, |
| 7 | VLAN processing, network access control, flow-based network control, |
| 8 | and so on. |
| 9 | |
| 10 | The kernel module implements multiple "datapaths" (analogous to |
| 11 | bridges), each of which can have multiple "vports" (analogous to ports |
| 12 | within a bridge). Each datapath also has associated with it a "flow |
| 13 | table" that userspace populates with "flows" that map from keys based |
| 14 | on packet headers and metadata to sets of actions. The most common |
| 15 | action forwards the packet to another vport; other actions are also |
| 16 | implemented. |
| 17 | |
| 18 | When a packet arrives on a vport, the kernel module processes it by |
| 19 | extracting its flow key and looking it up in the flow table. If there |
| 20 | is a matching flow, it executes the associated actions. If there is |
| 21 | no match, it queues the packet to userspace for processing (as part of |
| 22 | its processing, userspace will likely set up a flow to handle further |
| 23 | packets of the same type entirely in-kernel). |
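The lookup-then-upcall behavior described above can be sketched roughly as follows. This is a toy userspace model, not the kernel implementation: the names `Datapath`, `receive`, and `upcalls` are hypothetical, and flow keys are simplified to tuples of strings.

```python
class Datapath:
    """Toy model of one datapath: a flow table plus a queue of misses."""

    def __init__(self):
        self.flow_table = {}   # flow key -> list of actions
        self.upcalls = []      # (key, packet) pairs queued to userspace

    def receive(self, key, packet):
        """Return the actions of a matching flow, or queue a miss."""
        actions = self.flow_table.get(key)
        if actions is None:
            # No match: hand the packet and its parsed key to userspace.
            self.upcalls.append((key, packet))
            return None
        return actions


dp = Datapath()
key = ("in_port(1)", "eth_type(0x0800)")
first = dp.receive(key, b"pkt1")      # miss, queued to userspace
dp.flow_table[key] = ["output(2)"]    # userspace installs a flow
second = dp.receive(key, b"pkt2")     # later packets match the flow
```

Only the first packet of the flow reaches userspace; once the flow is installed, the second packet is handled by the table lookup alone.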
| 24 | |
| 25 | |
| 26 | Flow key compatibility |
| 27 | ---------------------- |
| 28 | |
| 29 | Network protocols evolve over time. New protocols become important |
| 30 | and existing protocols lose their prominence. For the Open vSwitch |
| 31 | kernel module to remain relevant, it must be possible for newer |
| 32 | versions to parse additional protocols as part of the flow key. It |
| 33 | might even be desirable, someday, to drop support for parsing |
| 34 | protocols that have become obsolete. Therefore, the Netlink interface |
| 35 | to Open vSwitch is designed to allow carefully written userspace |
| 36 | applications to work with any version of the flow key, past or future. |
| 37 | |
| 38 | To support this forward and backward compatibility, whenever the |
| 39 | kernel module passes a packet to userspace, it also passes along the |
| 40 | flow key that it parsed from the packet. Userspace then extracts its |
| 41 | own notion of a flow key from the packet and compares it against the |
| 42 | kernel-provided version: |
| 43 | |
| 44 | - If userspace's notion of the flow key for the packet matches the |
| 45 | kernel's, then nothing special is necessary. |
| 46 | |
| 47 | - If the kernel's flow key includes more fields than the userspace |
| 48 | version of the flow key, for example if the kernel decoded IPv6 |
| 49 | headers but userspace stopped at the Ethernet type (because it |
| 50 | does not understand IPv6), then again nothing special is |
| 51 | necessary. Userspace can still set up a flow in the usual way, |
| 52 | as long as it uses the kernel-provided flow key to do it. |
| 53 | |
| 54 | - If the userspace flow key includes more fields than the |
| 55 | kernel's, for example if userspace decoded an IPv6 header but |
| 56 | the kernel stopped at the Ethernet type, then userspace can |
| 57 | forward the packet manually, without setting up a flow in the |
| 58 | kernel. This case is bad for performance because every packet |
| 59 | that the kernel considers part of the flow must go to userspace, |
| 60 | but the forwarding behavior is correct. (If userspace can |
| 61 | determine that the values of the extra fields would not affect |
| 62 | forwarding behavior, then it could set up a flow anyway.) |
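The three cases above can be condensed into one decision, sketched here under simplifying assumptions: a flow key is modeled as a dict from attribute name to value, and the helper name `decide` is hypothetical. The sketch ignores the parenthetical optimization in the last case.

```python
def decide(kernel_key, user_key):
    """Choose how userspace handles one missed packet.

    Returns ("install", key_to_install) or ("forward", None).
    """
    if set(user_key) <= set(kernel_key):
        # Same fields, or the kernel parsed more than userspace did:
        # set up a flow, always using the kernel-provided key.
        return ("install", kernel_key)
    # Userspace parsed fields the kernel did not: forward manually.
    return ("forward", None)


kernel = {"eth": "...", "eth_type": 0x86DD, "ipv6": "..."}
user = {"eth": "...", "eth_type": 0x86DD}
```

With these values, `decide(kernel, user)` installs a flow using the kernel's key, while swapping the arguments yields manual forwarding.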
| 63 | |
| 64 | How flow keys evolve over time is important to making this work, so |
| 65 | the following sections go into detail. |
| 66 | |
| 67 | |
| 68 | Flow key format |
| 69 | --------------- |
| 70 | |
| 71 | A flow key is passed over a Netlink socket as a sequence of Netlink |
| 72 | attributes. Some attributes represent packet metadata, defined as any |
| 73 | information about a packet that cannot be extracted from the packet |
| 74 | itself, e.g. the vport on which the packet was received. Most |
| 75 | attributes, however, are extracted from headers within the packet, |
| 76 | e.g. source and destination addresses from Ethernet, IP, or TCP |
| 77 | headers. |
| 78 | |
| 79 | The <linux/openvswitch.h> header file defines the exact format of the |
| 80 | flow key attributes. For informal explanatory purposes here, we write |
| 81 | them as comma-separated strings, with parentheses indicating arguments |
| 82 | and nesting. For example, the following could represent a flow key |
| 83 | corresponding to a TCP packet that arrived on vport 1: |
| 84 | |
| 85 | in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), |
| 86 | eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0, |
| 87 | frag=no), tcp(src=49163, dst=80) |
| 88 | |
| 89 | Often we ellipsize arguments not important to the discussion, e.g.: |
| 90 | |
| 91 | in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) |
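As an illustration of the informal notation only (not of the Netlink wire format), the following sketch renders a list of attributes as a comma-separated string. The function name `format_key` and the (name, value) representation are assumptions made for this example.

```python
def format_key(attrs):
    """Render a list of (name, value) attributes in the informal notation."""
    parts = []
    for name, value in attrs:
        if isinstance(value, list):        # nested attributes
            inner = format_key(value)
        elif isinstance(value, dict):      # named arguments
            inner = ", ".join(f"{k}={v}" for k, v in value.items())
        else:                              # single argument
            inner = str(value)
        parts.append(f"{name}({inner})")
    return ", ".join(parts)


key = [
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
]
text = format_key(key)
```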
| 92 | |
| 93 | |
Andy Zhou | 03f0d91 | 2013-08-07 20:01:00 -0700 | [diff] [blame] | 94 | Wildcarded flow key format |
| 95 | -------------------------- |
| 96 | |
| 97 | A wildcarded flow is described with two sequences of Netlink attributes |
| 98 | passed over the Netlink socket: a flow key, exactly as described above, and an |
| 99 | optional corresponding flow mask. |
| 100 | |
| 101 | A wildcarded flow can represent a group of exact match flows. Each '1' bit |
| 102 | in the mask specifies an exact match with the corresponding bit in the flow key. |
| 103 | A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit |
| 104 | of an incoming packet. Using wildcarded flows can improve the flow setup rate |
| 105 | by reducing the number of new flows that the user space program must process. |
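The mask semantics amount to comparing only the bits selected by the mask. A minimal sketch, treating a single field (here an IPv4 destination address) as an integer; the function name `matches` is hypothetical:

```python
def matches(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


key = 0xAC120034    # 172.18.0.52
mask = 0xFFFF0000   # exact match on the first 16 bits, don't care on the rest

same_prefix = matches(0xAC120909, key, mask)    # 172.18.9.9
other_prefix = matches(0xAC100014, key, mask)   # 172.16.0.20
```

Here one wildcarded flow covers every destination in 172.18.0.0/16, where exact match flows would require a separate flow setup per address.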
| 106 | |
| 107 | Support for the mask Netlink attribute is optional for both the kernel and user |
| 108 | space program. The kernel can ignore the mask attribute, installing an exact |
| 109 | match flow, or reduce the number of don't care bits in the kernel to less than |
| 110 | what was specified by the user space program. In this case, variations in bits |
| 111 | that the kernel does not implement will simply result in additional flow setups. |
| 112 | The kernel module will also work with user space programs that neither support |
| 113 | nor supply flow mask attributes. |
| 114 | |
| 115 | Since the kernel may ignore or modify wildcard bits, it can be difficult for |
| 116 | the userspace program to know exactly what matches are installed. There are |
| 117 | two possible approaches: reactively install flows as they miss the kernel |
| 118 | flow table (and therefore not attempt to determine wildcard changes at all) |
| 119 | or use the kernel's response messages to determine the installed wildcards. |
| 120 | |
| 121 | When interacting with userspace, the kernel should maintain the match portion |
| 122 | of the key exactly as originally installed. This provides a handle to |
| 123 | identify the flow for all future operations. However, when reporting the |
| 124 | mask of an installed flow, the mask should include any restrictions imposed |
| 125 | by the kernel. |
| 126 | |
| 127 | The behavior when using overlapping wildcarded flows is undefined. It is the |
| 128 | responsibility of the user space program to ensure that any incoming packet |
| 129 | can match at most one flow, wildcarded or not. The current implementation |
| 130 | performs best-effort detection of overlapping wildcarded flows and may reject |
| 131 | some but not all of them. However, this behavior may change in future versions. |
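A user space program can check the non-overlap requirement itself. Two masked flows can match a common packet exactly when their keys agree on every bit that both masks select. The sketch below applies that test to a single integer field; it is an illustration of the condition, not the kernel's best-effort detection, and the name `overlaps` is hypothetical.

```python
def overlaps(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both masked flows."""
    common = mask_a & mask_b
    return (key_a & common) == (key_b & common)


wide = (0xAC120000, 0xFFFF0000)     # 172.18.0.0/16
narrow = (0xAC120034, 0xFFFFFFFF)   # 172.18.0.52 exactly
other = (0xAC100000, 0xFFFF0000)    # 172.16.0.0/16

conflict = overlaps(*wide, *narrow)   # 172.18.0.52 matches both flows
disjoint = overlaps(*wide, *other)    # the prefixes differ
```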
| 132 | |
| 133 | |
Joe Stringer | 74ed7ab | 2015-01-21 16:42:52 -0800 | [diff] [blame] | 134 | Unique flow identifiers |
| 135 | ----------------------- |
| 136 | |
| 137 | An alternative to using the original match portion of a key as the handle for |
| 138 | flow identification is a unique flow identifier, or "UFID". UFIDs are optional |
| 139 | for both the kernel and user space program. |
| 140 | |
| 141 | User space programs that support UFID are expected to provide it during flow |
| 142 | setup in addition to the flow, then refer to the flow using the UFID for all |
| 143 | future operations. The kernel is not required to index flows by the original |
| 144 | flow key if a UFID is specified. |
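One possible consequence of the last sentence, sketched as a toy table: a flow installed with a UFID is reachable only through that UFID. The class `FlowTable` and its methods are hypothetical, and a real implementation is free to index such flows by key as well.

```python
class FlowTable:
    """Toy flow table that indexes by UFID when one is supplied."""

    def __init__(self):
        self.by_ufid = {}   # UFID -> (key, actions)
        self.by_key = {}    # key -> actions, for flows without a UFID

    def install(self, key, actions, ufid=None):
        if ufid is not None:
            self.by_ufid[ufid] = (key, actions)
        else:
            self.by_key[key] = actions

    def delete(self, key=None, ufid=None):
        """Remove a flow; return True if one was found."""
        if ufid is not None:
            return self.by_ufid.pop(ufid, None) is not None
        return self.by_key.pop(key, None) is not None


table = FlowTable()
table.install(("in_port(1)",), ["output(2)"], ufid=0x1234)
found_by_key = table.delete(key=("in_port(1)",))   # not indexed by key
found_by_ufid = table.delete(ufid=0x1234)
```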
| 145 | |
| 146 | |
| 147 | Basic rule for evolving flow keys |
| 148 | --------------------------------- |
| 149 | |
| 150 | Some care is needed to really maintain forward and backward |
| 151 | compatibility for applications that follow the rules listed under |
| 152 | "Flow key compatibility" above. |
| 153 | |
| 154 | The basic rule is obvious: |
| 155 | |
| 156 | ------------------------------------------------------------------ |
| 157 | New network protocol support must only supplement existing flow |
| 158 | key attributes. It must not change the meaning of already defined |
| 159 | flow key attributes. |
| 160 | ------------------------------------------------------------------ |
| 161 | |
| 162 | This rule does have less-obvious consequences so it is worth working |
| 163 | through a few examples. Suppose, for example, that the kernel module |
| 164 | did not already implement VLAN parsing. Instead, it just interpreted |
| 165 | the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the |
| 166 | packet. The flow key for any packet with an 802.1Q header would look |
| 167 | essentially like this, ignoring metadata: |
| 168 | |
| 169 | eth(...), eth_type(0x8100) |
| 170 | |
| 171 | Naively, to add VLAN support, it makes sense to add a new "vlan" flow |
| 172 | key attribute to contain the VLAN tag, then continue to decode the |
| 173 | encapsulated headers beyond the VLAN tag using the existing field |
| 174 | definitions. With this change, a TCP packet in VLAN 10 would have a |
| 175 | flow key much like this: |
| 176 | |
| 177 | eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) |
| 178 | |
| 179 | But this change would negatively affect a userspace application that |
| 180 | has not been updated to understand the new "vlan" flow key attribute. |
| 181 | The application could, following the flow compatibility rules above, |
| 182 | ignore the "vlan" attribute that it does not understand and therefore |
| 183 | assume that the flow contained IP packets. This is a bad assumption |
| 184 | (the flow only contains IP packets if one parses and skips over the |
| 185 | 802.1Q header) and it could cause the application's behavior to change |
| 186 | across kernel versions even though it follows the compatibility rules. |
| 187 | |
| 188 | The solution is to use a set of nested attributes. This is, for |
| 189 | example, why 802.1Q support uses nested attributes. A TCP packet in |
| 190 | VLAN 10 is actually expressed as: |
| 191 | |
| 192 | eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), |
| 193 | ip(proto=6, ...), tcp(...)) |
| 194 | |
| 195 | Notice how the "eth_type", "ip", and "tcp" flow key attributes are |
| 196 | nested inside the "encap" attribute. Thus, an application that does |
| 197 | not understand the "vlan" key will not see any of those attributes |
| 198 | and therefore will not misinterpret them. (Also, the outer eth_type |
| 199 | is still 0x8100, not changed to 0x0800.) |
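The difference between the naive and the nested layout can be seen by modeling an older application that interprets only the top-level attributes it knows. This is a sketch under assumptions: the attribute lists, the `KNOWN` set, and the helper `visible` are hypothetical stand-ins for a real Netlink parser.

```python
KNOWN = {"eth", "eth_type", "ip", "tcp"}   # predates "vlan" and "encap"


def visible(attrs, known=KNOWN):
    """Top-level attributes that an older application would interpret."""
    return [(name, value) for name, value in attrs if name in known]


naive = [
    ("eth", "..."), ("vlan", "vid=10"), ("eth_type", 0x0800),
    ("ip", "proto=6"), ("tcp", "..."),
]
nested = [
    ("eth", "..."), ("eth_type", 0x8100), ("vlan", "vid=10"),
    ("encap", [("eth_type", 0x0800), ("ip", "proto=6"), ("tcp", "...")]),
]

naive_view = visible(naive)     # looks like an untagged IP/TCP packet
nested_view = visible(nested)   # only eth and eth_type(0x8100) remain
```

With the naive layout the old application wrongly concludes that the flow carries plain IP packets; with the nested layout it sees only an Ethernet frame of type 0x8100.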
| 200 | |
| 201 | Handling malformed packets |
| 202 | -------------------------- |
| 203 | |
| 204 | Don't drop packets in the kernel for malformed protocol headers, bad |
| 205 | checksums, etc. This would prevent userspace from implementing a |
| 206 | simple Ethernet switch that forwards every packet. |
| 207 | |
| 208 | Instead, in such a case, include an attribute with "empty" content. |
| 209 | It doesn't matter if the empty content could be valid protocol values, |
| 210 | as long as those values are rarely seen in practice, because userspace |
| 211 | can always arrange for all packets with those values to be sent to |
| 212 | userspace and handle them individually. |
| 213 | |
| 214 | For example, consider a packet that contains an IP header that |
| 215 | indicates protocol 6 for TCP, but which is truncated just after the IP |
| 216 | header, so that the TCP header is missing. The flow key for this |
| 217 | packet would include a tcp attribute with all-zero src and dst, like |
| 218 | this: |
| 219 | |
| 220 | eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) |
| 221 | |
| 222 | As another example, consider a packet with an Ethernet type of 0x8100, |
| 223 | indicating that a VLAN TCI should follow, but which is truncated just |
| 224 | after the Ethernet type. The flow key for this packet would include |
| 225 | an all-zero-bits vlan and an empty encap attribute, like this: |
| 226 | |
| 227 | eth(...), eth_type(0x8100), vlan(0), encap() |
| 228 | |
| 229 | Unlike a TCP packet with source and destination ports 0, an |
| 230 | all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka |
| 231 | VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan |
| 232 | attribute expressly to allow this situation to be distinguished. |
| 233 | Thus, the flow key in this second example unambiguously indicates a |
| 234 | missing or malformed VLAN TCI. |
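The role of the CFI bit can be illustrated with a little bit arithmetic. In the 16-bit TCI, the PCP occupies the top three bits, the CFI bit is 0x1000, and the VID is the low twelve bits. The helper names below are hypothetical.

```python
VLAN_CFI = 0x1000   # bit 12 of the 16-bit TCI


def make_tci(vid, pcp):
    """TCI as placed in a vlan attribute for a well-formed tag (CFI set)."""
    return (pcp << 13) | VLAN_CFI | vid


def tci_present(tci):
    """True if the attribute describes a tag actually found in the packet."""
    return bool(tci & VLAN_CFI)


zero_tag = make_tci(vid=0, pcp=0)   # a real tag whose fields are all zero
truncated = 0                       # vlan(0): TCI missing or malformed
```

A genuine tag with VID 0 and PCP 0 is reported as 0x1000, so it cannot be confused with the all-zero-bits value used for a truncated packet.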
| 235 | |
| 236 | Other rules |
| 237 | ----------- |
| 238 | |
| 239 | The other rules for flow keys are much less subtle: |
| 240 | |
| 241 | - Duplicate attributes are not allowed at a given nesting level. |
| 242 | |
| 243 | - Ordering of attributes is not significant. |
| 244 | |
| 245 | - When the kernel sends a given flow key to userspace, it always |
| 246 | composes it the same way. This allows userspace to hash and |
| 247 | compare entire flow keys that it may not be able to fully |
| 248 | interpret. |
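Because the kernel composes a given flow key the same way every time, userspace can treat the serialized key as an opaque byte string. A minimal sketch, assuming the raw attribute bytes are available as a `bytes` object; the byte values and the names `key_digest` and `flows_seen` are made up for illustration.

```python
import hashlib


def key_digest(raw_key):
    """Stable identifier for a flow key, computed without parsing it."""
    return hashlib.sha256(raw_key).hexdigest()


flows_seen = {}

raw_a = bytes.fromhex("0800030001000000")   # made-up attribute bytes
raw_b = bytes.fromhex("0800030002000000")

flows_seen[key_digest(raw_a)] = "first flow"
repeat = key_digest(raw_a) in flows_seen    # same key, same bytes
fresh = key_digest(raw_b) in flows_seen     # different key
```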