\documentstyle[12pt,twoside]{article}
\def\TITLE{IPv6 Flow Labels}
\input preamble
\begin{center}
\Large\bf IPv6 Flow Labels in Linux-2.2.
\end{center}


\begin{center}
{ \large Alexey~N.~Kuznetsov } \\
\em Institute for Nuclear Research, Moscow \\
\verb|kuznet@ms2.inr.ac.ru| \\
\rm April 11, 1999
\end{center}

\vspace{5mm}

\tableofcontents

\section{Introduction.}

Every IPv6 packet carries 28 bits of flow information. RFC2460 splits
these bits into two fields: 8 bits of traffic class (or DS field, if you
prefer this term) and 20 bits of flow label. Currently there exists
no well-defined API to manage IPv6 flow information. In this document
I describe an attempt to design such an API for the Linux-2.2 IPv6 stack.

\vskip 1mm

The API must solve the following tasks:

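
As a minimal sketch of this split, assume that the flow information is
held as the first 32-bit word of the IPv6 header in network byte order
(4 bits of version followed by the 28 bits described above). The helper
names and mask macros below are hypothetical; they are not part of any
kernel or libc API.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohl(), htonl() */

/* Masks in host byte order; illustrative names, not kernel API. */
#define FLOWINFO_TCLASS_MASK 0x0FF00000u  /* 8 bits of traffic class */
#define FLOWINFO_LABEL_MASK  0x000FFFFFu  /* 20 bits of flow label   */

/* Extract the traffic class from a flowinfo word in network byte order. */
static inline uint8_t flowinfo_tclass(uint32_t flowinfo_net)
{
    return (uint8_t)((ntohl(flowinfo_net) & FLOWINFO_TCLASS_MASK) >> 20);
}

/* Extract the 20-bit flow label, returned in host byte order. */
static inline uint32_t flowinfo_label(uint32_t flowinfo_net)
{
    return ntohl(flowinfo_net) & FLOWINFO_LABEL_MASK;
}

/* Combine traffic class and label into a network byte order word;
   the version nibble is left as zero. */
static inline uint32_t flowinfo_make(uint8_t tclass, uint32_t label)
{
    return htonl(((uint32_t)tclass << 20) | (label & FLOWINFO_LABEL_MASK));
}
```

Keeping the conversion to host byte order inside the helpers means that
callers never have to mask a network byte order value by hand.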
\begin{enumerate}

\item To allow the user to set traffic class bits.

\item To allow the user to read traffic class bits of received packets.
This feature is not as useful as the first one; however, it will be
necessary, e.g.\ to implement ECN [RFC2481] for datagram oriented services
or to implement the receiver side of SRP or another end-to-end protocol
using traffic class bits.

\item To assign flow labels to packets sent by the user.

\item To get flow labels of received packets. I do not know
any applications of this feature, but it is possible that the receiver will
want to use flow labels to distinguish sub-flows.

\item To allocate flow labels in a way compliant with RFC2460. Namely:

\begin{itemize}
\item
Flow labels must be uniformly distributed (pseudo-)random numbers,
so that any subset of the 20 bits can be used as a hash key.

\item
Flows with coinciding source address and flow label must have identical
destination address and non-fragmentable extension headers (i.e.\
hop-by-hop options and all the headers up to and including the routing header,
if it is present).

\begin{NB}
There is a hole in the specs: some hop-by-hop options can be
defined only on a per-packet basis (e.g.\ the jumbo payload option).
Essentially, it means that such options cannot be present in packets
with flow labels.
\end{NB}
\begin{NB}
NB notes here and below reflect only my personal opinion;
they should be read with a smile or should not be read at all :-).
\end{NB}


\item
Flow labels have a finite lifetime, and the source is not allowed to reuse
a flow label for another flow until the maximal lifetime has expired,
so that intermediate nodes will be able to invalidate flow state before
the label is taken over by another flow.
Flow state, including lifetime, is propagated along the datagram path
by some application specific methods
(e.g.\ in RSVP PATH messages or in some hop-by-hop option).


\end{itemize}

\end{enumerate}

\section{Sending/receiving flow information.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
It was proposed (where? I do not remember any explicit statement)
to solve the first four tasks using the
\verb|sin6_flowinfo| field added to \verb|struct| \verb|sockaddr_in6|
(see RFC2553).

\begin{NB}
This method is difficult to consider reasonable, because it
puts additional overhead on all services, even though only
a very small subset of them (none, to be more exact) really uses it.
It contradicts both the spirit and the letter of the IETF. Before RFC2553
one justification existed: IPv6 address alignment left a 4 byte
hole in \verb|sockaddr_in6| in any case. Now it has no justification.
\end{NB}

We have two problems with this method. The first one is common to all OSes:
if \verb|recvmsg()| initializes \verb|sin6_flowinfo| to the flow info
of the received packet, we lose one very important property of the BSD socket API,
namely, we can no longer use the received address for a reply directly
and have to mangle it, even if we are not interested in flowinfo subtleties.

\begin{NB}
RFC2553 adds a new requirement: to clear \verb|sin6_flowinfo|.
Certainly, it is not a solution but rather an attempt to force applications
to do unnecessary work. Well, as usual, one mistake in design
is followed by attempts to patch the hole and more mistakes...
\end{NB}

Another problem is Linux specific. Historically Linux IPv6 did not
initialize \verb|sin6_flowinfo| at all, so that, if the kernel does not
support flow labels, this field is not zero but a random number.
Some applications also did not take care of it.

\begin{NB}
According to RFC2553 such applications can be considered broken,
but I still think that they are right: clearing the whole address
before filling in the known fields is a robust but stupid solution.
Uselessly wasting CPU cycles and
memory bandwidth is not a good idea. Such patches are acceptable
as temporary hacks, but not as a standard for the future.
\end{NB}


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
By default Linux IPv6 does not read the \verb|sin6_flowinfo| field,
assuming that common applications are not obliged to initialize it
and are permitted to consider it as pure alignment padding.
In order to tell the kernel that the application
is aware of this field, it is necessary to set the socket option
\verb|IPV6_FLOWINFO_SEND|.

\begin{verbatim}
    int on = 1;
    setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO_SEND,
               (void*)&on, sizeof(on));
\end{verbatim}

The Linux kernel never fills the \verb|sin6_flowinfo| field when passing
a message to user space, though kernels which support flow labels
initialize it to zero. If the user wants to get the received flowinfo, he
sets the option \verb|IPV6_FLOWINFO| and after this receives
flowinfo as an ancillary data object of type \verb|IPV6_FLOWINFO|
(cf.\ RFC2292).

\begin{verbatim}
    int on = 1;
    setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO, (void*)&on, sizeof(on));
\end{verbatim}

Flowinfo received and latched by a connected TCP socket may also be fetched
with \verb|getsockopt()| \verb|IPV6_PKTOPTIONS| together with
other optional information.

Besides that, in the spirit of RFC2292 the option \verb|IPV6_FLOWINFO|
may be used as an alternative way to send flowinfo with \verb|sendmsg()| or
to latch it with \verb|IPV6_PKTOPTIONS|.

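
The \verb|sendmsg()| variant could be sketched as follows. The helper
\verb|attach_flowinfo()| is hypothetical: it only builds the ancillary
data object and assumes that the flowinfo value is passed in network
byte order. The fallback value 11 for \verb|IPV6_FLOWINFO| is the one
used by the Linux headers and is supplied here only for C libraries that
do not export the name.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SOL_IPV6
#define SOL_IPV6 IPPROTO_IPV6
#endif
#ifndef IPV6_FLOWINFO
#define IPV6_FLOWINFO 11  /* assumption: value from the Linux headers */
#endif

/* Attach one IPV6_FLOWINFO control message to msg (hypothetical helper).
 * cbuf must hold at least CMSG_SPACE(sizeof(uint32_t)) bytes and be
 * suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if the buffer is too small. */
int attach_flowinfo(struct msghdr *msg, void *cbuf, size_t cbuflen,
                    uint32_t flowinfo_net)
{
    struct cmsghdr *c;

    if (cbuflen < CMSG_SPACE(sizeof(flowinfo_net)))
        return -1;

    memset(cbuf, 0, cbuflen);
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(flowinfo_net));

    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_IPV6;
    c->cmsg_type = IPV6_FLOWINFO;
    c->cmsg_len = CMSG_LEN(sizeof(flowinfo_net));
    memcpy(CMSG_DATA(c), &flowinfo_net, sizeof(flowinfo_net));
    return 0;
}
```

The caller fills in \verb|msg_name| and \verb|msg_iov| as usual and then
passes the structure to \verb|sendmsg()|. A non-zero label presumably
still has to be leased first through the flow label manager described in
the next section.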
\paragraph{Note about IPv6 options and destination address.}
\addcontentsline{toc}{subsection}{IPv6 options and destination address}
If \verb|sin6_flowinfo| contains a non-zero flow label, the
destination address in \verb|sin6_addr| and non-fragmentable
extension headers are ignored. Instead, the kernel uses the values
cached at flow setup (see below). However, for connected sockets
the kernel prefers the values set at connection time.

\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
After setting the socket option \verb|IPV6_FLOWINFO|, the
flow label and DS field are received as an ancillary data object
of type \verb|IPV6_FLOWINFO| and level \verb|SOL_IPV6|.
In cases where it is convenient to use \verb|recvfrom(2)|,
it is possible to replace the library variant with your own one,
something like:

\begin{verbatim}
#include <sys/socket.h>
#include <netinet/in6.h>

/* cc and the return value are signed so that a recvmsg() error (-1)
   is detectable; with size_t the check below could never fire. */
ssize_t recvfrom(int fd, char *buf, size_t len, int flags,
                 struct sockaddr *addr, int *addrlen)
{
    ssize_t cc;
    char cbuf[128];
    struct cmsghdr *c;
    struct iovec iov = { buf, len };
    struct msghdr msg = { addr, *addrlen,
                          &iov, 1,
                          cbuf, sizeof(cbuf),
                          0 };

    cc = recvmsg(fd, &msg, flags);
    if (cc < 0)
        return cc;
    /* Default to zero; overwritten if the kernel supplied flowinfo. */
    ((struct sockaddr_in6*)addr)->sin6_flowinfo = 0;
    *addrlen = msg.msg_namelen;
    for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level != SOL_IPV6 ||
            c->cmsg_type != IPV6_FLOWINFO)
            continue;
        ((struct sockaddr_in6*)addr)->sin6_flowinfo = *(__u32*)CMSG_DATA(c);
    }
    return cc;
}
\end{verbatim}



\section{Flow label management.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
The requirements of RFC2460 are pretty tough. In particular, lifetimes
that outlast a reboot require allocated labels to be kept in stable
storage, so that a full implementation necessarily includes a user space flow
label manager. There are at least three different approaches:

\begin{enumerate}
\item {\bf ``Cooperative''. } We could leave flow label allocation wholly
to user space. When a user needs a label, he requests it from the manager directly.
The approach is valid, but like any ``cooperative'' approach it suffers
from security problems.

\begin{NB}
One idea is to disallow unprivileged users from allocating flow
labels, and instead to pass the socket to the manager via an \verb|SCM_RIGHTS|
control message, so that the manager allocates the label and assigns it
to the socket itself. Hmm... the idea is interesting.
\end{NB}

\item {\bf ``Indirect''.} The kernel redirects requests to a user level daemon
and does not install the label until the daemon has acknowledged the request.
This approach is the most promising; it is especially pleasant to recognize
the parallel with the IPsec API [RFC2367,Craig]. Actually, it may share an API with
IPsec.

\item {\bf ``Stupid''.} To allocate labels in kernel space. It is the simplest
method, but it suffers from two serious flaws: first,
we cannot lease labels with lifetimes that outlast a reboot; second,
it is sensitive to DoS attacks. The kernel has to remember all the obsolete
labels until their expiration, and a malicious user may quickly eat up all the
flow label space.

\end{enumerate}

Certainly, I chose the most ``stupid'' method. It is the cheapest one
for the implementor (i.e.\ me), and taking into account that flow labels
still have no serious applications, it is not worthwhile to work on a more
advanced API, especially since eventually we
will get it for free together with IPsec.


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
The socket option \verb|IPV6_FLOWLABEL_MGR| allows an application to
request the flow label manager to allocate a new flow label, to reuse
an already allocated one or to delete an old flow label.
Its argument is \verb|struct| \verb|in6_flowlabel_req|:

\begin{verbatim}
struct in6_flowlabel_req
{
        struct in6_addr flr_dst;
        __u32           flr_label;
        __u8            flr_action;
        __u8            flr_share;
        __u16           flr_flags;
        __u16           flr_expires;
        __u16           flr_linger;
        __u32           __flr_reserved;
        /* Options in format of IPV6_PKTOPTIONS */
};
\end{verbatim}

\begin{itemize}

\item \verb|dst| is the IPv6 destination address associated with the label.

\item \verb|label| is the flow label value in network byte order. If it is zero,
the kernel will allocate a new pseudo-random number. Otherwise, the kernel will try
to lease the flow label requested by the user. In this case, it is the user's
task to provide the necessary flow label randomness.

\item \verb|action| is the requested operation. Currently, only three operations
are defined:

\begin{verbatim}
#define IPV6_FL_A_GET   0   /* Get flow label */
#define IPV6_FL_A_PUT   1   /* Release flow label */
#define IPV6_FL_A_RENEW 2   /* Update expire time */
\end{verbatim}

\item \verb|flags| are optional modifiers. Currently
only \verb|IPV6_FL_A_GET| has modifiers:

\begin{verbatim}
#define IPV6_FL_F_CREATE 1   /* Allowed to create new label */
#define IPV6_FL_F_EXCL   2   /* Do not create new label */
\end{verbatim}


\item \verb|share| defines who is allowed to reuse the same flow label.

\begin{verbatim}
#define IPV6_FL_S_NONE    0   /* Not defined */
#define IPV6_FL_S_EXCL    1   /* Label is private */
#define IPV6_FL_S_PROCESS 2   /* May be reused by this process */
#define IPV6_FL_S_USER    3   /* May be reused by this user */
#define IPV6_FL_S_ANY     255 /* Anyone may reuse it */
\end{verbatim}

\item \verb|linger| is a time in seconds. After the last user releases the flow
label, it will not be reused with a different destination and options for at least
this time. If \verb|share| is not \verb|IPV6_FL_S_EXCL| the label
can still be shared by other sockets. The current implementation does not allow
unprivileged users to set linger longer than 60 sec.

\item \verb|expires| is a time in seconds. The flow label will be kept for at least
this time; even after that, it will not be destroyed until the user has released
it explicitly or closed all the sockets using it. The current implementation does not allow
unprivileged users to set a timeout longer than 60 sec. Privileged applications
MAY set longer lifetimes, but in this case they MUST save allocated
labels in stable storage and restore them after reboot before the first
application allocates a new flow.

\end{itemize}

This structure is followed by optional extension headers associated
with this flow label in the format of \verb|IPV6_PKTOPTIONS|. Only
\verb|IPV6_HOPOPTS|, \verb|IPV6_RTHDR| and, if \verb|IPV6_RTHDR| is present,
\verb|IPV6_DSTOPTS| are allowed.

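
As an illustration of these fields, here is a hedged sketch of how a
lease could be prolonged with \verb|IPV6_FL_A_RENEW|. The helper names
\verb|fill_renew_req()| and \verb|renew_flow_label()| are hypothetical.
It is also an assumption, not stated above, that the kernel identifies
the label to renew by \verb|flr_label| alone when the socket already
holds it, and that \verb|<linux/in6.h>| can be included after the libc
network headers on the target system.

```c
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>   /* htonl() */
#include <linux/in6.h>   /* struct in6_flowlabel_req, IPV6_FL_A_RENEW */

#ifndef SOL_IPV6
#define SOL_IPV6 IPPROTO_IPV6
#endif

/* Fill a renew request (hypothetical helper).
 * label is the 20-bit flow label in host byte order;
 * expires and linger are in seconds. */
static void fill_renew_req(struct in6_flowlabel_req *freq, __u32 label,
                           __u16 expires, __u16 linger)
{
    memset(freq, 0, sizeof(*freq));
    freq->flr_label = htonl(label & 0xFFFFF);  /* network byte order */
    freq->flr_action = IPV6_FL_A_RENEW;
    freq->flr_expires = expires;
    freq->flr_linger = linger;
}

/* Ask the kernel to extend the lease on fd.
 * Returns the setsockopt() result: 0 on success, -1 on error. */
int renew_flow_label(int fd, __u32 label, __u16 expires, __u16 linger)
{
    struct in6_flowlabel_req freq;

    fill_renew_req(&freq, label, expires, linger);
    return setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                      &freq, sizeof(freq));
}
```

An unprivileged caller would keep both time values at or below the 60
second limit mentioned above and call the function again before the
lease runs out.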
\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
The function \verb|get_flow_label| allocates
a private flow label.

\begin{verbatim}
int get_flow_label(int fd, struct sockaddr_in6 *dst, __u32 fl)
{
    int on = 1;
    struct in6_flowlabel_req freq;

    /* Request a new private label; fl == 0 lets the kernel pick one. */
    memset(&freq, 0, sizeof(freq));
    freq.flr_label = htonl(fl);
    freq.flr_action = IPV6_FL_A_GET;
    freq.flr_flags = IPV6_FL_F_CREATE | IPV6_FL_F_EXCL;
    freq.flr_share = IPV6_FL_S_EXCL;
    memcpy(&freq.flr_dst, &dst->sin6_addr, 16);
    if (setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                   &freq, sizeof(freq)) == -1) {
        perror ("can't lease flowlabel");
        return -1;
    }
    /* The label is already in network byte order. */
    dst->sin6_flowinfo |= freq.flr_label;

    /* Tell the kernel to honour sin6_flowinfo on this socket. */
    if (setsockopt(fd, SOL_IPV6, IPV6_FLOWINFO_SEND,
                   &on, sizeof(on)) == -1) {
        perror ("can't send flowinfo");

        /* Give the label back before failing. */
        freq.flr_action = IPV6_FL_A_PUT;
        setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                   &freq, sizeof(freq));
        return -1;
    }
    return 0;
}
\end{verbatim}

A somewhat more complicated example using a routing header can be found
in the \verb|ping6| utility (\verb|iputils| package). The Linux rsvpd backend
contains an example of using the operation \verb|IPV6_FL_A_RENEW|.

\paragraph{Listing flow labels.}
\addcontentsline{toc}{subsection}{Listing flow labels}
The list of currently allocated
flow labels may be read from \verb|/proc/net/ip6_flowlabel|.

\begin{verbatim}
Label S Owner Users Linger Expires Dst                              Opt
A1BE5 1 0     0     6      3       3ffe2400000000010a0020fffe71fb30 0
\end{verbatim}

\begin{itemize}
\item \verb|Label| is the hexadecimal flow label value.
\item \verb|S| is the sharing style.
\item \verb|Owner| is the ID of the creator; it is zero, a pid or a uid, depending on
the sharing style.
\item \verb|Users| is the number of applications using the label now.
\item \verb|Linger| is the \verb|linger| of this label in seconds.
\item \verb|Expires| is the time until expiration of the label in seconds. It may
be negative if the label is in use.
\item \verb|Dst| is the IPv6 destination address.
\item \verb|Opt| is the length of the options associated with the label. Option
data are not accessible.
\end{itemize}


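
A line of this listing could be parsed roughly as follows. This is only
a sketch: it assumes the eight whitespace-separated columns shown above,
and both \verb|struct flowlabel_entry| and
\verb|parse_flowlabel_line()| are hypothetical names.

```c
#include <stdio.h>

/* One row of /proc/net/ip6_flowlabel (hypothetical container). */
struct flowlabel_entry {
    unsigned int label;    /* flow label, parsed from hexadecimal */
    int share;             /* sharing style */
    int owner;             /* zero, pid or uid */
    int users;             /* number of current users */
    int linger;            /* seconds */
    long expires;          /* seconds, may be negative */
    char dst[33];          /* 32 hex digits plus terminating NUL */
    int optlen;            /* length of associated options */
};

/* Parse one data line. Returns 0 on success and -1 if the line does
 * not match the expected layout (the header line is rejected too,
 * because it does not start with a hexadecimal number). */
int parse_flowlabel_line(const char *line, struct flowlabel_entry *e)
{
    int n = sscanf(line, "%x %d %d %d %d %ld %32s %d",
                   &e->label, &e->share, &e->owner, &e->users,
                   &e->linger, &e->expires, e->dst, &e->optlen);
    return n == 8 ? 0 : -1;
}
```

A caller would typically read the file line by line with
\verb|fgets()| and skip every line for which the function returns $-1$.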
\paragraph{Flow labels and RSVP.}
\addcontentsline{toc}{subsection}{Flow labels and RSVP}
The RSVP daemon supports IPv6 flow labels
without any modifications to the standard ISI RAPI. The sender must allocate
a flow label, fill in the corresponding sender template and submit it to the local rsvp
daemon. rsvpd will check the label and start to announce it in PATH
messages. rsvpd on the sender node will renew the flow label, so that it will not
be reused before path state expires and all the intermediate
routers and the receiver purge flow state.

The \verb|rtap| utility is modified to parse flow labels. E.g.\ if the user has allocated
flow label \verb|0xA1234|, he may write:

\begin{verbatim}
RTAP> sender 3ffe:2400::1/FL0xA1234 <Tspec>
\end{verbatim}

The receiver makes a reservation with the command:
\begin{verbatim}
RTAP> reserve ff 3ffe:2400::1/FL0xA1234 <Flowspec>
\end{verbatim}

\end{document}