\documentclass[12pt,twoside]{article}

\usepackage[hidelinks]{hyperref} % \url
\usepackage{booktabs} % nicer tabulars
\usepackage{fancyvrb}
\usepackage{fullpage}
\usepackage{float}

\newcommand{\iface}{\textit}
\newcommand{\cmd}{\texttt}
\newcommand{\man}{\textit}
\newcommand{\qdisc}{\texttt}
\newcommand{\filter}{\texttt}

\begin{document}
\title{QoS in Linux with TC and Filters}
\author{Phil Sutter (phil@nwl.cc)}
\date{January 2016}
\maketitle

Standard practice when transmitting packets over a medium which may block (due
to congestion, for example) is to use a queue which temporarily holds these
packets. In Linux, this queueing approach is where QoS happens: A Queueing
Discipline (qdisc) holds multiple packet queues with different priorities for
dequeueing to the network driver. The classification (i.e. deciding which queue
a packet should go into) is typically done based on the Type Of Service (IPv4)
or Traffic Class (IPv6) header fields but, depending on the qdisc
implementation, might be controlled by the user as well.

Qdiscs come in two flavors, classful or classless. While classless qdiscs are
not as flexible as classful ones, they also require much less customizing. Often
it is enough to just attach them to an interface, without exact knowledge of
what is done internally. Classful qdiscs are the exact opposite: flexible in
application, they are often not even usable without insightful configuration.

As the name implies, classful qdiscs provide configurable classes to sort
traffic into. In its basic form, this is not much different from, say, the
classless \qdisc{pfifo\_fast} which holds three queues and classifies each
packet based on its priority field. Typically though, classes go beyond that by
supporting nesting and additional characteristics such as a maximum traffic
rate or quantum.

When it comes to controlling the classification process, filters come into play.
They attach to the parent of a set of classes (i.e. either the qdisc itself or
a parent class) and specify what a packet (or its associated flow) has to look
like in order to suit a given class. To go beyond this simple scheme, it is
possible to attach multiple filters to the same parent, which then consults each
of them in turn until the first one accepts the packet.

Before getting into detail about what filters there are and how to use them, a
simple setup of a qdisc with classes is necessary:
\begin{figure}[H]
\begin{Verbatim}
.-------------------------------------------------------.
|                                                       |
|  HTB                                                  |
|                                                       |
| .----------------------------------------------------.|
| |                                                    ||
| |  Class 1:1                                         ||
| |                                                    ||
| | .---------------..---------------..---------------.||
| | |               ||               ||               |||
| | |  Class 1:10   ||  Class 1:20   ||  Class 1:30   |||
| | |               ||               ||               |||
| | | .------------.|| .------------.|| .------------.|||
| | | |            ||| |            ||| |            ||||
| | | |  fq_codel  ||| |  fq_codel  ||| |  fq_codel  ||||
| | | |            ||| |            ||| |            ||||
| | | '------------'|| '------------'|| '------------'|||
| | '---------------''---------------''---------------'||
| '----------------------------------------------------'|
'-------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
The following commands establish the basic setup shown:
\begin{Verbatim}
(1) # tc qdisc replace dev eth0 root handle 1: htb default 30
(2) # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
(3) # alias tclass='tc class add dev eth0 parent 1:1'
(4) # tclass classid 1:10 htb rate 1mbit ceil 20mbit prio 1
(4) # tclass classid 1:20 htb rate 90mbit ceil 95mbit prio 2
(4) # tclass classid 1:30 htb rate 1mbit ceil 95mbit prio 3
(5) # tc qdisc add dev eth0 parent 1:10 fq_codel
(5) # tc qdisc add dev eth0 parent 1:20 fq_codel
(5) # tc qdisc add dev eth0 parent 1:30 fq_codel
\end{Verbatim}
A little explanation for the unfamiliar reader:
\begin{enumerate}
\item Replace the root qdisc of \iface{eth0} with an instance of \qdisc{HTB}.
        Specifying the handle is necessary so it can be referenced in subsequent
        calls to \cmd{tc}. The default class for unclassified traffic is set to
        30.
\item Create a single top-level class with handle 1:1 which limits the total
        bandwidth allowed to 95mbit/s. It is assumed that \iface{eth0} is a
        100mbit/s link; staying a little below that helps to keep the main point
        of enqueueing in the qdisc layer instead of the interface hardware queue
        or at another bottleneck in the network.
\item Define an alias for the common part of the remaining three calls in order
        to improve readability. This means all remaining classes are attached to
        the common parent class from (2).
\item Create three child classes for different uses: Class 1:10 has highest
        priority but is tightly limited in bandwidth - fine for interactive
        connections. Class 1:20 has mid priority and high guaranteed bandwidth,
        for high priority bulk traffic. Finally, there's the default class 1:30
        with lowest priority, low guaranteed bandwidth and the ability to use
        the full link in case it's unused otherwise. This should be fine for
        uninteresting traffic not explicitly taken care of.
\item Attach a leaf qdisc to each of the child classes created in (4). Since
        \qdisc{HTB} by default attaches \qdisc{pfifo} as leaf qdisc, this step
        is optional. Still, the fairness between different flows provided by the
        classless \qdisc{fq\_codel} is worth the effort.
\end{enumerate}
More information about the qdiscs and fine-tuning parameters can be found in
\man{tc-htb(8)} and \man{tc-fq\_codel(8)}.
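
As a quick sanity check of the numbers used in (2) and (4), and assuming the
common \qdisc{HTB} rule of thumb that the guaranteed rates of child classes
should add up to no more than the rate of their parent, the rate budget works
out as follows:
\[
  \underbrace{1}_{\mathtt{1{:}10}} + \underbrace{90}_{\mathtt{1{:}20}} +
  \underbrace{1}_{\mathtt{1{:}30}} = 92\,\mathrm{mbit/s} \le 95\,\mathrm{mbit/s}
\]
The remaining 3mbit/s, like any bandwidth a class currently leaves unused, can
be borrowed by the child classes up to their respective \texttt{ceil} value.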

Without any additional setup, all traffic leaving \iface{eth0} is now shaped to
95mbit/s and directed through class 1:30. This can be verified by looking at the
\texttt{Sent} field of the class statistics printed via \cmd{tc -s class show dev eth0}:
Only the root class 1:1 and its child 1:30 should show any traffic.


\section*{Finally time to start filtering!}

Let's begin with a simple one, i.e. reestablishing what \qdisc{pfifo\_fast} did
automatically based on the TOS/Priority field. Linux internally translates the
header field into the priority field of \texttt{struct sk\_buff}, which
\qdisc{pfifo\_fast} uses for
classification. \man{tc-prio(8)} contains a table listing the priority (and
ultimately, \qdisc{pfifo\_fast} queue index) each TOS value is translated into.
Here is a shorter version:
\begin{center}
\begin{tabular}{lll}
TOS Values & Linux Priority (Number) & Queue Index \\
\midrule
0x0 - 0x6 & Best Effort (0) & 1 \\
0x8 - 0xe & Bulk (2) & 2 \\
0x10 - 0x16 & Interactive (6) & 0 \\
0x18 - 0x1e & Interactive Bulk (4) & 1 \\
\end{tabular}
\end{center}
Using the \filter{basic} filter, it is possible to match packets based on that
\texttt{sk\_buff} field, which has the added benefit of being IP version
agnostic. Since the \qdisc{HTB} setup above defaults to class ID 1:30, the Bulk
priority can be ignored. The \filter{basic} filter allows combining matches,
therefore we get by with only two filters:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 0)' \
        or 'meta(priority eq 4)' classid 1:20
\end{Verbatim}
A detailed description of the \filter{basic} filter and the ematch syntax it uses can be
found in \man{tc-basic(8)} and \man{tc-ematch(8)}.
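
To check that both filters were accepted, the filters attached to the qdisc can
be listed. This is only a suggestion for verification and not part of the setup
itself; the exact output format depends on the iproute2 version in use:
\begin{Verbatim}
# tc filter show dev eth0 parent 1:
\end{Verbatim}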

Obviously, this first example cries for optimization. A simple one would be to
just change the default class from 1:30 to 1:20, so filters are only needed for
Bulk and Interactive priorities:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 2)' classid 1:30
\end{Verbatim}
Given that class IDs are arbitrary, choosing them wisely allows for a direct
mapping. So first, recreate the qdisc and classes configuration:
\begin{Verbatim}
# tc qdisc replace dev eth0 root handle 1: htb default 10
# tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
# alias tclass='tc class add dev eth0 parent 1:1'
# tclass classid 1:16 htb rate 1mbit ceil 20mbit prio 1
# tclass classid 1:10 htb rate 90mbit ceil 95mbit prio 2
# tclass classid 1:12 htb rate 1mbit ceil 95mbit prio 3
# tc qdisc add dev eth0 parent 1:16 fq_codel
# tc qdisc add dev eth0 parent 1:10 fq_codel
# tc qdisc add dev eth0 parent 1:12 fq_codel
\end{Verbatim}
This is basically identical to the setup above, but with changed leaf class IDs
and the second priority class being the default. Using the \filter{flow} filter
with its \texttt{map} functionality, a single filter command is enough:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: handle 0x1337 flow \
        map key priority baseclass 1:10
\end{Verbatim}
The \filter{flow} filter now uses the priority value to construct a destination class ID
by adding it to the value of \texttt{baseclass}. While this works for priority values of
0, 2 and 6, it will result in the non-existent class ID 1:14 for Interactive Bulk
traffic. In that case, the \qdisc{HTB} default applies so that traffic goes into class
ID 1:10 just as intended. Please note that specifying a handle is a mandatory
requirement of the \filter{flow} filter, although I didn't see where one would use it
later. For more information about \filter{flow}, see \man{tc-flow(8)}.
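
The arithmetic behind this mapping can be spelled out explicitly. The following
table is a small worked example, relying on the fact that the minor number of a
class ID is written in hexadecimal, like all \cmd{tc} handles:
\begin{center}
\begin{tabular}{lll}
Linux Priority (Number) & Computation & Resulting Class ID \\
\midrule
Best Effort (0) & $\mathtt{0x10} + 0 = \mathtt{0x10}$ & 1:10 \\
Bulk (2) & $\mathtt{0x10} + 2 = \mathtt{0x12}$ & 1:12 \\
Interactive Bulk (4) & $\mathtt{0x10} + 4 = \mathtt{0x14}$ & 1:14 (non-existent, default 1:10 applies) \\
Interactive (6) & $\mathtt{0x10} + 6 = \mathtt{0x16}$ & 1:16 \\
\end{tabular}
\end{center}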

While \filter{flow} and \filter{basic} filters are relatively easy to apply and
understand, they are also quite limited to their intended purpose. A more
flexible option is the \filter{u32} filter, which allows matching on arbitrary
parts of the packet data - yet only on that, not on any metadata associated
with it by the kernel (with the exception of the firewall mark value). So in
order to continue this little exercise with \filter{u32}, we have to base
classification directly upon the actual TOS value. An intuitive attempt might
look like this:
\begin{Verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1:'
# tcfilter u32 match ip dsfield 0x10 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x12 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x14 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x16 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x8 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xa 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xc 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xe 0x1e classid 1:12
\end{Verbatim}
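Each of these filters compares the DS field with the given value after applying
the given mask. Written out for the first filter, and taking a TOS value of
\texttt{0x11} as an arbitrarily chosen example, the check is:
\[
  \mathtt{0x11} \mathbin{\&} \mathtt{0x1e} = \mathtt{0x10}
\]
The result equals the value given in the first filter, so the packet is put into
class 1:16. A TOS value of \texttt{0x8} on the other hand yields
$\mathtt{0x8} \mathbin{\&} \mathtt{0x1e} = \mathtt{0x8}$, which fails the first
four filters and is accepted by the fifth one only.
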
The obvious drawback here is the number of filters needed. And without the
default class, eight more filters would be necessary. This also has performance
implications: A packet with TOS value 0xe will be checked eight times in total
in order to determine its destination class. While there's not much to be done
about the number of filters, at least the performance problem can be eliminated
by using \filter{u32}'s hash table support:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 99 handle 1: u32 divisor 16
\end{Verbatim}
This creates a hash table with 16 buckets. The table size is arbitrary, but not
random: Since the lowest bit of the TOS field is not interesting, it can be
ignored and therefore the range of values to consider is just [0;15], i.e.
16 different values. The next step is to populate the hash table:
\begin{Verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1: prio 99'
# tcfilter u32 match u8 0 0 ht 1:0: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:1: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:2: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:3: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:4: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:5: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:6: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:7: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:8: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:9: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:a: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:b: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:c: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:d: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:e: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:f: classid 1:10
\end{Verbatim}
The parameter \texttt{ht} denotes the hash table and bucket the filter should be added
to. Since the lowest TOS bit is ignored, the TOS value has to be divided by two in
order to get to the bucket it maps to. A TOS value of 0x10, for example, will
therefore map to bucket 0x8. For the sake of completeness, all possible values
are mapped and therefore a configurable default class is not required. Note that
the match expression used here serves no purpose of its own, but is mandatory.
Therefore anything that matches any packet will suffice. Finally, a filter which
links to the defined hash table is needed:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 1 protocol ip u32 \
        link 1: hashkey mask 0x001e0000 match u8 0 0
\end{Verbatim}
Here again, the actual match statement is not necessary, but syntactically
required. All the magic lies within the \texttt{hashkey} parameter, which defines which
part of the packet should be used directly as hash key. Here's a drawing of the
first four bytes of the IPv4 header, with the area selected by \texttt{hashkey mask}
highlighted:
\begin{figure}[H]
\begin{Verbatim}
 0                 1                 2              3
.-----------------------------------------------------------------.
|        |        | ########   |    |                             |
| Version|  IHL   | #DSCP###   | ECN|  Total Length               |
|        |        | ########   |    |                             |
'-----------------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
This may look confusing at first, but keep in mind that the drawing shows bits
as well as bytes LSB first, while the mask value is written in the MSB-first
notation we humans use. Therefore reading the mask is done like so, starting
from the left:
\begin{enumerate}
\item Skip the first byte (which contains Version and IHL fields).
\item Skip the lowest bit of the second byte (0x1e is even).
\item Mark the four following bits (0x1e is 11110 in binary).
\item Skip the remaining three bits of the second byte as well as the remaining two
        bytes.
\end{enumerate}
Before doing the lookup, the kernel right-shifts the masked value by the number
of trailing zero bits in \texttt{mask}, which implicitly also does the division by two which the
hash table depends on. With this setup, every packet has to pass exactly two
filters to be classified. Note that this filter is limited to IPv4 packets: Due
to the related Traffic Class field being at a different offset in the packet, it
would not work for IPv6. To use the same setup for IPv6 as well, a second
entry-level filter is necessary:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 2 protocol ipv6 u32 \
        link 1: hashkey mask 0x01e00000 match u8 0 0
\end{Verbatim}
For illustration purposes, here again is a drawing of the first four bytes of
the IPv6 header, again with masked area highlighted:
\begin{figure}[H]
\begin{Verbatim}
 0                 1                 2              3
.-----------------------------------------------------------------.
|        | ########       |                                       |
| Version| #Traffic Class |  Flow Label                           |
|        | ########       |                                       |
'-----------------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
Reading the mask value is analogous to IPv4 with the added complexity that
Traffic Class spans two bytes. Yet, for comparison there's a simple trick:
IPv6 has the interesting field shifted by four bits to the left, and the new
mask's value is shifted by the same amount. For further information about
\filter{u32} and what can be done with it, consult its man page
\man{tc-u32(8)}.
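
To make the hash key computation more tangible, here is a worked example. The
header words are made up for illustration: an IPv4 header starting with
\texttt{0x45100054} (version 4, IHL 5, TOS 0x10, total length 84) and an IPv6
header starting with \texttt{0x61000000} (version 6, Traffic Class 0x10, flow
label 0). The shift amounts of 17 and 21 are the numbers of trailing zero bits
in the respective masks:
\begin{eqnarray*}
(\mathtt{0x45100054} \mathbin{\&} \mathtt{0x001e0000}) \gg 17
        &=& \mathtt{0x00100000} \gg 17 = \mathtt{0x8} \\
(\mathtt{0x61000000} \mathbin{\&} \mathtt{0x01e00000}) \gg 21
        &=& \mathtt{0x01000000} \gg 21 = \mathtt{0x8}
\end{eqnarray*}
Both packets therefore end up in bucket 8 of the hash table, which holds a
filter pointing at class 1:16, the one meant for interactive traffic.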

Of course, the kernel provides many more filters than just \filter{basic},
\filter{flow} and \filter{u32} which have been presented above. As of now, the
remaining ones are:
\begin{description}
\item[bpf]
        Filtering using Berkeley Packet Filter programs. The program's return
        code determines the packet's destination class ID.

\item[cgroup]
        Filter packets based on control groups. This is only useful for packets
        originating from the local host, as control groups only exist in that
        scope.

\item[flower]
        An extended variant of the flow filter.

\item[fw]
        Matches on firewall mark values previously assigned to the packet by
        netfilter (or a filter action, see below for details). This allows
        moving the classification logic into netfilter, which is very
        convenient if appropriate rules already exist there on the same
        system.

\item[route]
        Filter packets based on the matching routing table entry. Basically
        equivalent to the \texttt{fw} filter above, to make use of an already existing
        extensive routing table setup.

\item[rsvp, rsvp6]
        Implementation of the Resource Reservation Protocol in Linux, to react
        upon requests sent by an RSVP daemon.

\item[tcindex]
        Match packets based on tcindex value, which is usually set by the dsmark
        qdisc. This is part of an approach to support Differentiated Services in
        Linux, which is another topic of its own.
\end{description}


\section*{Filter Actions}

The tc filter framework provides the infrastructure for another extensible set of
tools as well, namely tc actions. As the name suggests, they allow doing things
with packets (or associated data). Actions, or more precisely a list of them,
are part of a given filter. If it matches, each action it contains is executed
in order before returning the classification result. Since the action has direct
access to the latter, it is in theory possible for an action to react upon or
even change the filtering result - as long as the packet matched, of course. Yet
none of the currently in-tree actions make use of this.

The Generic Actions framework originally evolved out of the filters' ability to
police traffic to a given maximum bandwidth. One common use case for that is to
limit ingress traffic, dropping packets which exceed the threshold. A classic
setup looks like this:
\begin{Verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        police rate 1mbit burst 100k
\end{Verbatim}
The ingress qdisc is not a real one, but merely a point of reference for filters
to attach to which should get applied to incoming traffic. The \filter{u32} filter added
above matches any packet and therefore limits the total incoming bandwidth to
1mbit/s, allowing bursts of up to 100kbytes. Using the new syntax, the filter
command changes slightly:
\begin{Verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k
\end{Verbatim}
The important detail is that this syntax allows defining multiple actions.
E.g. for testing purposes, it is possible to redirect exceeding traffic to the
loopback interface instead of dropping it:
\begin{Verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k conform-exceed pipe \
        action mirred egress redirect dev lo
\end{Verbatim}
The added parameter \texttt{conform-exceed pipe} tells the police action to allow for
further actions to handle the exceeding packet.
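
To get a feeling for the policer parameters used in these examples: assuming
\cmd{tc}'s usual unit conventions (\texttt{mbit} meaning $10^6$ bits per second
and \texttt{k} meaning 1024 bytes when specifying a size), a burst of 100k
corresponds to a little less than one second of traffic at the configured rate:
\[
  \frac{100 \cdot 1024 \cdot 8\,\mathrm{bit}}{10^6\,\mathrm{bit/s}}
  \approx 0.82\,\mathrm{s}
\]
In other words, a sender which has been idle for long enough may deliver roughly
0.8 seconds worth of policed traffic at once before packets start to exceed the
limit.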

Apart from \texttt{police} and \texttt{mirred} actions, there are a few more. Here's a full
list of the currently implemented ones:
\begin{description}
\item[bpf]
        Apply a Berkeley Packet Filter program to the packet.

\item[connmark]
        Set the packet's firewall mark to that of its connection. This works by
        searching the conntrack table for a matching entry. If found, the mark
        is restored.

\item[csum]
        Trigger recalculation of packet checksums. The supported protocols are:
        IPv4, ICMP, IGMP, TCP, UDP and UDPLite.

\item[ipt]
        Pass the packet to an iptables target. This allows using iptables
        extensions directly instead of having to go the extra mile via setting
        an arbitrary firewall mark and matching on that from within netfilter.

\item[mirred]
        Mirror or redirect packets. This is often combined with the ifb pseudo
        device to share a common QoS setup between multiple interfaces or even
        ingress traffic.

\item[nat]
        Perform stateless Network Address Translation. This is certainly not
        complete and therefore inferior to NAT using iptables: Although the
        kernel module distinguishes between TCP, UDP and ICMP traffic, it does
        not handle typical problematic protocols such as active FTP or SIP.

\item[pedit]
        Generic packet editing. This allows altering arbitrary bytes of the
        packet, either by specifying an offset into the packet or by naming a
        packet header and field name to change. Currently, the latter is
        implemented only for IPv4.

\item[police]
        Apply a bandwidth rate limiting policy. Packets exceeding it are dropped
        by default, but may optionally be handled differently.

\item[simple]
        This is more of an example than a real action. All it does is print a
        user-defined string together with a packet counter. Useful maybe for
        debugging when filter statistics are not available or too complicated.

\item[skbedit]
        Edit associated packet data. It supports changing queue mapping,
        priority field and firewall mark value.

\item[vlan]
        Add/remove a VLAN header to/from the packet. This might serve as an
        alternative to using 802.1Q pseudo-interfaces in combination with
        routing rules when e.g. packets for a given destination need to be
        encapsulated.
\end{description}


\section*{Intermediate Functional Block}

The Intermediate Functional Block (\texttt{ifb}) pseudo network interface acts as a QoS
concentrator for multiple different sources of traffic. Packets from or to other
interfaces have to be redirected to it using the \texttt{mirred} action in order to be
handled; regularly routed traffic will be dropped. This way, a single stack of
qdiscs, classes and filters can be shared between multiple interfaces.

Here's a simple example to feed incoming traffic from multiple interfaces
through a Stochastic Fairness Queue (\qdisc{sfq}):
\begin{Verbatim}
(1) # modprobe ifb
(2) # ip link set ifb0 up
(3) # tc qdisc add dev ifb0 root sfq
\end{Verbatim}
The first step is to load the \texttt{ifb} kernel module (1). By default, this will
create two ifb devices: \iface{ifb0} and \iface{ifb1}. After setting
\iface{ifb0} up in (2), the root
qdisc is replaced by \qdisc{sfq} in (3). Finally, one can start redirecting ingress
traffic to \iface{ifb0}, e.g. from \iface{eth0}:
\begin{Verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action mirred egress redirect dev ifb0
\end{Verbatim}
The same can be done for other interfaces, just replacing \iface{eth0} in the two
commands above. One thing to keep in mind here is the asymmetrical routing this
creates within the host doing the QoS: Incoming packets enter the system via
\iface{ifb0}, while corresponding replies leave directly via \iface{eth0}. This can be observed
using \cmd{tcpdump} on \iface{ifb0}, which shows the input part of the traffic only. What's
more confusing is that \cmd{tcpdump} on \iface{eth0} shows both incoming and outgoing traffic,
but the redirection is still effective - a simple proof is setting
\iface{ifb0} down,
which will interrupt the communication. Obviously \cmd{tcpdump} catches the packets to
dump before they enter the ingress qdisc, which is why it sees them while the
kernel itself doesn't.
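
A less confusing way to confirm that redirected traffic actually arrives is to
look at the statistics of the qdisc attached to \iface{ifb0}. This is merely a
suggested check; the counters shown should increase while traffic is being
received on \iface{eth0}:
\begin{Verbatim}
# tc -s qdisc show dev ifb0
\end{Verbatim}
To undo the redirection again, it should be sufficient to delete the ingress
qdisc from \iface{eth0}, as this removes the filters attached to it as well:
\begin{Verbatim}
# tc qdisc del dev eth0 ingress
\end{Verbatim}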


\section*{Conclusion}

Once the steep learning curve has been mastered, the conglomerate of (classful)
qdiscs, filters and actions provides a highly sophisticated and flexible
infrastructure to perform QoS, which plays nicely with routing and
firewalling setups.


\section*{Further Reading}

A good starting point for novice users and experienced ones diving into unknown
areas is the extensive HOWTO at \url{http://lartc.org}. The iproute2 package ships
some examples (usually in /usr/share/doc/, depending on distribution) as well as
man pages for \cmd{tc} in general, qdiscs and filters. The latter have been added
just recently though, so if your distribution does not ship iproute2 version
4.3.0 yet, these are not in there. Apart from that, the internet is a rich
source of HOWTOs and scripts people wrote - though these should be taken with a
grain of salt: The complexity of the matter often leads to copying others'
solutions without much validation, which allows suboptimal or even obsolete
implementations to survive much longer than desired.

\end{document}