.TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux"
| 2 | .SH NAME |
| 3 | BPF \- BPF programmable classifier and actions for ingress/egress |
| 4 | queueing disciplines |
| 5 | .SH SYNOPSIS |
| 6 | .SS eBPF classifier (filter) or action: |
| 7 | .B tc filter ... bpf |
| 8 | [ |
| 9 | .B object-file |
| 10 | OBJ_FILE ] [ |
| 11 | .B section |
| 12 | CLS_NAME ] [ |
| 13 | .B export |
| 14 | UDS_FILE ] [ |
| 15 | .B verbose |
| 16 | ] [ |
.B skip_hw
| 18 | | |
| 19 | .B skip_sw |
| 20 | ] [ |
.B police
| 22 | POLICE_SPEC ] [ |
| 23 | .B action |
| 24 | ACTION_SPEC ] [ |
| 25 | .B classid |
| 26 | CLASSID ] |
| 27 | .br |
| 28 | .B tc action ... bpf |
| 29 | [ |
| 30 | .B object-file |
| 31 | OBJ_FILE ] [ |
| 32 | .B section |
| 33 | CLS_NAME ] [ |
| 34 | .B export |
| 35 | UDS_FILE ] [ |
| 36 | .B verbose |
| 37 | ] |
| 38 | |
| 39 | .SS cBPF classifier (filter) or action: |
| 40 | .B tc filter ... bpf |
| 41 | [ |
| 42 | .B bytecode-file |
| 43 | BPF_FILE | |
| 44 | .B bytecode |
| 45 | BPF_BYTECODE ] [ |
| 46 | .B police |
| 47 | POLICE_SPEC ] [ |
| 48 | .B action |
| 49 | ACTION_SPEC ] [ |
| 50 | .B classid |
| 51 | CLASSID ] |
| 52 | .br |
| 53 | .B tc action ... bpf |
| 54 | [ |
| 55 | .B bytecode-file |
| 56 | BPF_FILE | |
| 57 | .B bytecode |
| 58 | BPF_BYTECODE ] |
| 59 | |
| 60 | .SH DESCRIPTION |
| 61 | |
| 62 | Extended Berkeley Packet Filter ( |
| 63 | .B eBPF |
| 64 | ) and classic Berkeley Packet Filter |
| 65 | (originally known as BPF, for better distinction referred to as |
| 66 | .B cBPF |
| 67 | here) are both available as a fully programmable and highly efficient |
| 68 | classifier and actions. They both offer a minimal instruction set for |
| 69 | implementing small programs which can safely be loaded into the kernel |
| 70 | and thus executed in a tiny virtual machine from kernel space. An in-kernel |
| 71 | verifier guarantees that a specified program always terminates and neither |
| 72 | crashes nor leaks data from the kernel. |
| 73 | |
In Linux, eBPF is generally considered the successor of cBPF.
The kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. They are either run in an interpreter or, at setup
time, just-in-time compiled (JIT'ed) to run as native machine code.
Currently, x86_64, ARM64 and s390 architectures have eBPF JIT support,
whereas PPC, SPARC, ARM and MIPS have cBPF JIT support, but have not (yet)
switched to eBPF JIT support.
| 81 | |
eBPF's instruction set follows similar underlying principles as the cBPF
instruction set. It is, however, modelled closer to the underlying
architecture to better mimic native instruction sets, with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
a one-to-one mapping, which also opens up the possibility for compilers
to generate optimized eBPF code through an eBPF backend that performs
almost as fast as natively compiled code. Given that LLVM provides such
an eBPF backend, eBPF programs can be written in a
subset of the C language. In addition, the eBPF infrastructure comes
with a construct called "maps". eBPF maps are key/value stores that are
shared between multiple eBPF programs, but also between eBPF programs and
user space applications.
| 94 | |
For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
advantage over other classifiers and actions is that eBPF/cBPF provides the
generic framework, while users can implement their highly specialized use
cases efficiently. This means that a classifier or action written that
way will not suffer from feature bloat, and can therefore execute its task
highly efficiently. It allows for non-linear classification and even merging
the action part into the classification. Combined with efficient eBPF map
data structures, user space can push new policies like classids into the
kernel without reloading a classifier, or it can gather statistics that
are pushed into one map and use another one for dynamically load balancing
traffic based on the determined load, just to provide a few examples.
| 107 | |
| 108 | .SH PARAMETERS |
| 109 | .SS object-file |
| 110 | points to an object file that has an executable and linkable format (ELF) |
| 111 | and contains eBPF opcodes and eBPF map definitions. The LLVM compiler |
| 112 | infrastructure with |
| 113 | .B clang(1) |
| 114 | as a C language front end is one project that supports emitting eBPF object |
| 115 | files that can be passed to the eBPF classifier (more details in the |
| 116 | .B EXAMPLES |
| 117 | section). This option is mandatory when an eBPF classifier or action is |
| 118 | to be loaded. |
| 119 | |
| 120 | .SS section |
is the name of the ELF section from the object file, where the eBPF
classifier or action resides. By default, the section name for the
classifier is "classifier", and for the action "action". Given
that a single object file can contain multiple classifiers and actions,
the corresponding section name needs to be specified if it differs
from the defaults.
| 127 | |
| 128 | .SS export |
points to a Unix domain socket file. If the eBPF object file also
contains a section named "maps" with eBPF map specifications, the
map file descriptors can be handed off via the Unix domain socket to
an eBPF "agent" that keeps all descriptors beyond the lifetime of tc. This can be
some third party application implementing the IPC counterpart for the
import, which uses them for calling into the
| 135 | .B bpf(2) |
| 136 | system call to read out or update eBPF map data from user space, for |
| 137 | example, for monitoring purposes or to push down new policies. |
| 138 | |
| 139 | .SS verbose |
if set, dumps the eBPF verifier output even if loading the eBPF
program was successful. By default, the verifier log is only emitted
to the user on error.
| 143 | |
.SS skip_hw | skip_sw
hardware offload control flags. By default, tc will try to offload
filters to hardware if possible.
.B skip_hw
explicitly disables the attempt to offload.
.B skip_sw
forces the offload and disables running the eBPF program in the kernel.
If hardware offload is not possible and this flag is set, the kernel will
report an error and the filter will not be installed at all.
| 153 | |
.SS police
is an optional parameter for an eBPF/cBPF classifier that specifies a
policer in
.B tc(8)
which is attached to the classifier, for example, on an ingress qdisc.
| 159 | |
| 160 | .SS action |
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in
.B tc(8)
which is attached to the classifier.
| 165 | |
| 166 | .SS classid |
| 167 | .SS flowid |
provides the default traffic control class identifier for this eBPF/cBPF
classifier. The default class identifier can also be overridden by the
return code of the eBPF/cBPF program. A return code of
.B -1
selects the default class identifier provided here. A return
code of 0 implies that no match took place, and
a return code other than these two overrides the default classid. This
allows for efficient, non-linear classification with only a single eBPF/cBPF
program as opposed to having multiple individual programs for various class
identifiers which would need to reparse packet contents.
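
The return code rules above can be summarized in a small user space model.
This is only a sketch for illustration and not the kernel's cls_bpf
implementation; the names struct cls_result and resolve_classid() are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical user space model of the classifier return code rules;
 * this is not kernel code. */
struct cls_result {
        int matched;            /* 0 if the program reported no match */
        uint32_t classid;       /* only meaningful when matched != 0 */
};

static struct cls_result resolve_classid(uint32_t prog_ret,
                                         uint32_t default_classid)
{
        struct cls_result res = { 0, 0 };

        if (prog_ret == 0)
                return res;                     /* 0: no match */

        res.matched = 1;
        if (prog_ret == (uint32_t)-1)
                res.classid = default_classid;  /* -1: keep classid/flowid */
        else
                res.classid = prog_ret;         /* else: override default */
        return res;
}
```

With a default classid of 1:1 (0x00010001), a return code of -1 yields 1:1,
while a return code of 0x00010002 yields 1:2.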
| 178 | |
| 179 | .SS bytecode |
is used for loading cBPF classifiers and actions only. The cBPF bytecode
| 181 | is directly passed as a text string in the form of |
| 182 | .B \'s,c t f k,c t f k,c t f k,...\' |
| 183 | , where |
| 184 | .B s |
| 185 | denotes the number of subsequent 4-tuples. One such 4-tuple consists of |
| 186 | .B c t f k |
| 187 | decimals, where |
| 188 | .B c |
| 189 | represents the cBPF opcode, |
| 190 | .B t |
| 191 | the jump true offset target, |
| 192 | .B f |
| 193 | the jump false offset target and |
| 194 | .B k |
| 195 | the immediate constant/literal. There are various tools that generate code |
| 196 | in this loadable format, for example, |
| 197 | .B bpf_asm |
| 198 | that ships with the Linux kernel source tree under |
| 199 | .B tools/net/ |
, so the bytecode is not expected to be written by hand. The
| 201 | .B bytecode |
| 202 | or |
| 203 | .B bytecode-file |
| 204 | option is mandatory when a cBPF classifier or action is to be loaded. |
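
As an illustration of this text format, the following sketch serializes an
array of cBPF instructions into such a string. The struct cbpf_insn and the
function cbpf_format() are hypothetical names; the struct merely repeats the
four fields described above so that the example needs no kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in holding the four fields of one cBPF instruction. */
struct cbpf_insn {
        uint16_t code;  /* c: cBPF opcode */
        uint8_t  jt;    /* t: jump true offset */
        uint8_t  jf;    /* f: jump false offset */
        uint32_t k;     /* k: immediate constant/literal */
};

/* Writes "s,c t f k,c t f k,..." into buf. Returns the string length on
 * success, or -1 if the buffer is too small or formatting failed. */
static int cbpf_format(char *buf, size_t len,
                       const struct cbpf_insn *insns, unsigned int count)
{
        size_t off;
        int n;

        n = snprintf(buf, len, "%u", count);
        if (n < 0 || (size_t)n >= len)
                return -1;
        off = (size_t)n;

        for (unsigned int i = 0; i < count; i++) {
                n = snprintf(buf + off, len - off, ",%u %u %u %u",
                             (unsigned int)insns[i].code,
                             (unsigned int)insns[i].jt,
                             (unsigned int)insns[i].jf,
                             (unsigned int)insns[i].k);
                if (n < 0 || (size_t)n >= len - off)
                        return -1;
                off += (size_t)n;
        }
        return (int)off;
}
```

A single instruction with opcode 6, both jump offsets 0 and an immediate of
0xffffffff is formatted as "1,6 0 0 4294967295".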
| 205 | |
| 206 | .SS bytecode-file |
is also used to load a cBPF classifier or action. It is effectively the
same as
.B bytecode
, except that the cBPF bytecode is not passed directly via command line, but
rather resides in a text file.
| 212 | |
| 213 | .SH EXAMPLES |
| 214 | .SS eBPF TOOLING |
| 215 | A full blown example including eBPF agent code can be found inside the |
| 216 | iproute2 source package under: |
| 217 | .B examples/bpf/ |
| 218 | |
As prerequisites, the kernel needs to have the eBPF system call, namely
.B bpf(2)
, enabled and needs to ship with the
.B cls_bpf
and
.B act_bpf
kernel modules for the traffic control subsystem. To enable eBPF/cBPF JIT
support, depending on which of the two the given architecture supports:
| 227 | |
| 228 | .in +4n |
| 229 | .B echo 1 > /proc/sys/net/core/bpf_jit_enable |
| 230 | .in |
| 231 | |
| 232 | A given restricted C file can be compiled via LLVM as: |
| 233 | |
| 234 | .in +4n |
| 235 | .B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o |
| 236 | .in |
| 237 | |
The compiler invocation might still be simplified in the future, so for now,
it's quite handy to wrap this construct in one way or another, for
example:
| 241 | .in +4n |
| 242 | .nf |
| 243 | .sp |
__bcc() {
        clang -O2 -emit-llvm -c "$1" -o - | \\
        llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
}

alias bcc=__bcc
| 250 | .fi |
| 251 | .in |
| 252 | |
| 253 | A minimal, stand-alone unit, which matches on all traffic with the |
| 254 | default classid (return code of -1) looks like: |
| 255 | |
| 256 | .in +4n |
| 257 | .nf |
| 258 | .sp |
#include <linux/bpf.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

__section("classifier") int cls_main(struct __sk_buff *skb)
{
        return -1;
}

char __license[] __section("license") = "GPL";
| 271 | .fi |
| 272 | .in |
| 273 | |
| 274 | More examples can be found further below in subsection |
| 275 | .B eBPF PROGRAMMING |
as the focus here is on tooling.
| 277 | |
There can be various other sections, for example, also for actions.
Thus, an object file in eBPF can contain multiple entry points.
A specific entry point must, however, always be specified when
configuring with tc. A license must be part of the restricted C code
and the license string syntax is the same as with Linux kernel modules.
The kernel reserves the right to restrict some eBPF helper functions to
GPL compatible licenses only, and thus may reject a program
from loading into the kernel when such a license mismatch occurs.
| 286 | |
| 287 | The resulting object file from the compilation can be inspected with |
| 288 | the usual set of tools that also operate on normal object files, for |
| 289 | example |
| 290 | .B objdump(1) |
| 291 | for inspecting ELF section headers: |
| 292 | |
| 293 | .in +4n |
| 294 | .nf |
| 295 | .sp |
| 296 | objdump -h bpf.o |
| 297 | [...] |
| 298 | 3 classifier 000007f8 0000000000000000 0000000000000000 00000040 2**3 |
| 299 | CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 300 | 4 action-mark 00000088 0000000000000000 0000000000000000 00000838 2**3 |
| 301 | CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 302 | 5 action-rand 00000098 0000000000000000 0000000000000000 000008c0 2**3 |
| 303 | CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 304 | 6 maps 00000030 0000000000000000 0000000000000000 00000958 2**2 |
| 305 | CONTENTS, ALLOC, LOAD, DATA |
| 306 | 7 license 00000004 0000000000000000 0000000000000000 00000988 2**0 |
| 307 | CONTENTS, ALLOC, LOAD, DATA |
| 308 | [...] |
| 309 | .fi |
| 310 | .in |
| 311 | |
| 312 | Adding an eBPF classifier from an object file that contains a classifier |
| 313 | in the default ELF section is trivial (note that instead of "object-file" |
| 314 | also shortcuts such as "obj" can be used): |
| 315 | |
| 316 | .in +4n |
| 317 | .B bcc bpf.c |
| 318 | .br |
| 319 | .B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 |
| 320 | .in |
| 321 | |
| 322 | In case the classifier resides in ELF section "mycls", then that same |
| 323 | command needs to be invoked as: |
| 324 | |
| 325 | .in +4n |
| 326 | .B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1 |
| 327 | .in |
| 328 | |
| 329 | Dumping the classifier configuration will tell the location of the |
| 330 | classifier, in other words that it's from object file "bpf.o" under |
| 331 | section "mycls": |
| 332 | |
| 333 | .in +4n |
| 334 | .B tc filter show dev em1 |
| 335 | .br |
| 336 | .B filter parent 1: protocol all pref 49152 bpf |
| 337 | .br |
| 338 | .B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls] |
| 339 | .in |
| 340 | |
| 341 | The same program can also be installed on ingress qdisc side as opposed |
| 342 | to egress ... |
| 343 | |
| 344 | .in +4n |
| 345 | .B tc qdisc add dev em1 handle ffff: ingress |
| 346 | .br |
| 347 | .B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1 |
| 348 | .in |
| 349 | |
| 350 | \&... and again dumped from there: |
| 351 | |
| 352 | .in +4n |
| 353 | .B tc filter show dev em1 parent ffff: |
| 354 | .br |
| 355 | .B filter protocol all pref 49152 bpf |
| 356 | .br |
| 357 | .B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls] |
| 358 | .in |
| 359 | |
Attaching a classifier and action on ingress has the restriction that
ingress doesn't have an actual underlying queueing discipline. What ingress
| 362 | can do is to classify, mangle, redirect or drop packets. When queueing |
| 363 | is required on ingress side, then ingress must redirect packets to the |
| 364 | .B ifb |
| 365 | device, otherwise policing can be used. Moreover, ingress can be used to |
| 366 | have an early drop point of unwanted packets before they hit upper layers |
| 367 | of the networking stack, perform network accounting with eBPF maps that |
| 368 | could be shared with egress, or have an early mangle and/or redirection |
| 369 | point to different networking devices. |
| 370 | |
Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:
| 375 | |
| 376 | .in +4n |
| 377 | .B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e |
| 378 | .br |
| 379 | .in +25n |
| 380 | .B action bpf obj bpf.o sec action-mark \e |
| 381 | .br |
| 382 | .B action bpf obj bpf.o sec action-rand ok |
| 383 | .in -25n |
| 384 | .in -4n |
| 385 | |
| 386 | The advantage of this is that the classifier and the two actions can |
| 387 | then share eBPF maps with each other, if implemented in the programs. |
| 388 | |
| 389 | In order to access eBPF maps from user space beyond |
| 390 | .B tc(8) |
| 391 | setup lifetime, the ownership can be transferred to an eBPF agent via |
| 392 | Unix domain sockets. There are two possibilities for implementing this: |
| 393 | |
| 394 | .B 1) |
implement a custom eBPF agent that takes care of setting up
the Unix domain socket and implements the protocol that
| 397 | .B tc(8) |
| 398 | dictates. A code example of this can be found inside the iproute2 |
| 399 | source package under: |
| 400 | .B examples/bpf/ |
| 401 | |
| 402 | .B 2) |
| 403 | use |
| 404 | .B tc exec |
| 405 | for transferring the eBPF map file descriptors through a Unix domain |
| 406 | socket, and spawning an application such as |
| 407 | .B sh(1) |
\&. The advantage of this approach is that tc will place the file descriptors
into the environment and thus make them available just like the stdin, stdout
and stderr file descriptors. This means that user applications run from within
this fd-owner shell can terminate and restart without losing eBPF
map file descriptors. Example invocation with the previous classifier and
action mixture:
| 414 | |
| 415 | .in +4n |
| 416 | .B tc exec bpf imp /tmp/bpf |
| 417 | .br |
| 418 | .B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e |
| 419 | .br |
| 420 | .in +25n |
| 421 | .B action bpf obj bpf.o sec action-mark \e |
| 422 | .br |
| 423 | .B action bpf obj bpf.o sec action-rand ok |
| 424 | .in -25n |
| 425 | .in -4n |
| 426 | |
Assuming that eBPF maps are shared with classifier and actions, it's
enough to export them once, for example, from within the classifier
or action command. tc will set up all eBPF map file descriptors at the
time when the object file is first parsed.
| 431 | |
| 432 | When a shell has been spawned, the environment will have a couple of |
| 433 | eBPF related variables. BPF_NUM_MAPS provides the total number of maps |
| 434 | that have been transferred over the Unix domain socket. BPF_MAP<X>'s |
| 435 | value is the file descriptor number that can be accessed in eBPF agent |
| 436 | applications, in other words, it can directly be used as the file |
| 437 | descriptor value for the |
| 438 | .B bpf(2) |
| 439 | system call to retrieve or alter eBPF map values. <X> denotes the |
| 440 | identifier of the eBPF map. It corresponds to the |
| 441 | .B id |
| 442 | member of |
| 443 | .B struct bpf_elf_map |
| 444 | \& from the tc eBPF map specification. |
| 445 | |
| 446 | The environment in this example looks as follows: |
| 447 | |
| 448 | .in +4n |
| 449 | .nf |
| 450 | .sp |
| 451 | sh# env | grep BPF |
| 452 | BPF_NUM_MAPS=3 |
| 453 | BPF_MAP1=6 |
| 454 | BPF_MAP0=5 |
| 455 | BPF_MAP2=7 |
| 456 | sh# ls -la /proc/self/fd |
| 457 | [...] |
| 458 | lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map |
| 459 | lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map |
| 460 | lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map |
| 461 | sh# my_bpf_agent |
| 462 | .fi |
| 463 | .in |
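
An agent started from such a shell could pick up the descriptors from the
environment roughly as sketched below. The variable names BPF_NUM_MAPS and
BPF_MAP<X> are taken from the description above, whereas the function names
env_to_int(), bpf_env_num_maps() and bpf_env_map_fd() are hypothetical and
only serve this illustration.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses a non-negative decimal integer from the named environment
 * variable. Returns -1 if it is unset, empty or malformed. */
static int env_to_int(const char *name)
{
        const char *val = getenv(name);
        char *end;
        long num;

        if (!val || !*val)
                return -1;
        num = strtol(val, &end, 10);
        if (*end != 0 || num < 0 || num > INT_MAX)
                return -1;
        return (int)num;
}

/* Number of maps handed over by tc, or -1 if not available. */
static int bpf_env_num_maps(void)
{
        return env_to_int("BPF_NUM_MAPS");
}

/* File descriptor of the map with the given identifier, or -1. */
static int bpf_env_map_fd(unsigned int id)
{
        char name[32];

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        return env_to_int(name);
}
```

In the environment shown above, bpf_env_map_fd(1) would return 6. The
returned value can then be passed as the map file descriptor to the
.B bpf(2)
system call.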
| 464 | |
eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and, based on that feedback, for
example, rewrite classids in eBPF map values during runtime. Given that eBPF
agents are implemented as normal applications, they can also dynamically
receive traffic control policies from external controllers and push
them down into eBPF maps to dynamically adapt to network conditions. Moreover,
eBPF maps can also be shared with other eBPF program types (e.g. tracing),
so very powerful combinations can be implemented.
| 473 | |
| 474 | .SS eBPF PROGRAMMING |
| 475 | |
eBPF classifiers and actions are implemented in restricted C syntax
(in the future, additional language frontends could be supported).
| 478 | |
| 479 | The header file |
| 480 | .B linux/bpf.h |
| 481 | provides eBPF helper functions that can be called from an eBPF program. |
This man page only provides two minimal, stand-alone examples. Have a
look at
| 484 | .B examples/bpf |
| 485 | from the iproute2 source package for a fully fledged flow dissector |
| 486 | example to better demonstrate some of the possibilities with eBPF. |
| 487 | |
| 488 | Supported 32 bit classifier return codes from the C program and their meanings: |
| 489 | .in +4n |
| 490 | .B 0 |
| 491 | , denotes a mismatch |
| 492 | .br |
| 493 | .B -1 |
| 494 | , denotes the default classid configured from the command line |
| 495 | .br |
| 496 | .B else |
| 497 | , everything else will override the default classid to provide a facility for |
| 498 | non-linear matching |
| 499 | .in |
| 500 | |
| 501 | Supported 32 bit action return codes from the C program and their meanings ( |
| 502 | .B linux/pkt_cls.h |
| 503 | ): |
| 504 | .in +4n |
| 505 | .B TC_ACT_OK (0) |
| 506 | , will terminate the packet processing pipeline and allows the packet to |
| 507 | proceed |
| 508 | .br |
| 509 | .B TC_ACT_SHOT (2) |
| 510 | , will terminate the packet processing pipeline and drops the packet |
| 511 | .br |
| 512 | .B TC_ACT_UNSPEC (-1) |
| 513 | , will use the default action configured from tc (similarly as returning |
| 514 | .B -1 |
| 515 | from a classifier) |
| 516 | .br |
| 517 | .B TC_ACT_PIPE (3) |
| 518 | , will iterate to the next action, if available |
| 519 | .br |
| 520 | .B TC_ACT_RECLASSIFY (1) |
| 521 | , will terminate the packet processing pipeline and start classification |
| 522 | from the beginning |
| 523 | .br |
| 524 | .B else |
| 525 | , everything else is an unspecified return code |
| 526 | .in |
| 527 | |
| 528 | Both classifier and action return codes are supported in eBPF and cBPF |
| 529 | programs. |
| 530 | |
| 531 | To demonstrate restricted C syntax, a minimal toy classifier example is |
| 532 | provided, which assumes that egress packets, for instance originating |
| 533 | from a container, have previously been marked in interval [0, 255]. The |
| 534 | program keeps statistics on different marks for user space and maps the |
| 535 | classid to the root qdisc with the marking itself as the minor handle: |
| 536 | |
| 537 | .in +4n |
| 538 | .nf |
| 539 | .sp |
#include <stdint.h>
#include <asm/types.h>

#include <linux/bpf.h>
#include <linux/pkt_sched.h>

#include "helpers.h"

struct tuple {
        long packets;
        long bytes;
};

#define BPF_MAP_ID_STATS        1 /* agent's map identifier */
#define BPF_MAX_MARK            256

struct bpf_elf_map __section("maps") map_stats = {
        .type           =       BPF_MAP_TYPE_ARRAY,
        .id             =       BPF_MAP_ID_STATS,
        .size_key       =       sizeof(uint32_t),
        .size_value     =       sizeof(struct tuple),
        .max_elem       =       BPF_MAX_MARK,
};

static inline void cls_update_stats(const struct __sk_buff *skb,
                                    uint32_t mark)
{
        struct tuple *tu;

        tu = bpf_map_lookup_elem(&map_stats, &mark);
        if (likely(tu)) {
                __sync_fetch_and_add(&tu->packets, 1);
                __sync_fetch_and_add(&tu->bytes, skb->len);
        }
}

__section("cls") int cls_main(struct __sk_buff *skb)
{
        uint32_t mark = skb->mark;

        if (unlikely(mark >= BPF_MAX_MARK))
                return 0;

        cls_update_stats(skb, mark);

        return TC_H_MAKE(TC_H_ROOT, mark);
}

char __license[] __section("license") = "GPL";
| 589 | .fi |
| 590 | .in |
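
To see what classid the classifier above returns for a given mark, the
handle arithmetic can be reproduced in user space. The macros below repeat
the definitions from
.B linux/pkt_sched.h
so that the sketch compiles without kernel headers; the helper names
classid_for_mark(), classid_major() and classid_minor() are hypothetical.

```c
#include <stdint.h>

/* Handle macros as defined in <linux/pkt_sched.h>, repeated here only
 * to keep the sketch self-contained. */
#define TC_H_MAJ_MASK   (0xFFFF0000U)
#define TC_H_MIN_MASK   (0x0000FFFFU)
#define TC_H_ROOT       (0xFFFFFFFFU)
#define TC_H_MAKE(maj, min) \
        (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

/* Same expression as the return statement of cls_main(). */
static uint32_t classid_for_mark(uint32_t mark)
{
        return TC_H_MAKE(TC_H_ROOT, mark);
}

/* Upper 16 bits: major number of the handle. */
static uint32_t classid_major(uint32_t classid)
{
        return (classid & TC_H_MAJ_MASK) >> 16;
}

/* Lower 16 bits: minor number of the handle. */
static uint32_t classid_minor(uint32_t classid)
{
        return classid & TC_H_MIN_MASK;
}
```

For a mark of 42 this yields 0xffff002a, in other words major ffff and
minor 2a in the hexadecimal notation used by tc.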
| 591 | |
Another small example is a port redirector which demuxes destination port
80 into the interval [8080, 8087] steered by RSS, and which can then be
attached to the ingress qdisc. Adding the egress counterpart and IPv6
support is left as an exercise to the reader:
| 596 | |
| 597 | .in +4n |
| 598 | .nf |
| 599 | .sp |
#include <asm/types.h>
#include <asm/byteorder.h>

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#include "helpers.h"

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                 __u16 old_port, __u16 new_port)
{
        bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                            old_port, new_port, sizeof(new_port));
        bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                            &new_port, sizeof(new_port), 0);
}

static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
{
        __u16 dport, dport_new = 8080, off;
        __u8 ip_proto, ip_vl;

        ip_proto = load_byte(skb, nh_off +
                             offsetof(struct iphdr, protocol));
        if (ip_proto != IPPROTO_TCP)
                return 0;

        ip_vl = load_byte(skb, nh_off);
        if (likely(ip_vl == 0x45))
                nh_off += sizeof(struct iphdr);
        else
                nh_off += (ip_vl & 0xF) << 2;

        dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
        if (dport != 80)
                return 0;

        off = skb->queue_mapping & 7;
        set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                      __cpu_to_be16(dport_new + off));
        return -1;
}

__section("lb") int lb_main(struct __sk_buff *skb)
{
        int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

        if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                ret = lb_do_ipv4(skb, nh_off);

        return ret;
}

char __license[] __section("license") = "GPL";
| 658 | .fi |
| 659 | .in |
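
Two pieces of arithmetic in lb_do_ipv4() may be easier to follow in
isolation: the IPv4 header length derived from the first header byte, and
the new destination port derived from the receive queue. The following user
space sketch repeats both computations; the function names ipv4_hdr_len()
and demux_port() are hypothetical.

```c
#include <stdint.h>

/* The low nibble of the first IPv4 header byte is the header length in
 * 32 bit words, so shifting left by 2 converts it to bytes. */
static unsigned int ipv4_hdr_len(uint8_t ip_vl)
{
        return (unsigned int)(ip_vl & 0xF) << 2;
}

/* Masking the queue index with 7 yields an offset in [0, 7], so the
 * result stays within [base, base + 7]. */
static uint16_t demux_port(uint16_t base, uint32_t queue_mapping)
{
        return (uint16_t)(base + (queue_mapping & 7));
}
```

A first byte of 0x45 (IPv4 without options) gives a header length of 20
bytes, and with a base port of 8080 a queue index of 15 maps to port 8087.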
| 660 | |
| 661 | The related helper header file |
| 662 | .B helpers.h |
used in both examples is:
| 664 | |
| 665 | .in +4n |
| 666 | .nf |
| 667 | .sp |
/* Misc helper macros. */
#define __section(x) __attribute__((section(x), used))
#define offsetof(x, y) __builtin_offsetof(x, y)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Used map structure */
struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 id;
};

/* Some used BPF function calls. */
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                  int len, int flags) =
        (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                  int to, int flags) =
        (void *) BPF_FUNC_l4_csum_replace;
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

/* Some used BPF intrinsics. */
unsigned long long load_byte(void *skb, unsigned long long off)
        asm ("llvm.bpf.load.byte");
unsigned long long load_half(void *skb, unsigned long long off)
        asm ("llvm.bpf.load.half");
| 698 | .fi |
| 699 | .in |
| 700 | |
As a best practice, we recommend having only a single eBPF classifier loaded
in tc and performing
.B all
necessary matching and mangling from there instead of using a list of
individual classifiers and separate actions. A single classifier tailored
for a given use case will be most efficient to run.
| 707 | |
| 708 | .SS eBPF DEBUGGING |
| 709 | |
| 710 | Both tc |
| 711 | .B filter |
| 712 | and |
| 713 | .B action |
| 714 | commands for |
| 715 | .B bpf |
| 716 | support an optional |
| 717 | .B verbose |
| 718 | parameter that can be used to inspect the eBPF verifier log. It is dumped |
| 719 | by default in case of an error. |
| 720 | |
| 721 | In case the eBPF/cBPF JIT compiler has been enabled, it can also be |
| 722 | instructed to emit a debug output of the resulting opcode image into |
| 723 | the kernel log, which can be read via |
| 724 | .B dmesg(1) |
| 725 | : |
| 726 | |
| 727 | .in +4n |
| 728 | .B echo 2 > /proc/sys/net/core/bpf_jit_enable |
| 729 | .in |
| 730 | |
| 731 | The Linux kernel source tree ships additionally under |
| 732 | .B tools/net/ |
| 733 | a small helper called |
| 734 | .B bpf_jit_disasm |
| 735 | that reads out the opcode image dump from the kernel log and dumps the |
| 736 | resulting disassembly: |
| 737 | |
| 738 | .in +4n |
| 739 | .B bpf_jit_disasm -o |
| 740 | .in |
| 741 | |
| 742 | Other than that, the Linux kernel also contains an extensive eBPF/cBPF |
| 743 | test suite module called |
| 744 | .B test_bpf |
| 745 | \&. Upon ... |
| 746 | |
| 747 | .in +4n |
| 748 | .B modprobe test_bpf |
| 749 | .in |
| 750 | |
\&... it runs a variety of test cases and dumps the results into
| 752 | the kernel log that can be inspected with |
| 753 | .B dmesg(1) |
| 754 | \&. The results can differ depending on whether the JIT compiler is enabled |
| 755 | or not. In case of failed test cases, the module will fail to load. In |
| 756 | such cases, we urge you to file a bug report to the related JIT authors, |
| 757 | Linux kernel and networking mailing lists. |
| 758 | |
| 759 | .SS cBPF |
| 760 | |
| 761 | Although we generally recommend switching to implementing |
| 762 | .B eBPF |
classifiers and actions, for the sake of completeness, a few words on how to
program in cBPF are in order here.
| 765 | |
| 766 | Likewise, the |
| 767 | .B bpf_jit_enable |
| 768 | switch can be enabled as mentioned already. Tooling such as |
| 769 | .B bpf_jit_disasm |
also works independently of whether eBPF or cBPF code is being loaded.
| 771 | |
Unlike in eBPF, classifiers and actions are not implemented in restricted C,
but rather in a minimal assembler-like language or with the help of other
tooling.
| 775 | |
| 776 | The raw interface with tc takes opcodes directly. For example, the most |
| 777 | minimal classifier matching on every packet resulting in the default |
| 778 | classid of 1:1 looks like: |
| 779 | |
| 780 | .in +4n |
| 781 | .B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1 |
| 782 | .in |

The first decimal of the bytecode sequence denotes the number of subsequent
4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. Here, this denotes an unconditional return
from the program with an immediate value of -1, written as the unsigned
32-bit decimal 4294967295.
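.sp
As a rough illustration of this format, the following C sketch renders an
in-memory instruction array into the string expected by the
.B bytecode
option. The names
.B struct cbpf_insn
and
.B cbpf_format()
are hypothetical and chosen for this example only; the structure is assumed
to mirror the kernel's
.B struct sock_filter
so that the sketch needs no kernel or libpcap headers. Given the single
instruction { 0x06, 0, 0, 0xffffffff }, it produces 1,6 0 0 4294967295,
which is the program from the tc invocation above without the trailing comma.

.in +4n
.nf
.sp
```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in, assumed to mirror struct sock_filter. */
struct cbpf_insn {
    uint16_t code; /* c: cBPF opcode */
    uint8_t  jt;   /* t: jump true offset */
    uint8_t  jf;   /* f: jump false offset */
    uint32_t k;    /* k: immediate constant/literal */
};

/*
 * Write "<len>,c t f k,c t f k,..." into buf.
 * Returns the length of the string, or -1 if buf is too small.
 */
int cbpf_format(const struct cbpf_insn *prog, size_t len,
                char *buf, size_t size)
{
    size_t off, i;
    int n;

    /* Leading decimal: number of 4-tuples that follow. */
    n = snprintf(buf, size, "%zu", len);
    if (n < 0 || (size_t)n >= size)
        return -1;
    off = (size_t)n;

    for (i = 0; i < len; i++) {
        n = snprintf(buf + off, size - off, ",%u %u %u %u",
                     (unsigned)prog[i].code, (unsigned)prog[i].jt,
                     (unsigned)prog[i].jf, (unsigned)prog[i].k);
        if (n < 0 || (size_t)n >= size - off)
            return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```
.fi
.in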

For egress classification, Willem de Bruijn implemented a minimal stand-alone
helper tool under the GNU General Public License version 2 for the
.B iptables(8)
BPF extension, which abuses the
.B libpcap
internal classic BPF compiler. His code is adapted here for usage with
.B tc(8)
:

.in +4n
.nf
.sp
#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        struct bpf_program prog;
        struct bpf_insn *ins;
        int i, ret, dlt = DLT_RAW;

        /* Arguments: [LINKTYPE] FILTER_EXPRESSION */
        if (argc < 2 || argc > 3)
                return 1;
        if (argc == 3) {
                dlt = pcap_datalink_name_to_val(argv[1]);
                if (dlt == -1)
                        return 1;
        }

        ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                  1, PCAP_NETMASK_UNKNOWN);
        if (ret)
                return 1;

        /* Instruction count first, then one "c t f k" tuple each. */
        printf("%u,", prog.bf_len);
        ins = prog.bf_insns;

        for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                printf("%u %u %u %u,", ins->code,
                       ins->jt, ins->jf, ins->k);
        /* The last tuple is printed without a trailing comma. */
        printf("%u %u %u %u",
               ins->code, ins->jt, ins->jf, ins->k);

        pcap_freecode(&prog);
        return 0;
}
.fi
.in

Given this small helper, compiled here under the name
.B bpftool
(not to be confused with the kernel's eBPF utility of the same name), any
.B tcpdump(8)
filter expression can be abused as a classifier where a match will
result in the default classid:

.in +4n
.B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

Basically, such a minimal generator is equivalent to:

.in +4n
.B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\\\n' ',' > /var/bpf/tcp-syn
.in

Since
.B libpcap
does not support all Linux-specific cBPF extensions in its compiler, the
Linux kernel also ships under
.B tools/net/
a minimal BPF assembler called
.B bpf_asm
for providing full control. For detailed syntax and semantics on implementing
such programs by hand, see the references under
.B FURTHER READING
\&.

A trivial toy example in
.B bpf_asm
for classifying IPv4/TCP packets, saved in a text file called
.B foobar
:

.in +4n
.nf
.sp
      ldh [12]
      jne #0x800, drop
      ldb [23]
      jneq #6, drop
      ret #-1
drop: ret #0
.fi
.in

Similarly, such a classifier can be loaded as:

.in +4n
.B bpf_asm foobar > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in
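.sp
To make the semantics of the
.B foobar
listing more concrete, the following C sketch hand-assembles its six
instructions and runs them through a tiny interpreter that understands only
the four opcodes involved. The names
.B struct cbpf_insn
,
.B foobar[]
and
.B cbpf_run()
are hypothetical. The encoding of jne and jneq as a jump-if-equal opcode with
a zero true offset is an assumption about what
.B bpf_asm
emits and should be compared against the output of the real tool. For an
Ethernet frame whose bytes 12 and 13 hold 0x0800 and whose byte 23 holds 6,
the sketch returns 4294967295 (match); for any other frame it returns 0
(no match).

.in +4n
.nf
.sp
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, assumed to mirror struct sock_filter. */
struct cbpf_insn {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* The four classic BPF opcodes used by the listing. */
enum {
    OP_LDH_ABS = 0x28, /* A = 16-bit big-endian value at pkt[k] */
    OP_LDB_ABS = 0x30, /* A = byte at pkt[k] */
    OP_JEQ_K   = 0x15, /* skip jt insns if A == k, else jf insns */
    OP_RET_K   = 0x06  /* return k */
};

/* Hand-assembled equivalent of the foobar listing (assumed encoding). */
const struct cbpf_insn foobar[] = {
    { OP_LDH_ABS, 0, 0, 12 },          /* ldh [12]         */
    { OP_JEQ_K,   0, 3, 0x800 },       /* jne #0x800, drop */
    { OP_LDB_ABS, 0, 0, 23 },          /* ldb [23]         */
    { OP_JEQ_K,   0, 1, 6 },           /* jneq #6, drop    */
    { OP_RET_K,   0, 0, 0xffffffffu }, /* ret #-1          */
    { OP_RET_K,   0, 0, 0 },           /* drop: ret #0     */
};

/* Run prog on pkt; failed loads and unknown opcodes return 0. */
uint32_t cbpf_run(const struct cbpf_insn *prog, size_t len,
                  const uint8_t *pkt, size_t pkt_len)
{
    uint32_t a = 0; /* accumulator register A */
    size_t pc;

    for (pc = 0; pc < len; pc++) {
        const struct cbpf_insn *in = &prog[pc];

        switch (in->code) {
        case OP_LDH_ABS:
            if (pkt_len < 2 || in->k > pkt_len - 2)
                return 0;
            a = ((uint32_t)pkt[in->k] << 8) | pkt[in->k + 1];
            break;
        case OP_LDB_ABS:
            if (in->k >= pkt_len)
                return 0;
            a = pkt[in->k];
            break;
        case OP_JEQ_K:
            /* Offsets are relative to the next instruction. */
            pc += (a == in->k) ? in->jt : in->jf;
            break;
        case OP_RET_K:
            return in->k;
        default:
            return 0;
        }
    }
    return 0;
}
```
.fi
.in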

For BPF classifiers, the Linux kernel additionally provides under
.B tools/net/
a small BPF debugger called
.B bpf_dbg
, which can be used to test a classifier against pcap files, single-step
or add various breakpoints into the classifier program and dump register
contents during runtime.

Implementing an action in classic BPF is rather limited in the sense that
packet mangling is not supported. Therefore, it's generally recommended to
make the switch to eBPF whenever possible.

.SH FURTHER READING
Further and more technical details about the BPF architecture can be found
in the Linux kernel source tree under
.B Documentation/networking/filter.txt
\&.

Further details on eBPF
.B tc(8)
examples can be found in the iproute2 source
tree under
.B examples/bpf/
\&.

.SH SEE ALSO
.BR tc (8),
.BR tc-ematch (8),
.BR bpf (2),
.BR bpf (4)

.SH AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel networking
mailing list:
.B <netdev@vger.kernel.org>