.TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux"
.SH NAME
BPF \- BPF programmable classifier and actions for ingress/egress
queueing disciplines
.SH SYNOPSIS
.SS eBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
] [
.B skip_hw
|
.B skip_sw
] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
]

.SS cBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ]

.SH DESCRIPTION

Extended Berkeley Packet Filter (
.B eBPF
) and classic Berkeley Packet Filter
(originally known as BPF, referred to here as
.B cBPF
for better distinction) are both available as a fully programmable and
highly efficient classifier and actions. Both offer a minimal instruction
set for implementing small programs that can be safely loaded into the
kernel and executed there in a tiny virtual machine. An in-kernel
verifier guarantees that a specified program always terminates and neither
crashes nor leaks data from the kernel.

In Linux, eBPF is generally considered the successor of cBPF.
The kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. Programs are either run in an interpreter or, at setup
time, just-in-time compiled (JIT'ed) to run as native machine code. At the
time of writing, the x86_64, ARM64 and s390 architectures have eBPF JIT
support, whereas PPC, SPARC, ARM and MIPS have cBPF JIT support, but have
not (yet) switched to eBPF JIT support.

eBPF's instruction set follows principles similar to those of the cBPF
instruction set, but it is modelled closer to the underlying
architecture to better mimic native instruction sets, with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
a one-to-one mapping, which also opens up the possibility for compilers
to generate optimized eBPF code through an eBPF backend that performs
almost as fast as natively compiled code. Given that LLVM provides such
an eBPF backend, eBPF programs can easily be written in a
subset of the C language. In addition, the eBPF infrastructure comes
with a construct called "maps". eBPF maps are key/value stores that are
shared between multiple eBPF programs, but also between eBPF programs and
user space applications.

For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
advantage over other classifiers and actions is that eBPF/cBPF provides the
generic framework, while users can implement their highly specialized use
cases efficiently. This means that a classifier or action written that
way will not suffer from feature bloat, and can therefore execute its task
highly efficiently. It allows for non-linear classification and even merging
the action part into the classification. Combined with efficient eBPF map
data structures, user space can push new policies like classids into the
kernel without reloading a classifier, or it can gather statistics that
are pushed into one map and use another one for dynamically load balancing
traffic based on the determined load, just to provide a few examples.

.SH PARAMETERS
.SS object-file
points to an object file that has an executable and linkable format (ELF)
and contains eBPF opcodes and eBPF map definitions. The LLVM compiler
infrastructure with
.B clang(1)
as a C language front end is one project that supports emitting eBPF object
files that can be passed to the eBPF classifier (more details in the
.B EXAMPLES
section). This option is mandatory when an eBPF classifier or action is
to be loaded.

.SS section
is the name of the ELF section from the object file, where the eBPF
classifier or action resides. By default, the section name for the
classifier is "classifier", and for the action "action". Given
that a single object file can contain multiple classifiers and actions,
the corresponding section name needs to be specified if it differs
from the defaults.

.SS export
points to a Unix domain socket file. In case the eBPF object file also
contains a section named "maps" with eBPF map specifications, then the
map file descriptors can be handed off via the Unix domain socket to
an eBPF "agent" herding all descriptors after tc lifetime. This can be
some third party application implementing the IPC counterpart for the
import, which uses them for calling into the
.B bpf(2)
system call to read out or update eBPF map data from user space, for
example, for monitoring purposes or to push down new policies.

.SS verbose
if set, it will dump the eBPF verifier output, even if loading the eBPF
program was successful. By default, the verifier log is emitted to the
user only on error.

.SS skip_hw | skip_sw
hardware offload control flags. By default, tc will try to offload
filters to hardware if possible.
.B skip_hw
explicitly disables the attempt to offload.
.B skip_sw
forces the offload and disables running the eBPF program in the kernel.
If hardware offload is not possible and this flag is set, the kernel will
report an error and the filter will not be installed at all.

.SS police
is an optional parameter for an eBPF/cBPF classifier that specifies a
policer in
.B tc(8)
which is attached to the classifier, for example, on an ingress qdisc.

.SS action
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in
.B tc(8)
which is attached to a classifier.

.SS classid
.SS flowid
provides the default traffic control class identifier for this eBPF/cBPF
classifier. The default class identifier can also be overwritten by the
return code of the eBPF/cBPF program. A return code of
.B -1
selects the default class identifier provided here. A return code of 0
implies that no match took place, and any other return code overrides the
default classid. This allows for efficient, non-linear classification with
only a single eBPF/cBPF program, as opposed to having multiple individual
programs for various class identifiers, each of which would need to reparse
packet contents.

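The return code handling described above can be sketched in plain user
space C. This is only an illustration of the documented semantics and not
code taken from the kernel; the name resolve_classid is hypothetical:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the documented return code semantics:
 *   0         -> no match, *classid is left untouched
 *   -1        -> match, the default classid from the command line is used
 *   otherwise -> match, the return code itself overrides the default
 */
static bool resolve_classid(int32_t ret, uint32_t default_classid,
                            uint32_t *classid)
{
    assert(classid != NULL);

    if (ret == 0)
        return false;

    *classid = (ret == -1) ? default_classid : (uint32_t)ret;
    return true;
}
```
.fi
.in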
.SS bytecode
is used for loading cBPF classifiers and actions only. The cBPF bytecode
is passed directly as a text string in the form of
.B \'s,c t f k,c t f k,c t f k,...\'
, where
.B s
denotes the number of subsequent 4-tuples. Each such 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. There are various tools that generate code
in this loadable format, for example,
.B bpf_asm
that ships with the Linux kernel source tree under
.B tools/net/
, so nobody is expected to write this by hand. The
.B bytecode
or
.B bytecode-file
option is mandatory when a cBPF classifier or action is to be loaded.

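As an illustration of this text format, the following user space C sketch
renders an array of instructions as a bytecode string. The names
struct cbpf_insn and cbpf_format are made up for this example, and real
generators may differ in details such as a trailing comma:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one "c t f k" 4-tuple. */
struct cbpf_insn {
    uint16_t c;    /* cBPF opcode */
    uint8_t  t;    /* jump true offset */
    uint8_t  f;    /* jump false offset */
    uint32_t k;    /* immediate constant/literal */
};

/*
 * Write "s,c t f k,c t f k,..." into buf.
 * Returns the length of the string, or -1 if buf is too small.
 */
static int cbpf_format(const struct cbpf_insn *insns, unsigned int n,
                       char *buf, size_t len)
{
    size_t used;
    unsigned int i;
    int ret;

    assert(insns != NULL && buf != NULL);

    /* Leading element: the number of 4-tuples that follow. */
    ret = snprintf(buf, len, "%u", n);
    if (ret < 0 || (size_t)ret >= len)
        return -1;
    used = (size_t)ret;

    for (i = 0; i < n; i++) {
        ret = snprintf(buf + used, len - used, ",%u %u %u %u",
                       (unsigned int)insns[i].c,
                       (unsigned int)insns[i].t,
                       (unsigned int)insns[i].f,
                       (unsigned int)insns[i].k);
        if (ret < 0 || (size_t)ret >= len - used)
            return -1;
        used += (size_t)ret;
    }

    return (int)used;
}
```
.fi
.in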
.SS bytecode-file
is also used to load a cBPF classifier or action. It is effectively the
same as
.B bytecode
, except that the cBPF bytecode is not passed directly on the command line
but rather resides in a text file.

.SH EXAMPLES
.SS eBPF TOOLING
A full blown example including eBPF agent code can be found inside the
iproute2 source package under:
.B examples/bpf/

As prerequisites, the kernel needs to have the eBPF system call
.B bpf(2)
enabled and to ship with the
.B cls_bpf
and
.B act_bpf
kernel modules for the traffic control subsystem. To enable eBPF/cBPF JIT
support, depending on which of the two the given architecture supports:

.in +4n
.B echo 1 > /proc/sys/net/core/bpf_jit_enable
.in

A given restricted C file can be compiled via LLVM as:

.in +4n
.B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o
.in

The compiler invocation might still be simplified in the future, so for now
it's quite handy to wrap this construct in one way or another, for
example:
.in +4n
.nf
.sp
__bcc() {
    clang -O2 -emit-llvm -c $1 -o - | \\
    llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}

alias bcc=__bcc
.fi
.in

A minimal, stand-alone unit, which matches on all traffic with the
default classid (return code of -1) looks like:

.in +4n
.nf
.sp
#include <linux/bpf.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

__section("classifier") int cls_main(struct __sk_buff *skb)
{
    return -1;
}

char __license[] __section("license") = "GPL";
.fi
.in

More examples can be found further below in subsection
.B eBPF PROGRAMMING
, as the focus here is on tooling.

There can be various other sections, for example, also for actions.
Thus, an object file in eBPF can contain multiple entry points.
However, a specific entry point must always be specified when
configuring with tc. A license must be part of the restricted C code
and the license string syntax is the same as with Linux kernel modules.
The kernel reserves the right to restrict some eBPF helper functions
to GPL compatible licenses only, and thus may reject a program
from loading into the kernel when such a license mismatch occurs.

The resulting object file from the compilation can be inspected with
the usual set of tools that also operate on normal object files, for
example
.B objdump(1)
for inspecting ELF section headers:

.in +4n
.nf
.sp
objdump -h bpf.o
[...]
  3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                  CONTENTS, ALLOC, LOAD, DATA
[...]
.fi
.in

Adding an eBPF classifier from an object file that contains a classifier
in the default ELF section is trivial (note that instead of "object-file"
also shortcuts such as "obj" can be used):

.in +4n
.B bcc bpf.c
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1
.in

In case the classifier resides in ELF section "mycls", then that same
command needs to be invoked as:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1
.in

Dumping the classifier configuration will tell the location of the
classifier, in other words that it's from object file "bpf.o" under
section "mycls":

.in +4n
.B tc filter show dev em1
.br
.B filter parent 1: protocol all pref 49152 bpf
.br
.B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]
.in

The same program can also be installed on ingress qdisc side as opposed
to egress ...

.in +4n
.B tc qdisc add dev em1 handle ffff: ingress
.br
.B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1
.in

\&... and again dumped from there:

.in +4n
.B tc filter show dev em1 parent ffff:
.br
.B filter protocol all pref 49152 bpf
.br
.B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]
.in

Attaching a classifier and action on ingress has the restriction that
it doesn't have an actual underlying queueing discipline. What ingress
can do is to classify, mangle, redirect or drop packets. When queueing
is required on ingress side, then ingress must redirect packets to the
.B ifb
device, otherwise policing can be used. Moreover, ingress can be used to
have an early drop point of unwanted packets before they hit upper layers
of the networking stack, perform network accounting with eBPF maps that
could be shared with egress, or have an early mangle and/or redirection
point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e
.br
.in +25n
.B action bpf obj bpf.o sec action-mark \e
.br
.B action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

The advantage of this is that the classifier and the two actions can
then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond
.B tc(8)
setup lifetime, the ownership can be transferred to an eBPF agent via
Unix domain sockets. There are two possibilities for implementing this:

.B 1)
implement a custom eBPF agent that takes care of setting up
the Unix domain socket and implementing the protocol that
.B tc(8)
dictates. A code example of this can be found inside the iproute2
source package under:
.B examples/bpf/

.B 2)
use
.B tc exec
for transferring the eBPF map file descriptors through a Unix domain
socket, and spawning an application such as
.B sh(1)
\&. This approach's advantage is that tc will place the file descriptors
into the environment and thus make them available just like the stdin,
stdout and stderr file descriptors. This means that user applications run
from within this fd-owner shell can terminate and restart without losing
the eBPF map file descriptors. Example invocation with the previous
classifier and action mixture:

.in +4n
.B tc exec bpf imp /tmp/bpf
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e
.br
.in +25n
.B action bpf obj bpf.o sec action-mark \e
.br
.B action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

Assuming that eBPF maps are shared with classifier and actions, it's
enough to export them once, for example, from within the classifier
or action command. tc will set up all eBPF map file descriptors at the
time when the object file is first parsed.

When a shell has been spawned, the environment will have a couple of
eBPF related variables. BPF_NUM_MAPS provides the total number of maps
that have been transferred over the Unix domain socket. BPF_MAP<X>'s
value is the file descriptor number that can be accessed in eBPF agent
applications, in other words, it can directly be used as the file
descriptor value for the
.B bpf(2)
system call to retrieve or alter eBPF map values. <X> denotes the
identifier of the eBPF map. It corresponds to the
.B id
member of
.B struct bpf_elf_map
\& from the tc eBPF map specification.

The environment in this example looks as follows:

.in +4n
.nf
.sp
sh# env | grep BPF
    BPF_NUM_MAPS=3
    BPF_MAP1=6
    BPF_MAP0=5
    BPF_MAP2=7
sh# ls -la /proc/self/fd
    [...]
    lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
    lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
    lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
sh# my_bpf_agent
.fi
.in

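An agent started from such a shell could pick up its descriptors roughly
as follows. This is a minimal sketch that assumes only the BPF_MAP<X>
naming shown above; the function name bpf_map_fd_from_env is hypothetical,
and the returned descriptor would then be handed on to
.B bpf(2)
for reading or updating map values:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Look up the file descriptor exported for the eBPF map with the given
 * id by reading the BPF_MAP<id> environment variable.
 * Returns the descriptor, or -1 if the variable is unset or malformed.
 */
static int bpf_map_fd_from_env(unsigned int id)
{
    char name[32];
    const char *val;
    char *end;
    long fd;
    int len;

    /* "BPF_MAP" plus at most ten digits always fits into name[]. */
    len = snprintf(name, sizeof(name), "BPF_MAP%u", id);
    assert(len > 0 && (size_t)len < sizeof(name));

    val = getenv(name);
    if (val == NULL || *val == 0)
        return -1;

    /* Accept only a plain, non-negative decimal number. */
    fd = strtol(val, &end, 10);
    if (*end != 0 || fd < 0 || fd > INT_MAX)
        return -1;

    return (int)fd;
}
```
.fi
.in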
eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and, based on that feedback, for
example, rewrite classids in eBPF map values during runtime. Given that eBPF
agents are implemented as normal applications, they can also dynamically
receive traffic control policies from external controllers and push
them down into eBPF maps to dynamically adapt to network conditions. Moreover,
eBPF maps can also be shared with other eBPF program types (e.g. tracing),
so very powerful combinations can be implemented.

.SS eBPF PROGRAMMING

eBPF classifiers and actions are implemented in restricted C syntax
(in the future, additional language frontends could be supported).

The header file
.B linux/bpf.h
provides eBPF helper functions that can be called from an eBPF program.
This man page only provides two minimal, stand-alone examples. Have a
look at
.B examples/bpf
from the iproute2 source package for a fully fledged flow dissector
example that better demonstrates some of the possibilities with eBPF.

Supported 32 bit classifier return codes from the C program and their meanings:
.in +4n
.B 0
, denotes a mismatch
.br
.B -1
, denotes the default classid configured from the command line
.br
.B else
, everything else will override the default classid to provide a facility for
non-linear matching
.in

Supported 32 bit action return codes from the C program and their meanings (
.B linux/pkt_cls.h
):
.in +4n
.B TC_ACT_OK (0)
, will terminate the packet processing pipeline and allow the packet to
proceed
.br
.B TC_ACT_SHOT (2)
, will terminate the packet processing pipeline and drop the packet
.br
.B TC_ACT_UNSPEC (-1)
, will use the default action configured from tc (similar to returning
.B -1
from a classifier)
.br
.B TC_ACT_PIPE (3)
, will iterate to the next action, if available
.br
.B TC_ACT_RECLASSIFY (1)
, will terminate the packet processing pipeline and start classification
from the beginning
.br
.B else
, everything else is an unspecified return code
.in

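As an illustration of these return codes, a minimal action could look like
the following sketch. The mark value 0xdead and the name act_main are
arbitrary choices for this example and not part of any existing program.
Since the code resides in the default "action" section, no section name
should need to be given when loading it:
.in +4n
.nf
.sp
```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

/* Hypothetical mark value, chosen for this sketch only. */
#define DROP_MARK 0xdead

__section("action") int act_main(struct __sk_buff *skb)
{
    /* Drop packets that an earlier stage tagged with DROP_MARK. */
    if (skb->mark == DROP_MARK)
        return TC_ACT_SHOT;

    /* Otherwise hand the packet on to the next action, if any. */
    return TC_ACT_PIPE;
}

char __license[] __section("license") = "GPL";
```
.fi
.in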
Both classifier and action return codes are supported in eBPF and cBPF
programs.

To demonstrate restricted C syntax, a minimal toy classifier example is
provided, which assumes that egress packets, for instance originating
from a container, have previously been marked in the interval [0, 255]. The
program keeps statistics on different marks for user space and maps the
classid to the root qdisc with the marking itself as the minor handle:

.in +4n
.nf
.sp
#include <stdint.h>
#include <asm/types.h>

#include <linux/bpf.h>
#include <linux/pkt_sched.h>

#include "helpers.h"

struct tuple {
    long packets;
    long bytes;
};

#define BPF_MAP_ID_STATS        1 /* agent's map identifier */
#define BPF_MAX_MARK            256

struct bpf_elf_map __section("maps") map_stats = {
    .type           =       BPF_MAP_TYPE_ARRAY,
    .id             =       BPF_MAP_ID_STATS,
    .size_key       =       sizeof(uint32_t),
    .size_value     =       sizeof(struct tuple),
    .max_elem       =       BPF_MAX_MARK,
};

static inline void cls_update_stats(const struct __sk_buff *skb,
                                    uint32_t mark)
{
    struct tuple *tu;

    tu = bpf_map_lookup_elem(&map_stats, &mark);
    if (likely(tu)) {
        __sync_fetch_and_add(&tu->packets, 1);
        __sync_fetch_and_add(&tu->bytes, skb->len);
    }
}

__section("cls") int cls_main(struct __sk_buff *skb)
{
    uint32_t mark = skb->mark;

    if (unlikely(mark >= BPF_MAX_MARK))
        return 0;

    cls_update_stats(skb, mark);

    return TC_H_MAKE(TC_H_ROOT, mark);
}

char __license[] __section("license") = "GPL";
.fi
.in

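The classid returned by such a program is a 32 bit handle whose upper 16
bits carry the major and whose lower 16 bits carry the minor number, so
that, for example, 1:1 corresponds to 0x00010001. The following user space
sketch mirrors what TC_H_MAKE(TC_H_ROOT, mark) evaluates to. The constants
are local copies of the masks from
.B linux/pkt_sched.h
and the function names are hypothetical:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the handle layout, for illustration only. */
#define H_MAJ_MASK  0xFFFF0000U
#define H_MIN_MASK  0x0000FFFFU
#define H_ROOT      0xFFFFFFFFU

/* Combine the major part of maj with the minor part of min. */
static uint32_t make_handle(uint32_t maj, uint32_t min)
{
    return (maj & H_MAJ_MASK) | (min & H_MIN_MASK);
}

/* Classid chosen by the toy classifier for a mark in [0, 255]. */
static uint32_t classid_for_mark(uint32_t mark)
{
    assert(mark < 256);
    return make_handle(H_ROOT, mark);
}
```
.fi
.in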
Another small example is a port redirector that demuxes destination port
80 into the interval [8080, 8087], steered by RSS, and that can be attached
to an ingress qdisc. Adding the egress counterpart and IPv6
support is left as an exercise to the reader:

.in +4n
.nf
.sp
#include <asm/types.h>
#include <asm/byteorder.h>

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#include "helpers.h"

/* Rewrite the TCP destination port and fix up the checksum. */
static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                 __u16 old_port, __u16 new_port)
{
    bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                        old_port, new_port, sizeof(new_port));
    bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                        &new_port, sizeof(new_port), 0);
}

static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
{
    __u16 dport, dport_new = 8080, off;
    __u8 ip_proto, ip_vl;

    ip_proto = load_byte(skb, nh_off +
                         offsetof(struct iphdr, protocol));
    if (ip_proto != IPPROTO_TCP)
        return 0;

    /* Skip the IPv4 header; 0x45 is the common case without options. */
    ip_vl = load_byte(skb, nh_off);
    if (likely(ip_vl == 0x45))
        nh_off += sizeof(struct iphdr);
    else
        nh_off += (ip_vl & 0xF) << 2;

    dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
    if (dport != 80)
        return 0;

    /* Pick one of eight target ports based on the receive queue. */
    off = skb->queue_mapping & 7;
    set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                  __cpu_to_be16(dport_new + off));
    return -1;
}

__section("lb") int lb_main(struct __sk_buff *skb)
{
    int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

    if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
        ret = lb_do_ipv4(skb, nh_off);

    return ret;
}

char __license[] __section("license") = "GPL";
.fi
.in

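The two pieces of arithmetic that the redirector relies on can be looked
at in isolation. The following user space sketch, with hypothetical
function and macro names, shows how the receive queue index selects one of
the eight target ports and how the IPv4 header length is derived from the
first header byte:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdint.h>

#define LB_BASE_PORT  8080u
#define LB_NUM_PORTS  8u    /* must be a power of two for the mask below */

/* Map a receive queue index onto a port in [8080, 8087]. */
static uint16_t lb_pick_port(uint32_t queue_mapping)
{
    return (uint16_t)(LB_BASE_PORT + (queue_mapping & (LB_NUM_PORTS - 1)));
}

/*
 * IPv4 header length in bytes, taken from the first header byte that
 * holds the version in the upper and the IHL in the lower four bits.
 */
static unsigned int ipv4_hdr_len(uint8_t ver_ihl)
{
    assert((ver_ihl >> 4) == 4);
    return (unsigned int)(ver_ihl & 0xF) << 2;
}
```
.fi
.in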
The related helper header file
.B helpers.h
used in both examples is:

.in +4n
.nf
.sp
/* Misc helper macros. */
#define __section(x) __attribute__((section(x), used))
#define offsetof(x, y) __builtin_offsetof(x, y)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Used map structure */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 id;
};

/* Some used BPF function calls. */
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                  int len, int flags) =
      (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                  int to, int flags) =
      (void *) BPF_FUNC_l4_csum_replace;
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
      (void *) BPF_FUNC_map_lookup_elem;

/* Some used BPF intrinsics. */
unsigned long long load_byte(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.byte");
unsigned long long load_half(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.half");
.fi
.in

As a best practice, we recommend having only a single eBPF classifier loaded
in tc and performing
.B all
necessary matching and mangling from there, instead of using a list of
individual classifiers and separate actions. A single classifier tailored
for a given use case will be most efficient to run.

.SS eBPF DEBUGGING

Both tc
.B filter
and
.B action
commands for
.B bpf
support an optional
.B verbose
parameter that can be used to inspect the eBPF verifier log. It is dumped
by default in case of an error.

In case the eBPF/cBPF JIT compiler has been enabled, it can also be
instructed to emit a debug output of the resulting opcode image into
the kernel log, which can be read via
.B dmesg(1)
:

.in +4n
.B echo 2 > /proc/sys/net/core/bpf_jit_enable
.in

The Linux kernel source tree additionally ships under
.B tools/net/
a small helper called
.B bpf_jit_disasm
that reads out the opcode image dump from the kernel log and dumps the
resulting disassembly:

.in +4n
.B bpf_jit_disasm -o
.in

Other than that, the Linux kernel also contains an extensive eBPF/cBPF
test suite module called
.B test_bpf
\&. Upon ...

.in +4n
.B modprobe test_bpf
.in

\&... it performs a variety of test cases and dumps the results into
the kernel log that can be inspected with
.B dmesg(1)
\&. The results can differ depending on whether the JIT compiler is enabled
or not. In case of failed test cases, the module will fail to load. In
such cases, we urge you to file a bug report to the related JIT authors,
Linux kernel and networking mailing lists.

.SS cBPF

Although we generally recommend switching to implementing
.B eBPF
classifiers and actions, for the sake of completeness, a few words on how to
program in cBPF are given here.

Likewise, the
.B bpf_jit_enable
switch can be enabled as mentioned already. Tooling such as
.B bpf_jit_disasm
works independently of whether eBPF or cBPF code is being loaded.

Unlike in eBPF, classifiers and actions are not implemented in restricted C,
but rather in a minimal assembler-like language or with the help of other
tooling.

The raw interface with tc takes opcodes directly. For example, the most
minimal classifier matching on every packet resulting in the default
classid of 1:1 looks like:

.in +4n
.B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1
.in

The first decimal of the bytecode sequence denotes the number of subsequent
4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. Here, this denotes an unconditional return
from the program with an immediate value of -1.

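To make the encoding more tangible, the following user space C sketch
parses such a bytecode string back into 4-tuples and checks whether an
instruction is an unconditional return of a constant. The names cbpf_insn,
cbpf_parse and cbpf_is_ret_k are hypothetical; the opcode bits tested are
those of BPF_RET (0x06) and BPF_K (0x00) as found in the kernel's BPF
headers:
.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one "c t f k" 4-tuple. */
struct cbpf_insn {
    uint16_t c;    /* cBPF opcode */
    uint8_t  t;    /* jump true offset */
    uint8_t  f;    /* jump false offset */
    uint32_t k;    /* immediate constant/literal */
};

/*
 * Parse "s,c t f k,c t f k,..." (a trailing comma is tolerated).
 * Returns the number of instructions stored in out, or -1 on malformed
 * input or if more than max instructions are announced.
 */
static int cbpf_parse(const char *s, struct cbpf_insn *out, unsigned int max)
{
    unsigned int n, c, t, f, k, i;
    int pos = 0;

    assert(s != NULL && out != NULL);

    /* pos stays 0 if the comma after the count is missing. */
    if (sscanf(s, "%u,%n", &n, &pos) != 1 || pos == 0 || n > max)
        return -1;
    s += pos;

    for (i = 0; i < n; i++) {
        pos = 0;
        if (sscanf(s, "%u %u %u %u%n", &c, &t, &f, &k, &pos) != 4)
            return -1;
        if (c > UINT16_MAX || t > UINT8_MAX || f > UINT8_MAX)
            return -1;

        out[i].c = (uint16_t)c;
        out[i].t = (uint8_t)t;
        out[i].f = (uint8_t)f;
        out[i].k = (uint32_t)k;

        s += pos;
        if (*s == ',')
            s++;
    }

    return (int)n;
}

/* True if the opcode is a BPF_RET (0x06) returning a BPF_K (0x00) constant. */
static bool cbpf_is_ret_k(const struct cbpf_insn *insn)
{
    return (insn->c & 0x07) == 0x06 && (insn->c & 0x18) == 0x00;
}
```
.fi
.in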
For egress classification, Willem de Bruijn implemented a minimal stand-alone
helper tool under the GNU General Public License version 2 for the
.B iptables(8)
BPF extension, which abuses the
.B libpcap
internal classic BPF compiler. His code is adapted here for usage with
.B tc(8)
:

.in +4n
.nf
.sp
#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct bpf_program prog;
    struct bpf_insn *ins;
    int i, ret, dlt = DLT_RAW;

    if (argc < 2 || argc > 3)
        return 1;
    if (argc == 3) {
        dlt = pcap_datalink_name_to_val(argv[1]);
        if (dlt == -1)
            return 1;
    }

    ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                              1, PCAP_NETMASK_UNKNOWN);
    if (ret)
        return 1;

    /* Number of instructions, followed by one 4-tuple per instruction. */
    printf("%u,", prog.bf_len);
    ins = prog.bf_insns;

    for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
        printf("%u %u %u %u,", ins->code,
               ins->jt, ins->jf, ins->k);
    printf("%u %u %u %u",
           ins->code, ins->jt, ins->jf, ins->k);

    pcap_freecode(&prog);
    return 0;
}
.fi
.in

Given this small helper, built here under the name bpftool, any
.B tcpdump(8)
filter expression can be abused as a classifier where a match will
result in the default classid:

.in +4n
.B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

Basically, such a minimal generator is equivalent to:

.in +4n
.B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\\\n' ',' > /var/bpf/tcp-syn
.in

Since
.B libpcap
does not support all Linux-specific cBPF extensions in its compiler, the
Linux kernel also ships under
.B tools/net/
a minimal BPF assembler called
.B bpf_asm
for providing full control. For detailed syntax and semantics on implementing
such programs by hand, see references under
.B FURTHER READING
\&.

A trivial toy example in
.B bpf_asm
for classifying IPv4/TCP packets, saved in a text file called
.B foobar
:

.in +4n
.nf
.sp
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0
.fi
.in

Similarly, such a classifier can be loaded as:

.in +4n
.B bpf_asm foobar > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

For BPF classifiers, the Linux kernel additionally provides under
.B tools/net/
a small BPF debugger called
.B bpf_dbg
, which can be used to test a classifier against pcap files, single-step
or add various breakpoints into the classifier program and dump register
contents during runtime.

Implementing an action in classic BPF is rather limited in the sense that
packet mangling is not supported. Therefore, it's generally recommended to
make the switch to eBPF, whenever possible.

.SH FURTHER READING
Further and more technical details about the BPF architecture can be found
in the Linux kernel source tree under
.B Documentation/networking/filter.txt
\&.

Further details on eBPF
.B tc(8)
examples can be found in the iproute2 source
tree under
.B examples/bpf/
\&.

.SH SEE ALSO
.BR tc (8),
.BR tc-ematch (8),
.BR bpf (2),
.BR bpf (4)

.SH AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel networking
mailing list:
.B <netdev@vger.kernel.org>