| .TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux" |
| .SH NAME |
| BPF \- BPF programmable classifier and actions for ingress/egress |
| queueing disciplines |
| .SH SYNOPSIS |
| .SS eBPF classifier (filter) or action: |
| .B tc filter ... bpf |
| [ |
| .B object-file |
| OBJ_FILE ] [ |
| .B section |
| CLS_NAME ] [ |
| .B export |
| UDS_FILE ] [ |
| .B verbose |
| ] [ |
| .B skip_hw |
| | |
| .B skip_sw |
| ] [ |
| .B police |
| POLICE_SPEC ] [ |
| .B action |
| ACTION_SPEC ] [ |
| .B classid |
| CLASSID ] |
| .br |
| .B tc action ... bpf |
| [ |
| .B object-file |
| OBJ_FILE ] [ |
| .B section |
| CLS_NAME ] [ |
| .B export |
| UDS_FILE ] [ |
| .B verbose |
| ] |
| |
| .SS cBPF classifier (filter) or action: |
| .B tc filter ... bpf |
| [ |
| .B bytecode-file |
| BPF_FILE | |
| .B bytecode |
| BPF_BYTECODE ] [ |
| .B police |
| POLICE_SPEC ] [ |
| .B action |
| ACTION_SPEC ] [ |
| .B classid |
| CLASSID ] |
| .br |
| .B tc action ... bpf |
| [ |
| .B bytecode-file |
| BPF_FILE | |
| .B bytecode |
| BPF_BYTECODE ] |
| |
| .SH DESCRIPTION |
| |
| Extended Berkeley Packet Filter ( |
| .B eBPF |
| ) and classic Berkeley Packet Filter |
| (originally known as BPF, for better distinction referred to as |
| .B cBPF |
| here) are both available as a fully programmable and highly efficient |
| classifier and actions. They both offer a minimal instruction set for |
| implementing small programs which can safely be loaded into the kernel |
| and thus executed in a tiny virtual machine from kernel space. An in-kernel |
| verifier guarantees that a specified program always terminates and neither |
| crashes nor leaks data from the kernel. |
| |
| In Linux, eBPF is generally considered the successor of cBPF. |
| The kernel internally transforms cBPF expressions into eBPF expressions and |
| executes the latter. They can either be run in an interpreter or, at setup |
| time, be just-in-time compiled (JIT'ed) to run as native machine code. |
| Currently, x86_64, ARM64 and s390 architectures have eBPF JIT support, |
| whereas PPC, SPARC, ARM and MIPS have a cBPF JIT, but have not (yet) |
| switched to eBPF JIT support. |
| |
| eBPF's instruction set follows similar underlying principles as the cBPF |
| instruction set. It is, however, modelled closer to the underlying |
| architecture to better mimic native instruction sets, with the aim of |
| achieving better run-time performance. It is designed to be JIT'ed with |
| a one to one mapping, which can also open up the possibility for compilers |
| to generate optimized eBPF code through an eBPF backend that performs |
| almost as fast as natively compiled code. Given that LLVM provides such |
| an eBPF backend, eBPF programs can therefore easily be programmed in a |
| subset of the C language. Other than that, eBPF infrastructure also comes |
| with a construct called "maps". eBPF maps are key/value stores that are |
| shared between multiple eBPF programs, but also between eBPF programs and |
| user space applications. |
| |
| For the traffic control subsystem, classifiers and actions that can be |
| attached to ingress and egress qdiscs can be written in eBPF or cBPF. The |
| advantage over other classifiers and actions is that eBPF/cBPF provides the |
| generic framework, while users can implement their highly specialized use |
| cases efficiently. This means that a classifier or action written that |
| way will not suffer from feature bloat, and can therefore execute its task |
| highly efficiently. It allows for non-linear classification and even merging |
| the action part into the classification. Combined with efficient eBPF map |
| data structures, user space can push new policies like classids into the |
| kernel without reloading a classifier, or it can gather statistics that |
| are pushed into one map and use another one for dynamically load balancing |
| traffic based on the determined load, just to provide a few examples. |
| |
| .SH PARAMETERS |
| .SS object-file |
| points to an object file that has an executable and linkable format (ELF) |
| and contains eBPF opcodes and eBPF map definitions. The LLVM compiler |
| infrastructure with |
| .B clang(1) |
| as a C language front end is one project that supports emitting eBPF object |
| files that can be passed to the eBPF classifier (more details in the |
| .B EXAMPLES |
| section). This option is mandatory when an eBPF classifier or action is |
| to be loaded. |
| |
| .SS section |
| is the name of the ELF section from the object file, where the eBPF |
| classifier or action resides. By default the section name for the |
| classifier is called "classifier", and for the action "action". Given |
| that a single object file can contain multiple classifier and actions, |
| the corresponding section name needs to be specified, if it differs |
| from the defaults. |
| |
| .SS export |
| points to a Unix domain socket file. In case the eBPF object file also |
| contains a section named "maps" with eBPF map specifications, then the |
| map file descriptors can be handed off via the Unix domain socket to |
| an eBPF "agent" herding all descriptors after tc lifetime. This can be |
| some third party application implementing the IPC counterpart for the |
| import, that uses them for calling into |
| .B bpf(2) |
| system call to read out or update eBPF map data from user space, for |
| example, for monitoring purposes or to push down new policies. |
| |
| .SS verbose |
| if set, the eBPF verifier output is dumped even if loading the eBPF |
| program was successful. By default, the verifier log is emitted only on |
| error. |
| |
| .SS skip_hw | skip_sw |
| hardware offload control flags. By default TC will try to offload |
| filters to hardware if possible. |
| .B skip_hw |
| explicitly disables the attempt to offload. |
| .B skip_sw |
| forces the offload and disables running the eBPF program in the kernel. |
| If hardware offload is not possible and this flag is set, the kernel will |
| report an error and the filter will not be installed at all. |
| |
| .SS police |
| is an optional parameter for an eBPF/cBPF classifier that specifies a |
| police in |
| .B tc(8) |
| which is attached to the classifier, for example, on an ingress qdisc. |
| |
| .SS action |
| is an optional parameter for an eBPF/cBPF classifier that specifies a |
| subsequent action in |
| .B tc(8) |
| which is attached to a classifier. |
| |
| .SS classid | flowid |
| provides the default traffic control class identifier for this eBPF/cBPF |
| classifier. The default class identifier can also be overridden by the |
| return code of the eBPF/cBPF program. A return code of |
| .B -1 |
| selects the default class identifier provided here. A return code of 0 |
| means that no match took place, and any other return code overrides the |
| default classid. This |
| allows for efficient, non-linear classification with only a single eBPF/cBPF |
| program as opposed to having multiple individual programs for various class |
| identifiers which would need to reparse packet contents. |
| |
| .SS bytecode |
| is used only for loading cBPF classifiers and actions. The cBPF bytecode |
| is directly passed as a text string in the form of |
| .B \'s,c t f k,c t f k,c t f k,...\' |
| , where |
| .B s |
| denotes the number of subsequent 4-tuples. One such 4-tuple consists of |
| .B c t f k |
| decimals, where |
| .B c |
| represents the cBPF opcode, |
| .B t |
| the jump true offset target, |
| .B f |
| the jump false offset target and |
| .B k |
| the immediate constant/literal. There are various tools that generate code |
| in this loadable format, for example, |
| .B bpf_asm |
| that ships with the Linux kernel source tree under |
| .B tools/net/ |
| , so there is no need to write this by hand. The |
| .B bytecode |
| or |
| .B bytecode-file |
| option is mandatory when a cBPF classifier or action is to be loaded. |
| |
| .SS bytecode-file |
| is also used to load a cBPF classifier or action. It's effectively the |
| same as |
| .B bytecode |
| only that the cBPF bytecode is not passed directly via command line, but |
| rather resides in a text file. |
| |
| .SH EXAMPLES |
| .SS eBPF TOOLING |
| A full blown example including eBPF agent code can be found inside the |
| iproute2 source package under: |
| .B examples/bpf/ |
| |
| As prerequisites, the kernel needs to have the eBPF system call, namely |
| .B bpf(2) |
| , enabled and needs to ship with the |
| .B cls_bpf |
| and |
| .B act_bpf |
| kernel modules for the traffic control subsystem. To enable eBPF/cBPF JIT |
| support, depending on which of the two the given architecture supports: |
| |
| .in +4n |
| .B echo 1 > /proc/sys/net/core/bpf_jit_enable |
| .in |
| |
| A given restricted C file can be compiled via LLVM as: |
| |
| .in +4n |
| .B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o |
| .in |
| |
| The compiler invocation might still simplify in future, so for now, |
| it's quite handy to alias this construct in one way or another, for |
| example: |
| .in +4n |
| .nf |
| .sp |
| __bcc() { |
|     clang -O2 -emit-llvm -c "$1" -o - | \\ |
|     llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o" |
| } |
| |
| alias bcc=__bcc |
| .fi |
| .in |
| |
| A minimal, stand-alone unit, which matches on all traffic with the |
| default classid (return code of -1) looks like: |
| |
| .in +4n |
| .nf |
| .sp |
| #include <linux/bpf.h> |
| |
| #ifndef __section |
| # define __section(x) __attribute__((section(x), used)) |
| #endif |
| |
| __section("classifier") int cls_main(struct __sk_buff *skb) |
| { |
|     return -1; |
| } |
| |
| char __license[] __section("license") = "GPL"; |
| .fi |
| .in |
| |
| More examples can be found further below in subsection |
| .B eBPF PROGRAMMING |
| as the focus here is on tooling. |
| |
| There can be various other sections, for example, also for actions. |
| Thus, an eBPF object file can contain multiple entry points. |
| However, a specific entry point must always be specified when |
| configuring with tc. A license must be part of the restricted C code |
| and the license string syntax is the same as with Linux kernel modules. |
| The kernel reserves the right to restrict some eBPF helper functions |
| to GPL compatible licenses only, and thus may reject a program |
| from loading into the kernel when such a license mismatch occurs. |
| |
| The resulting object file from the compilation can be inspected with |
| the usual set of tools that also operate on normal object files, for |
| example |
| .B objdump(1) |
| for inspecting ELF section headers: |
| |
| .in +4n |
| .nf |
| .sp |
| objdump -h bpf.o |
| [...] |
| 3 classifier 000007f8 0000000000000000 0000000000000000 00000040 2**3 |
| CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 4 action-mark 00000088 0000000000000000 0000000000000000 00000838 2**3 |
| CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 5 action-rand 00000098 0000000000000000 0000000000000000 000008c0 2**3 |
| CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE |
| 6 maps 00000030 0000000000000000 0000000000000000 00000958 2**2 |
| CONTENTS, ALLOC, LOAD, DATA |
| 7 license 00000004 0000000000000000 0000000000000000 00000988 2**0 |
| CONTENTS, ALLOC, LOAD, DATA |
| [...] |
| .fi |
| .in |
| |
| Adding an eBPF classifier from an object file that contains a classifier |
| in the default ELF section is trivial (note that instead of "object-file" |
| also shortcuts such as "obj" can be used): |
| |
| .in +4n |
| .B bcc bpf.c |
| .br |
| .B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 |
| .in |
| |
| In case the classifier resides in ELF section "mycls", then that same |
| command needs to be invoked as: |
| |
| .in +4n |
| .B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1 |
| .in |
| |
| Dumping the classifier configuration will tell the location of the |
| classifier, in other words that it's from object file "bpf.o" under |
| section "mycls": |
| |
| .in +4n |
| .B tc filter show dev em1 |
| .br |
| .B filter parent 1: protocol all pref 49152 bpf |
| .br |
| .B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls] |
| .in |
| |
| The same program can also be installed on ingress qdisc side as opposed |
| to egress ... |
| |
| .in +4n |
| .B tc qdisc add dev em1 handle ffff: ingress |
| .br |
| .B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1 |
| .in |
| |
| \&... and again dumped from there: |
| |
| .in +4n |
| .B tc filter show dev em1 parent ffff: |
| .br |
| .B filter protocol all pref 49152 bpf |
| .br |
| .B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls] |
| .in |
| |
| Attaching a classifier and action on ingress has the restriction that |
| it doesn't have an actual underlying queueing discipline. What ingress |
| can do is to classify, mangle, redirect or drop packets. When queueing |
| is required on ingress side, then ingress must redirect packets to the |
| .B ifb |
| device, otherwise policing can be used. Moreover, ingress can be used to |
| have an early drop point of unwanted packets before they hit upper layers |
| of the networking stack, perform network accounting with eBPF maps that |
| could be shared with egress, or have an early mangle and/or redirection |
| point to different networking devices. |
| |
| Multiple eBPF actions and classifier can be placed into a single |
| object file within various sections. In that case, non-default section |
| names must be provided, which is the case for both actions in this |
| example: |
| |
| .in +4n |
| .B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e |
| .br |
| .in +25n |
| .B action bpf obj bpf.o sec action-mark \e |
| .br |
| .B action bpf obj bpf.o sec action-rand ok |
| .in -25n |
| .in -4n |
| |
| The advantage of this is that the classifier and the two actions can |
| then share eBPF maps with each other, if implemented in the programs. |
| |
| In order to access eBPF maps from user space beyond |
| .B tc(8) |
| setup lifetime, the ownership can be transferred to an eBPF agent via |
| Unix domain sockets. There are two possibilities for implementing this: |
| |
| .B 1) |
| implement your own eBPF agent that takes care of setting up |
| the Unix domain socket and implementing the protocol that |
| .B tc(8) |
| dictates. A code example of this can be found inside the iproute2 |
| source package under: |
| .B examples/bpf/ |
| |
| .B 2) |
| use |
| .B tc exec |
| for transferring the eBPF map file descriptors through a Unix domain |
| socket, and spawning an application such as |
| .B sh(1) |
| \&. This approach's advantage is that tc will place the file descriptors |
| into the environment and thus make them available just like stdin, stdout, |
| stderr file descriptors, meaning, in case user applications run from within |
| this fd-owner shell, they can terminate and restart without losing eBPF |
| maps file descriptors. Example invocation with the previous classifier and |
| action mixture: |
| |
| .in +4n |
| .B tc exec bpf imp /tmp/bpf |
| .br |
| .B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e |
| .br |
| .in +25n |
| .B action bpf obj bpf.o sec action-mark \e |
| .br |
| .B action bpf obj bpf.o sec action-rand ok |
| .in -25n |
| .in -4n |
| |
| Assuming that eBPF maps are shared with classifier and actions, it's |
| enough to export them once, for example, from within the classifier |
| or action command. tc will set up all eBPF map file descriptors at the |
| time when the object file is first parsed. |
| |
| When a shell has been spawned, the environment will have a couple of |
| eBPF related variables. BPF_NUM_MAPS provides the total number of maps |
| that have been transferred over the Unix domain socket. BPF_MAP<X>'s |
| value is the file descriptor number that can be accessed in eBPF agent |
| applications, in other words, it can directly be used as the file |
| descriptor value for the |
| .B bpf(2) |
| system call to retrieve or alter eBPF map values. <X> denotes the |
| identifier of the eBPF map. It corresponds to the |
| .B id |
| member of |
| .B struct bpf_elf_map |
| \& from the tc eBPF map specification. |
| |
| The environment in this example looks as follows: |
| |
| .in +4n |
| .nf |
| .sp |
| sh# env | grep BPF |
| BPF_NUM_MAPS=3 |
| BPF_MAP1=6 |
| BPF_MAP0=5 |
| BPF_MAP2=7 |
| sh# ls -la /proc/self/fd |
| [...] |
| lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map |
| lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map |
| lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map |
| sh# my_bpf_agent |
| .fi |
| .in |
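
As a rough illustration of how an agent started from such a shell might pick
up a descriptor, the following user-space sketch reads the BPF_MAP<X>
variable described above. The function name bpf_map_fd_from_env is
hypothetical and not part of iproute2; only the environment variable naming
is taken from this page.

.in +4n
.nf
.sp
```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical agent helper: return the file descriptor that tc exported
 * for the map with the given id (environment variable BPF_MAP<id>),
 * or -1 if the variable is missing or malformed.
 */
int bpf_map_fd_from_env(unsigned int id)
{
    char name[32];
    const char *val;
    char *end;
    long fd;

    snprintf(name, sizeof(name), "BPF_MAP%u", id);
    val = getenv(name);
    if (!val || !*val)
        return -1;

    errno = 0;
    fd = strtol(val, &end, 10);
    /* Reject trailing garbage, negative values and out-of-range numbers. */
    if (errno || *end || fd < 0 || fd > 0x7fffffffL)
        return -1;

    return (int)fd;
}
```
.fi
.in

With the environment shown above, looking up id 1 would be expected to yield
descriptor 6, which could then be used as the map file descriptor for
.B bpf(2)
calls.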
| |
| eBPF agents are very useful in that they can prepopulate eBPF maps from |
| user space, monitor statistics via maps and based on that feedback, for |
| example, rewrite classids in eBPF map values during runtime. Given that eBPF |
| agents are implemented as normal applications, they can also dynamically |
| receive traffic control policies from external controllers and thus push |
| them down into eBPF maps to dynamically adapt to network conditions. Moreover, |
| eBPF maps can also be shared with other eBPF program types (e.g. tracing), |
| so very powerful combinations can be implemented. |
| |
| .SS eBPF PROGRAMMING |
| |
| eBPF classifiers and actions are implemented in restricted C syntax |
| (in the future, additional language frontends could be supported). |
| |
| The header file |
| .B linux/bpf.h |
| provides eBPF helper functions that can be called from an eBPF program. |
| This man page only provides two minimal, stand-alone examples. Have a |
| look at |
| .B examples/bpf |
| from the iproute2 source package for a fully fledged flow dissector |
| example to better demonstrate some of the possibilities with eBPF. |
| |
| Supported 32-bit classifier return codes from the C program and their meanings: |
| .in +4n |
| .B 0 |
| , denotes a mismatch |
| .br |
| .B -1 |
| , denotes the default classid configured from the command line |
| .br |
| .B else |
| , everything else will override the default classid to provide a facility for |
| non-linear matching |
| .in |
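
The classifier return code rules above can be sketched in plain user-space C
as follows. This is only an illustration of the decision logic, not kernel
code; the names struct cls_result and cls_resolve are made up for this
example.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/* Outcome of interpreting a classifier return code. */
struct cls_result {
    int      matched; /* 0 = no match, 1 = match */
    uint32_t classid; /* only meaningful if matched is 1 */
};

/*
 * Hypothetical helper mirroring the rules listed above:
 * 0 means mismatch, -1 selects the default classid configured on the
 * command line, and any other value overrides the default classid.
 */
struct cls_result cls_resolve(int32_t ret, uint32_t default_classid)
{
    struct cls_result res = { 0, 0 };

    if (ret == 0)
        return res;

    res.matched = 1;
    res.classid = (ret == -1) ? default_classid : (uint32_t)ret;
    return res;
}
```
.fi
.in

Here a classid such as 1:1 is assumed to be encoded as 0x00010001, that is,
the major handle in the upper and the minor handle in the lower 16 bits.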
| |
| Supported 32-bit action return codes from the C program and their meanings ( |
| .B linux/pkt_cls.h |
| ): |
| .in +4n |
| .B TC_ACT_OK (0) |
| , will terminate the packet processing pipeline and allow the packet to |
| proceed |
| .br |
| .B TC_ACT_SHOT (2) |
| , will terminate the packet processing pipeline and drop the packet |
| .br |
| .B TC_ACT_UNSPEC (-1) |
| , will use the default action configured from tc (similar to returning |
| .B -1 |
| from a classifier) |
| .br |
| .B TC_ACT_PIPE (3) |
| , will iterate to the next action, if available |
| .br |
| .B TC_ACT_RECLASSIFY (1) |
| , will terminate the packet processing pipeline and start classification |
| from the beginning |
| .br |
| .B else |
| , everything else is an unspecified return code |
| .in |
| |
| Both classifier and action return codes are supported in eBPF and cBPF |
| programs. |
| |
| To demonstrate restricted C syntax, a minimal toy classifier example is |
| provided, which assumes that egress packets, for instance originating |
| from a container, have previously been marked in interval [0, 255]. The |
| program keeps statistics on different marks for user space and maps the |
| classid to the root qdisc with the marking itself as the minor handle: |
| |
| .in +4n |
| .nf |
| .sp |
| #include <stdint.h> |
| #include <asm/types.h> |
| |
| #include <linux/bpf.h> |
| #include <linux/pkt_sched.h> |
| |
| #include "helpers.h" |
| |
| struct tuple { |
|     long packets; |
|     long bytes; |
| }; |
| |
| #define BPF_MAP_ID_STATS 1 /* agent's map identifier */ |
| #define BPF_MAX_MARK 256 |
| |
| struct bpf_elf_map __section("maps") map_stats = { |
|     .type = BPF_MAP_TYPE_ARRAY, |
|     .id = BPF_MAP_ID_STATS, |
|     .size_key = sizeof(uint32_t), |
|     .size_value = sizeof(struct tuple), |
|     .max_elem = BPF_MAX_MARK, |
| }; |
| |
| static inline void cls_update_stats(const struct __sk_buff *skb, |
|                                     uint32_t mark) |
| { |
|     struct tuple *tu; |
| |
|     tu = bpf_map_lookup_elem(&map_stats, &mark); |
|     if (likely(tu)) { |
|         __sync_fetch_and_add(&tu->packets, 1); |
|         __sync_fetch_and_add(&tu->bytes, skb->len); |
|     } |
| } |
| |
| __section("cls") int cls_main(struct __sk_buff *skb) |
| { |
|     uint32_t mark = skb->mark; |
| |
|     if (unlikely(mark >= BPF_MAX_MARK)) |
|         return 0; |
| |
|     cls_update_stats(skb, mark); |
| |
|     return TC_H_MAKE(TC_H_ROOT, mark); |
| } |
| |
| char __license[] __section("license") = "GPL"; |
| .fi |
| .in |
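
To make the returned classid more concrete, the following user-space sketch
reproduces the handle arithmetic with local copies of the masks behind
TC_H_MAKE and TC_H_ROOT from
.B linux/pkt_sched.h
\&. The function names ex_tc_h_make and ex_classid_for_mark are hypothetical
and only serve this illustration.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/*
 * Combine a major and a minor handle: the major part lives in the upper
 * 16 bits, the minor part in the lower 16 bits (same masks as TC_H_MAKE).
 */
static inline uint32_t ex_tc_h_make(uint32_t maj, uint32_t min)
{
    return (maj & 0xFFFF0000U) | (min & 0x0000FFFFU);
}

/*
 * Classid as computed by the toy classifier above: TC_H_ROOT (0xFFFFFFFF)
 * as major and the packet mark as minor handle.
 */
uint32_t ex_classid_for_mark(uint32_t mark)
{
    return ex_tc_h_make(0xFFFFFFFFU, mark);
}
```
.fi
.in

Under these definitions, a mark of 5 results in 0xFFFF0005, which tc would
display as classid ffff:5.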
| |
| Another small example is a port redirector which demuxes destination port |
| 80 into the interval [8080, 8087] steered by RSS, that can then be attached |
| to ingress qdisc. The exercise of adding the egress counterpart and IPv6 |
| support is left to the reader: |
| |
| .in +4n |
| .nf |
| .sp |
| #include <asm/types.h> |
| #include <asm/byteorder.h> |
| |
| #include <linux/bpf.h> |
| #include <linux/filter.h> |
| #include <linux/in.h> |
| #include <linux/if_ether.h> |
| #include <linux/ip.h> |
| #include <linux/tcp.h> |
| |
| #include "helpers.h" |
| |
| static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off, |
|                                  __u16 old_port, __u16 new_port) |
| { |
|     bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check), |
|                         old_port, new_port, sizeof(new_port)); |
|     bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest), |
|                         &new_port, sizeof(new_port), 0); |
| } |
| |
| static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off) |
| { |
|     __u16 dport, dport_new = 8080, off; |
|     __u8 ip_proto, ip_vl; |
| |
|     ip_proto = load_byte(skb, nh_off + |
|                          offsetof(struct iphdr, protocol)); |
|     if (ip_proto != IPPROTO_TCP) |
|         return 0; |
| |
|     ip_vl = load_byte(skb, nh_off); |
|     if (likely(ip_vl == 0x45)) |
|         nh_off += sizeof(struct iphdr); |
|     else |
|         nh_off += (ip_vl & 0xF) << 2; |
| |
|     dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest)); |
|     if (dport != 80) |
|         return 0; |
| |
|     off = skb->queue_mapping & 7; |
|     set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80), |
|                   __cpu_to_be16(dport_new + off)); |
|     return -1; |
| } |
| |
| __section("lb") int lb_main(struct __sk_buff *skb) |
| { |
|     int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN; |
| |
|     if (likely(skb->protocol == __constant_htons(ETH_P_IP))) |
|         ret = lb_do_ipv4(skb, nh_off); |
| |
|     return ret; |
| } |
| |
| char __license[] __section("license") = "GPL"; |
| .fi |
| .in |
| |
| The related helper header file |
| .B helpers.h |
| used in both examples is: |
| |
| .in +4n |
| .nf |
| .sp |
| /* Misc helper macros. */ |
| #define __section(x) __attribute__((section(x), used)) |
| #define offsetof(x, y) __builtin_offsetof(x, y) |
| #define likely(x) __builtin_expect(!!(x), 1) |
| #define unlikely(x) __builtin_expect(!!(x), 0) |
| |
| /* Used map structure */ |
| struct bpf_elf_map { |
|     __u32 type; |
|     __u32 size_key; |
|     __u32 size_value; |
|     __u32 max_elem; |
|     __u32 id; |
| }; |
| |
| /* Some used BPF function calls. */ |
| static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, |
|                                   int len, int flags) = |
|     (void *) BPF_FUNC_skb_store_bytes; |
| static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, |
|                                   int to, int flags) = |
|     (void *) BPF_FUNC_l4_csum_replace; |
| static void *(*bpf_map_lookup_elem)(void *map, void *key) = |
|     (void *) BPF_FUNC_map_lookup_elem; |
| |
| /* Some used BPF intrinsics. */ |
| unsigned long long load_byte(void *skb, unsigned long long off) |
|     asm ("llvm.bpf.load.byte"); |
| unsigned long long load_half(void *skb, unsigned long long off) |
|     asm ("llvm.bpf.load.half"); |
| .fi |
| .in |
| |
| As a best practice, we recommend having only a single eBPF classifier loaded |
| in tc and performing |
| .B all |
| necessary matching and mangling from there instead of using a list of |
| individual classifiers and separate actions. A single classifier tailored |
| for a given use-case will be most efficient to run. |
| |
| .SS eBPF DEBUGGING |
| |
| Both tc |
| .B filter |
| and |
| .B action |
| commands for |
| .B bpf |
| support an optional |
| .B verbose |
| parameter that can be used to inspect the eBPF verifier log. It is dumped |
| by default in case of an error. |
| |
| In case the eBPF/cBPF JIT compiler has been enabled, it can also be |
| instructed to emit a debug output of the resulting opcode image into |
| the kernel log, which can be read via |
| .B dmesg(1) |
| : |
| |
| .in +4n |
| .B echo 2 > /proc/sys/net/core/bpf_jit_enable |
| .in |
| |
| The Linux kernel source tree additionally ships under |
| .B tools/net/ |
| a small helper called |
| .B bpf_jit_disasm |
| that reads out the opcode image dump from the kernel log and dumps the |
| resulting disassembly: |
| |
| .in +4n |
| .B bpf_jit_disasm -o |
| .in |
| |
| Other than that, the Linux kernel also contains an extensive eBPF/cBPF |
| test suite module called |
| .B test_bpf |
| \&. Upon ... |
| |
| .in +4n |
| .B modprobe test_bpf |
| .in |
| |
| \&... it runs a variety of test cases and dumps the results into |
| the kernel log, which can be inspected with |
| .B dmesg(1) |
| \&. The results can differ depending on whether the JIT compiler is enabled |
| or not. In case of failed test cases, the module will fail to load. In |
| such cases, we urge you to file a bug report with the related JIT authors |
| and the Linux kernel and networking mailing lists. |
| |
| .SS cBPF |
| |
| Although we generally recommend switching to implementing |
| .B eBPF |
| classifiers and actions, for the sake of completeness, a few words on how to |
| program in cBPF are in order here. |
| |
| As with eBPF, the |
| .B bpf_jit_enable |
| switch can be enabled as described above. Tooling such as |
| .B bpf_jit_disasm |
| also works regardless of whether eBPF or cBPF code is being loaded. |
| |
| Unlike in eBPF, classifier and action are not implemented in restricted C, |
| but rather in a minimal assembler-like language or with the help of other |
| tooling. |
| |
| The raw interface with tc takes opcodes directly. For example, the most |
| minimal classifier matching on every packet resulting in the default |
| classid of 1:1 looks like: |
| |
| .in +4n |
| .B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1 |
| .in |
| |
| The first decimal of the bytecode sequence denotes the number of subsequent |
| 4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of |
| .B c t f k |
| decimals, where |
| .B c |
| represents the cBPF opcode, |
| .B t |
| the jump true offset target, |
| .B f |
| the jump false offset target and |
| .B k |
| the immediate constant/literal. Here, this denotes an unconditional return |
| from the program with immediate value of -1. |
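
As a hedged illustration of this text format, the following user-space
sketch serializes an array of cBPF instructions into the string expected by
the
.B bytecode
option. The struct cbpf_insn is a local stand-in with the same four fields
as a classic BPF instruction, and the function name cbpf_format is
hypothetical.

.in +4n
.nf
.sp
```c
#include <stdint.h>
#include <stdio.h>

/* Local stand-in for one classic BPF instruction (c, t, f, k). */
struct cbpf_insn {
    uint16_t code; /* c: cBPF opcode */
    uint8_t  jt;   /* t: jump true offset */
    uint8_t  jf;   /* f: jump false offset */
    uint32_t k;    /* k: immediate constant/literal */
};

/*
 * Hypothetical helper: write "s,c t f k,c t f k,..." into buf, where s is
 * the number of 4-tuples. Returns the string length, or -1 if buf is too
 * small.
 */
int cbpf_format(const struct cbpf_insn *insns, unsigned int count,
                char *buf, size_t len)
{
    size_t used;
    unsigned int i;
    int n;

    n = snprintf(buf, len, "%u", count);
    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    for (i = 0; i < count; i++) {
        n = snprintf(buf + used, len - used, ",%u %u %u %u",
                     (unsigned int)insns[i].code,
                     (unsigned int)insns[i].jt,
                     (unsigned int)insns[i].jf,
                     (unsigned int)insns[i].k);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }

    return (int)used;
}
```
.fi
.in

For the single instruction used above (opcode 6, both jump offsets 0 and an
immediate of 0xffffffff), this sketch produces the same 4-tuple as in the tc
command line, without the trailing comma.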
| |
| For egress classification, Willem de Bruijn implemented a minimal stand-alone |
| helper tool under the GNU General Public License version 2 for the |
| .B iptables(8) |
| BPF extension, which abuses the |
| .B libpcap |
| internal classic BPF compiler. His code is adapted here for usage with |
| .B tc(8) |
| : |
| |
| .in +4n |
| .nf |
| .sp |
| #include <pcap.h> |
| #include <stdio.h> |
| |
| int main(int argc, char **argv) |
| { |
|     struct bpf_program prog; |
|     struct bpf_insn *ins; |
|     int i, ret, dlt = DLT_RAW; |
| |
|     if (argc < 2 || argc > 3) |
|         return 1; |
|     if (argc == 3) { |
|         dlt = pcap_datalink_name_to_val(argv[1]); |
|         if (dlt == -1) |
|             return 1; |
|     } |
| |
|     ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1], |
|                               1, PCAP_NETMASK_UNKNOWN); |
|     if (ret) |
|         return 1; |
| |
|     printf("%d,", prog.bf_len); |
|     ins = prog.bf_insns; |
| |
|     for (i = 0; i < prog.bf_len - 1; ++ins, ++i) |
|         printf("%u %u %u %u,", ins->code, |
|                ins->jt, ins->jf, ins->k); |
|     printf("%u %u %u %u", |
|            ins->code, ins->jt, ins->jf, ins->k); |
| |
|     pcap_freecode(&prog); |
|     return 0; |
| } |
| .fi |
| .in |
| |
| Given this small helper, any |
| .B tcpdump(8) |
| filter expression can be abused as a classifier where a match will |
| result in the default classid: |
| |
| .in +4n |
| .B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn |
| .br |
| .B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1 |
| .in |
| |
| Basically, such a minimal generator is equivalent to: |
| |
| .in +4n |
| .B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\\\n' ',' > /var/bpf/tcp-syn |
| .in |
| |
| Since |
| .B libpcap |
| does not support all Linux-specific cBPF extensions in its compiler, the |
| Linux kernel also ships under |
| .B tools/net/ |
| a minimal BPF assembler called |
| .B bpf_asm |
| for providing full control. For detailed syntax and semantics on implementing |
| such programs by hand, see references under |
| .B FURTHER READING |
| \&. |
| |
| Trivial toy example in |
| .B bpf_asm |
| for classifying IPv4/TCP packets, saved in a text file called |
| .B foobar |
| : |
| |
| .in +4n |
| .nf |
| .sp |
|         ldh [12] |
|         jne #0x800, drop |
|         ldb [23] |
|         jneq #6, drop |
|         ret #-1 |
| drop:   ret #0 |
| .fi |
| .in |
| |
| Similarly, such a classifier can be loaded as: |
| |
| .in +4n |
| .B bpf_asm foobar > /var/bpf/tcp-syn |
| .br |
| .B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1 |
| .in |
| |
| For BPF classifiers, the Linux kernel additionally provides under |
| .B tools/net/ |
| a small BPF debugger called |
| .B bpf_dbg |
| , which can be used to test a classifier against pcap files, single-step |
| or add various breakpoints into the classifier program and dump register |
| contents during runtime. |
| |
| Implementing an action in classic BPF is rather limited in the sense that |
| packet mangling is not supported. Therefore, it's generally recommended to |
| make the switch to eBPF, whenever possible. |
| |
| .SH FURTHER READING |
| Further and more technical details about the BPF architecture can be found |
| in the Linux kernel source tree under |
| .B Documentation/networking/filter.txt |
| \&. |
| |
| Further details on eBPF |
| .B tc(8) |
| examples can be found in the iproute2 source |
| tree under |
| .B examples/bpf/ |
| \&. |
| |
| .SH SEE ALSO |
| .BR tc (8), |
| .BR tc-ematch (8), |
| .BR bpf (2), |
| .BR bpf (4) |
| |
| .SH AUTHORS |
| Manpage written by Daniel Borkmann. |
| |
| Please report corrections or improvements to the Linux kernel networking |
| mailing list: |
| .B <netdev@vger.kernel.org> |