.TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux"
.SH NAME
BPF \- BPF programmable classifier and actions for ingress/egress
queueing disciplines
.SH SYNOPSIS
.SS eBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
] [
.B skip_hw
|
.B skip_sw
] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
]

.SS cBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ]

.SH DESCRIPTION

Extended Berkeley Packet Filter (
.B eBPF
) and classic Berkeley Packet Filter
(originally known as BPF, for better distinction referred to as
.B cBPF
here) are both available as a fully programmable and highly efficient
classifier and actions. They both offer a minimal instruction set for
implementing small programs which can safely be loaded into the kernel
and thus executed in a tiny virtual machine from kernel space. An in-kernel
verifier guarantees that a specified program always terminates and neither
crashes nor leaks data from the kernel.

In Linux, eBPF is generally considered the successor of cBPF.
The kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. They can either be run in an interpreter or, at setup
time, be just-in-time compiled (JIT'ed) to run as native machine code.
Currently, x86_64, ARM64 and s390 architectures have eBPF JIT support,
whereas PPC, SPARC, ARM and MIPS have cBPF JIT support, but have not (yet)
switched to eBPF JIT support.

eBPF's instruction set follows similar underlying principles as the cBPF
instruction set; however, it is modelled more closely on the underlying
architecture to better mimic native instruction sets, with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
a one to one mapping, which also opens up the possibility for compilers
to generate optimized eBPF code through an eBPF backend that performs
almost as fast as natively compiled code. Given that LLVM provides such
an eBPF backend, eBPF programs can easily be written in a
subset of the C language. Other than that, the eBPF infrastructure also comes
with a construct called "maps". eBPF maps are key/value stores that are
shared between multiple eBPF programs, but also between eBPF programs and
user space applications.

For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
advantage over other classifiers and actions is that eBPF/cBPF provides the
generic framework, while users can implement their highly specialized use
cases efficiently. This means that a classifier or action written that
way will not suffer from feature bloat, and can therefore execute its task
highly efficiently. It allows for non-linear classification and even merging
the action part into the classification. Combined with efficient eBPF map
data structures, user space can push new policies like classids into the
kernel without reloading a classifier, or it can gather statistics that
are pushed into one map and use another one for dynamically load balancing
traffic based on the determined load, just to provide a few examples.

.SH PARAMETERS
.SS object-file
points to an object file that has an executable and linkable format (ELF)
and contains eBPF opcodes and eBPF map definitions. The LLVM compiler
infrastructure with
.B clang(1)
as a C language front end is one project that supports emitting eBPF object
files that can be passed to the eBPF classifier (more details in the
.B EXAMPLES
section). This option is mandatory when an eBPF classifier or action is
to be loaded.

.SS section
is the name of the ELF section from the object file, where the eBPF
classifier or action resides. By default the section name for the
classifier is called "classifier", and for the action "action". Given
that a single object file can contain multiple classifiers and actions,
the corresponding section name needs to be specified if it differs
from the defaults.

.SS export
points to a Unix domain socket file. In case the eBPF object file also
contains a section named "maps" with eBPF map specifications, then the
map file descriptors can be handed off via the Unix domain socket to
an eBPF "agent" herding all descriptors after tc lifetime. This can be
some third party application implementing the IPC counterpart for the
import, that uses them for calling into the
.B bpf(2)
system call to read out or update eBPF map data from user space, for
example, for monitoring purposes or to push down new policies.

.SS verbose
if set, it will dump the eBPF verifier output, even if loading the eBPF
program was successful. By default, the verifier log is emitted to the
user only on error.

.SS skip_hw | skip_sw
hardware offload control flags. By default TC will try to offload
filters to hardware if possible.
.B skip_hw
explicitly disables the attempt to offload.
.B skip_sw
forces the offload and disables running the eBPF program in the kernel.
If hardware offload is not possible and this flag was set, the kernel will
report an error and the filter will not be installed at all.

.SS police
is an optional parameter for an eBPF/cBPF classifier that specifies a
policer in
.B tc(8)
which is attached to the classifier, for example, on an ingress qdisc.

.SS action
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in
.B tc(8)
which is attached to a classifier.

.SS classid
.SS flowid
provides the default traffic control class identifier for this eBPF/cBPF
classifier. The default class identifier can also be overwritten by the
return code of the eBPF/cBPF program. A return code of
.B -1
selects the default class identifier provided here. A return
code of 0 implies that no match took place, and
a return code other than these two will override the default classid. This
allows for efficient, non-linear classification with only a single eBPF/cBPF
program as opposed to having multiple individual programs for various class
identifiers which would need to reparse packet contents.
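
The return code handling described above can be sketched in plain user
space C. This is only an illustration under stated assumptions, not the
kernel's implementation: the function name
.B resolve_classid
and both macros are hypothetical and exist only for this example.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/* Hypothetical names for the two special return codes. */
#define CLS_RET_MISMATCH 0u           /* program reported no match */
#define CLS_RET_DEFAULT  0xffffffffu  /* -1 seen as a 32 bit value */

/*
 * Returns 0 when the program reported a mismatch. Otherwise returns 1
 * and stores either the default classid (return code -1) or the
 * program's own return code in *classid.
 */
static int resolve_classid(uint32_t prog_ret, uint32_t default_classid,
                           uint32_t *classid)
{
    if (prog_ret == CLS_RET_MISMATCH)
        return 0;

    *classid = (prog_ret == CLS_RET_DEFAULT) ? default_classid : prog_ret;
    return 1;
}
```
.fi
.in

For instance, with a default classid of 1:1 (0x00010001), a return code
of -1 yields 1:1, whereas a return code of 0x00010002 selects 1:2.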

.SS bytecode
is used for loading cBPF classifiers and actions only. The cBPF bytecode
is directly passed as a text string in the form of
.B \'s,c t f k,c t f k,c t f k,...\'
, where
.B s
denotes the number of subsequent 4-tuples. One such 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. There are various tools that generate code
in this loadable format, for example,
.B bpf_asm
that ships with the Linux kernel source tree under
.B tools/net/
, so this string is not expected to be written by hand. The
.B bytecode
or
.B bytecode-file
option is mandatory when a cBPF classifier or action is to be loaded.
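
To make the string layout concrete, the following user space sketch
serializes an array of 4-tuples into the form described above. The
struct name
.B cbpf_tuple
and the function
.B cbpf_format
are hypothetical names chosen for this example; they are not part of tc
or the kernel.

.in +4n
.nf
.sp
```c
#include <stdint.h>
#include <stdio.h>

/* One cBPF instruction as a c t f k 4-tuple (hypothetical struct). */
struct cbpf_tuple {
    uint16_t c;  /* cBPF opcode */
    uint8_t  t;  /* jump true offset target */
    uint8_t  f;  /* jump false offset target */
    uint32_t k;  /* immediate constant/literal */
};

/*
 * Writes "s,c t f k,c t f k,..." into buf.
 * Returns 0 on success or -1 if buf is too small.
 */
static int cbpf_format(const struct cbpf_tuple *prog, unsigned int len,
                       char *buf, size_t size)
{
    size_t used;
    int n = snprintf(buf, size, "%u", len);

    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;

    for (unsigned int i = 0; i < len; i++) {
        n = snprintf(buf + used, size - used, ",%u %u %u %u",
                     (unsigned int)prog[i].c, (unsigned int)prog[i].t,
                     (unsigned int)prog[i].f, (unsigned int)prog[i].k);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```
.fi
.in

A single "return -1" instruction, that is the tuple 6 0 0 4294967295,
is formatted as the string 1,6 0 0 4294967295.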

.SS bytecode-file
is also used to load a cBPF classifier or action. It's effectively the
same as
.B bytecode
, except that the cBPF bytecode is not passed directly via command line, but
rather resides in a text file.

.SH EXAMPLES
.SS eBPF TOOLING
A full blown example including eBPF agent code can be found inside the
iproute2 source package under:
.B examples/bpf/

As prerequisites, the kernel needs to have the eBPF system call, namely
.B bpf(2)
, enabled and ship with the
.B cls_bpf
and
.B act_bpf
kernel modules for the traffic control subsystem. To enable cBPF/eBPF JIT
support, depending on which of the two the given architecture supports:

.in +4n
.B echo 1 > /proc/sys/net/core/bpf_jit_enable
.in

A given restricted C file can be compiled via LLVM as:

.in +4n
.B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o
.in

The compiler invocation might still simplify in future, so for now,
it's quite handy to alias this construct in one way or another, for
example:
.in +4n
.nf
.sp
__bcc() {
    clang -O2 -emit-llvm -c $1 -o - | \\
    llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}

alias bcc=__bcc
.fi
.in

A minimal, stand-alone unit, which matches on all traffic with the
default classid (return code of -1) looks like:

.in +4n
.nf
.sp
#include <linux/bpf.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

__section("classifier") int cls_main(struct __sk_buff *skb)
{
    return -1;
}

char __license[] __section("license") = "GPL";
.fi
.in

More examples can be found further below in subsection
.B eBPF PROGRAMMING
as focus here will be on tooling.

There can be various other sections, for example, also for actions.
Thus, an eBPF object file can contain multiple entry points.
However, a specific entry point must always be specified when
configuring with tc. A license must be part of the restricted C code
and the license string syntax is the same as with Linux kernel modules.
The kernel reserves the right to restrict some eBPF helper functions
to GPL compatible licenses only, and thus may reject a program
from loading into the kernel when such a license mismatch occurs.

The resulting object file from the compilation can be inspected with
the usual set of tools that also operate on normal object files, for
example
.B objdump(1)
for inspecting ELF section headers:

.in +4n
.nf
.sp
objdump -h bpf.o
[...]
  3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                  CONTENTS, ALLOC, LOAD, DATA
[...]
.fi
.in

Adding an eBPF classifier from an object file that contains a classifier
in the default ELF section is trivial (note that instead of "object-file"
also shortcuts such as "obj" can be used):

.in +4n
.B bcc bpf.c
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1
.in

In case the classifier resides in ELF section "mycls", then that same
command needs to be invoked as:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1
.in

Dumping the classifier configuration will tell the location of the
classifier, in other words that it's from object file "bpf.o" under
section "mycls":

.in +4n
.B tc filter show dev em1
.br
.B filter parent 1: protocol all pref 49152 bpf
.br
.B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]
.in

The same program can also be installed on ingress qdisc side as opposed
to egress ...

.in +4n
.B tc qdisc add dev em1 handle ffff: ingress
.br
.B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1
.in

\&... and again dumped from there:

.in +4n
.B tc filter show dev em1 parent ffff:
.br
.B filter protocol all pref 49152 bpf
.br
.B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]
.in

Attaching a classifier and action on ingress has the restriction that
it doesn't have an actual underlying queueing discipline. What ingress
can do is to classify, mangle, redirect or drop packets. When queueing
is required on the ingress side, then ingress must redirect packets to the
.B ifb
device, otherwise policing can be used. Moreover, ingress can be used to
have an early drop point of unwanted packets before they hit upper layers
of the networking stack, perform network accounting with eBPF maps that
could be shared with egress, or have an early mangle and/or redirection
point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e
.br
.in +25n
.B action bpf obj bpf.o sec action-mark \e
.br
.B action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

The advantage of this is that the classifier and the two actions can
then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond
.B tc(8)
setup lifetime, the ownership can be transferred to an eBPF agent via
Unix domain sockets. There are two possibilities for implementing this:

.B 1)
implement your own eBPF agent that takes care of setting up
the Unix domain socket and implementing the protocol that
.B tc(8)
dictates. A code example of this can be found inside the iproute2
source package under:
.B examples/bpf/

.B 2)
use
.B tc exec
for transferring the eBPF map file descriptors through a Unix domain
socket, and spawning an application such as
.B sh(1)
\&. This approach's advantage is that tc will place the file descriptors
into the environment and thus make them available just like the stdin, stdout
and stderr file descriptors. This means that user applications run from within
this fd-owner shell can terminate and restart without losing the eBPF
map file descriptors. Example invocation with the previous classifier and
action mixture:

.in +4n
.B tc exec bpf imp /tmp/bpf
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e
.br
.in +25n
.B action bpf obj bpf.o sec action-mark \e
.br
.B action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

Assuming that eBPF maps are shared with classifier and actions, it's
enough to export them once, for example, from within the classifier
or action command. tc will set up all eBPF map file descriptors at the
time when the object file is first parsed.

When a shell has been spawned, the environment will have a couple of
eBPF related variables. BPF_NUM_MAPS provides the total number of maps
that have been transferred over the Unix domain socket. BPF_MAP<X>'s
value is the file descriptor number that can be accessed in eBPF agent
applications, in other words, it can directly be used as the file
descriptor value for the
.B bpf(2)
system call to retrieve or alter eBPF map values. <X> denotes the
identifier of the eBPF map. It corresponds to the
.B id
member of
.B struct bpf_elf_map
\& from the tc eBPF map specification.

The environment in this example looks as follows:

.in +4n
.nf
.sp
sh# env | grep BPF
BPF_NUM_MAPS=3
BPF_MAP1=6
BPF_MAP0=5
BPF_MAP2=7
sh# ls -la /proc/self/fd
[...]
lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
sh# my_bpf_agent
.fi
.in
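
An agent started from such a shell could pick up its descriptors roughly
as follows. This is a minimal sketch that only relies on the BPF_MAP<X>
variables described above; the helper names
.B parse_fd
and
.B bpf_map_fd_from_env
are hypothetical and not part of iproute2. The returned descriptor would
then be passed on to the
.B bpf(2)
system call.

.in +4n
.nf
.sp
```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses a decimal descriptor number; returns -1 if unset or malformed. */
static int parse_fd(const char *val)
{
    char *end;
    long fd;

    if (!val || !*val)
        return -1;

    fd = strtol(val, &end, 10);
    if (*end || fd < 0 || fd > INT_MAX)
        return -1;

    return (int)fd;
}

/* Looks up BPF_MAP<id> in the environment set up by tc exec. */
static int bpf_map_fd_from_env(unsigned int id)
{
    char name[32];

    snprintf(name, sizeof(name), "BPF_MAP%u", id);
    return parse_fd(getenv(name));
}
```
.fi
.in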

eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and, based on that feedback, for
example, rewrite classids in eBPF map values during runtime. Given that eBPF
agents are implemented as normal applications, they can also dynamically
receive traffic control policies from external controllers and thus push
them down into eBPF maps to dynamically adapt to network conditions. Moreover,
eBPF maps can also be shared with other eBPF program types (e.g. tracing),
so very powerful combinations can be implemented.

.SS eBPF PROGRAMMING

eBPF classifiers and actions are implemented in restricted C syntax
(in future, additional language frontends could be supported).

The header file
.B linux/bpf.h
provides eBPF helper functions that can be called from an eBPF program.
This man page will only provide two minimal, stand-alone examples; have a
look at
.B examples/bpf
from the iproute2 source package for a fully fledged flow dissector
example to better demonstrate some of the possibilities with eBPF.

Supported 32 bit classifier return codes from the C program and their meanings:
.in +4n
.B 0
, denotes a mismatch
.br
.B -1
, denotes the default classid configured from the command line
.br
.B else
, everything else will override the default classid to provide a facility for
non-linear matching
.in

Supported 32 bit action return codes from the C program and their meanings (
.B linux/pkt_cls.h
):
.in +4n
.B TC_ACT_OK (0)
, will terminate the packet processing pipeline and allow the packet to
proceed
.br
.B TC_ACT_SHOT (2)
, will terminate the packet processing pipeline and drop the packet
.br
.B TC_ACT_UNSPEC (-1)
, will use the default action configured from tc (similarly as returning
.B -1
from a classifier)
.br
.B TC_ACT_PIPE (3)
, will iterate to the next action, if available
.br
.B TC_ACT_RECLASSIFY (1)
, will terminate the packet processing pipeline and start classification
from the beginning
.br
.B else
, everything else is an unspecified return code
.in

Both classifier and action return codes are supported in eBPF and cBPF
programs.
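
As a quick reference, the action return codes listed above can be written
down as a small user space lookup. The enum and the function
.B act_ret_describe
are hypothetical names used only in this sketch; real programs should use
the TC_ACT_* constants from
.B linux/pkt_cls.h
instead.

.in +4n
.nf
.sp
```c
#include <stddef.h>

/* Local stand-ins mirroring the values listed above. */
enum {
    ACT_UNSPEC     = -1,
    ACT_OK         = 0,
    ACT_RECLASSIFY = 1,
    ACT_SHOT       = 2,
    ACT_PIPE       = 3,
};

/* Returns a short description, or NULL for an unspecified return code. */
static const char *act_ret_describe(int ret)
{
    switch (ret) {
    case ACT_OK:
        return "terminate pipeline, packet proceeds";
    case ACT_SHOT:
        return "terminate pipeline, packet is dropped";
    case ACT_UNSPEC:
        return "use default action configured from tc";
    case ACT_PIPE:
        return "iterate to the next action";
    case ACT_RECLASSIFY:
        return "terminate pipeline, restart classification";
    default:
        return NULL;
    }
}
```
.fi
.in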

To demonstrate restricted C syntax, a minimal toy classifier example is
provided, which assumes that egress packets, for instance originating
from a container, have previously been marked in interval [0, 255]. The
program keeps statistics on different marks for user space and maps the
classid to the root qdisc with the marking itself as the minor handle:

.in +4n
.nf
.sp
#include <stdint.h>
#include <asm/types.h>

#include <linux/bpf.h>
#include <linux/pkt_sched.h>

#include "helpers.h"

struct tuple {
    long packets;
    long bytes;
};

#define BPF_MAP_ID_STATS  1 /* agent's map identifier */
#define BPF_MAX_MARK      256

struct bpf_elf_map __section("maps") map_stats = {
    .type       = BPF_MAP_TYPE_ARRAY,
    .id         = BPF_MAP_ID_STATS,
    .size_key   = sizeof(uint32_t),
    .size_value = sizeof(struct tuple),
    .max_elem   = BPF_MAX_MARK,
};

static inline void cls_update_stats(const struct __sk_buff *skb,
                                    uint32_t mark)
{
    struct tuple *tu;

    tu = bpf_map_lookup_elem(&map_stats, &mark);
    if (likely(tu)) {
        __sync_fetch_and_add(&tu->packets, 1);
        __sync_fetch_and_add(&tu->bytes, skb->len);
    }
}

__section("cls") int cls_main(struct __sk_buff *skb)
{
    uint32_t mark = skb->mark;

    if (unlikely(mark >= BPF_MAX_MARK))
        return 0;

    cls_update_stats(skb, mark);

    return TC_H_MAKE(TC_H_ROOT, mark);
}

char __license[] __section("license") = "GPL";
.fi
.in
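
The classid returned by this classifier can be worked out in user space.
The sketch below re-creates the handle macros from
.B linux/pkt_sched.h
under local names (H_MAKE, H_ROOT and the two masks are stand-ins for
TC_H_MAKE, TC_H_ROOT, TC_H_MAJ_MASK and TC_H_MIN_MASK), and
.B classid_for_mark
is a hypothetical helper mirroring the logic of cls_main without the
statistics update.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/* Local stand-ins for the handle macros in linux/pkt_sched.h. */
#define H_MAJ_MASK 0xFFFF0000u
#define H_MIN_MASK 0x0000FFFFu
#define H_ROOT     0xFFFFFFFFu
#define H_MAKE(maj, min) (((maj) & H_MAJ_MASK) | ((min) & H_MIN_MASK))

#define MAX_MARK 256u

/* Returns 0 (mismatch) for marks outside [0, 255], else ffff:<mark>. */
static uint32_t classid_for_mark(uint32_t mark)
{
    if (mark >= MAX_MARK)
        return 0;

    return H_MAKE(H_ROOT, mark);
}
```
.fi
.in

A mark of 5 therefore results in classid 0xFFFF0005, that is major
ffff: with minor 5, while a mark of 256 is reported as a mismatch.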

Another small example is a port redirector which demuxes destination port
80 into the interval [8080, 8087] steered by RSS, that can then be attached
to ingress qdisc. The exercise of adding the egress counterpart and IPv6
support is left to the reader:

.in +4n
.nf
.sp
#include <asm/types.h>
#include <asm/byteorder.h>

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#include "helpers.h"

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                 __u16 old_port, __u16 new_port)
{
    bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                        old_port, new_port, sizeof(new_port));
    bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                        &new_port, sizeof(new_port), 0);
}

static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
{
    __u16 dport, dport_new = 8080, off;
    __u8 ip_proto, ip_vl;

    ip_proto = load_byte(skb, nh_off +
                         offsetof(struct iphdr, protocol));
    if (ip_proto != IPPROTO_TCP)
        return 0;

    ip_vl = load_byte(skb, nh_off);
    if (likely(ip_vl == 0x45))
        nh_off += sizeof(struct iphdr);
    else
        nh_off += (ip_vl & 0xF) << 2;

    dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
    if (dport != 80)
        return 0;

    off = skb->queue_mapping & 7;
    set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                  __cpu_to_be16(dport_new + off));
    return -1;
}

__section("lb") int lb_main(struct __sk_buff *skb)
{
    int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

    if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
        ret = lb_do_ipv4(skb, nh_off);

    return ret;
}

char __license[] __section("license") = "GPL";
.fi
.in
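
The two pieces of arithmetic in this redirector, the IPv4 header length
and the choice of the new port, can be checked in isolation in user
space. The function names
.B ipv4_hdr_len
and
.B redirect_port
are hypothetical and only restate the expressions used in lb_do_ipv4.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/*
 * IPv4 header length in bytes, taken from the first header byte:
 * the low nibble (IHL) counts 32 bit words, hence the shift by 2.
 */
static unsigned int ipv4_hdr_len(uint8_t ip_vl)
{
    return (unsigned int)(ip_vl & 0xF) << 2;
}

/* New destination port: 8080 plus the low three bits of the queue. */
static uint16_t redirect_port(uint32_t queue_mapping)
{
    return (uint16_t)(8080 + (queue_mapping & 7));
}
```
.fi
.in

The common first byte 0x45 gives a header length of 20 bytes, and queues
0 through 7 map to ports 8080 through 8087, with higher queue numbers
wrapping around.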

The related helper header file
.B helpers.h
in both examples was:

.in +4n
.nf
.sp
/* Misc helper macros. */
#define __section(x) __attribute__((section(x), used))
#define offsetof(x, y) __builtin_offsetof(x, y)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Used map structure */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 id;
};

/* Some used BPF function calls. */
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                  int len, int flags) =
    (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                  int to, int flags) =
    (void *) BPF_FUNC_l4_csum_replace;
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
    (void *) BPF_FUNC_map_lookup_elem;

/* Some used BPF intrinsics. */
unsigned long long load_byte(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.byte");
unsigned long long load_half(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.half");
.fi
.in

As a best practice, we recommend having only a single eBPF classifier loaded
in tc and performing
.B all
necessary matching and mangling from there instead of a list of individual
classifiers and separate actions. A single classifier tailored for a
given use-case will be most efficient to run.

.SS eBPF DEBUGGING

Both tc
.B filter
and
.B action
commands for
.B bpf
support an optional
.B verbose
parameter that can be used to inspect the eBPF verifier log. It is dumped
by default in case of an error.

In case the eBPF/cBPF JIT compiler has been enabled, it can also be
instructed to emit a debug output of the resulting opcode image into
the kernel log, which can be read via
.B dmesg(1)
:

.in +4n
.B echo 2 > /proc/sys/net/core/bpf_jit_enable
.in

The Linux kernel source tree additionally ships under
.B tools/net/
a small helper called
.B bpf_jit_disasm
that reads out the opcode image dump from the kernel log and dumps the
resulting disassembly:

.in +4n
.B bpf_jit_disasm -o
.in

Other than that, the Linux kernel also contains an extensive eBPF/cBPF
test suite module called
.B test_bpf
\&. Upon ...

.in +4n
.B modprobe test_bpf
.in

\&... it performs a variety of test cases and dumps the results into
the kernel log that can be inspected with
.B dmesg(1)
\&. The results can differ depending on whether the JIT compiler is enabled
or not. In case of failed test cases, the module will fail to load. In
such cases, we urge you to file a bug report to the related JIT authors,
Linux kernel and networking mailing lists.

.SS cBPF

Although we generally recommend switching to implementing
.B eBPF
classifiers and actions, for the sake of completeness, a few words on how to
program in cBPF follow here.

Likewise, the
.B bpf_jit_enable
switch can be enabled as mentioned already. Tooling such as
.B bpf_jit_disasm
also works independently of whether eBPF or cBPF code is being loaded.

Unlike in eBPF, classifiers and actions are not implemented in restricted C,
but rather in a minimal assembler-like language or with the help of other
tooling.

The raw interface with tc takes opcodes directly. For example, the most
minimal classifier matching on every packet resulting in the default
classid of 1:1 looks like:

.in +4n
.B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1
.in

The first decimal of the bytecode sequence denotes the number of subsequent
4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. Here, this denotes an unconditional return
from the program with immediate value of -1.

For egress classification, Willem de Bruijn implemented a minimal stand-alone
helper tool under the GNU General Public License version 2 for the
.B iptables(8)
BPF extension, which abuses the
.B libpcap
internal classic BPF compiler. His code is adapted here for usage with
.B tc(8)
:

.in +4n
.nf
.sp
#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct bpf_program prog;
    struct bpf_insn *ins;
    int i, ret, dlt = DLT_RAW;

    /* Usage: prog [link-type] 'filter expression' */
    if (argc < 2 || argc > 3)
        return 1;
    if (argc == 3) {
        dlt = pcap_datalink_name_to_val(argv[1]);
        if (dlt == -1)
            return 1;
    }

    /* Compile the filter expression into cBPF instructions. */
    ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                              1, PCAP_NETMASK_UNKNOWN);
    if (ret)
        return 1;

    /* Emit the instruction count followed by the c t f k 4-tuples. */
    printf("%d,", prog.bf_len);
    ins = prog.bf_insns;

    for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
        printf("%u %u %u %u,", ins->code,
               ins->jt, ins->jf, ins->k);
    printf("%u %u %u %u",
           ins->code, ins->jt, ins->jf, ins->k);

    pcap_freecode(&prog);
    return 0;
}
.fi
.in

Given this small helper, any
.B tcpdump(8)
filter expression can be abused as a classifier where a match will
result in the default classid:

.in +4n
.B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

Basically, such a minimal generator is equivalent to:

.in +4n
.B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\\\n' ',' > /var/bpf/tcp-syn
.in

Since
.B libpcap
does not support all Linux-specific cBPF extensions in its compiler, the
Linux kernel also ships under
.B tools/net/
a minimal BPF assembler called
.B bpf_asm
for providing full control. For detailed syntax and semantics on implementing
such programs by hand, see references under
.B FURTHER READING
\&.

Trivial toy example in
.B bpf_asm
for classifying IPv4/TCP packets, saved in a text file called
.B foobar
:

.in +4n
.nf
.sp
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0
.fi
.in

Similarly, such a classifier can be loaded as:

.in +4n
.B bpf_asm foobar > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

For BPF classifiers, the Linux kernel additionally provides under
.B tools/net/
a small BPF debugger called
.B bpf_dbg
, which can be used to test a classifier against pcap files, single-step
or add various breakpoints into the classifier program and dump register
contents during runtime.

Implementing an action in classic BPF is rather limited in the sense that
packet mangling is not supported. Therefore, it's generally recommended to
make the switch to eBPF, whenever possible.

.SH FURTHER READING
Further and more technical details about the BPF architecture can be found
in the Linux kernel source tree under
.B Documentation/networking/filter.txt
\&.

Further details on eBPF
.B tc(8)
examples can be found in the iproute2 source
tree under
.B examples/bpf/
\&.

.SH SEE ALSO
.BR tc (8),
.BR tc-ematch (8),
.BR bpf (2),
.BR bpf (4)

.SH AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel networking
mailing list:
.B <netdev@vger.kernel.org>