Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Though there are some distinct differences between BSD and Linux kernel
filtering, when we speak of BPF or LSF in a Linux context, we mean the
very same filtering mechanism in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code
and send it to the kernel via the SO_ATTACH_FILTER option. If your filter
code passes the kernel's check, filtering begins immediately on that
socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since closing a socket that
has a filter on it removes the filter automatically. The other, less
common case is attaching a different filter to a socket that already has
a filter running: the kernel takes care of removing the old one and
placing your new one in its place, assuming your filter has passed the
checks. If it fails the checks, the old filter remains on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process
to set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
such as the team driver, PTP code, etc., where BPF is being used.

 [1] Documentation/prctl/seccomp_filter.txt

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures:

struct sock_filter {	/* Filter block */
	__u16	code;	/* Actual filter code */
	__u8	jt;	/* Jump true */
	__u8	jf;	/* Jump false */
	__u32	k;	/* Generic multiuse field */
};

Such structures are assembled into an array of 4-tuples, each containing
a code, jt, jf and k value. jt and jf are jump offsets and k is a generic
value to be used for a provided code.

struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
	unsigned short		   len;	/* Number of filter blocks */
	struct sock_filter __user *filter;
};

For socket filtering, a pointer to this structure (as shown in the
following example) is passed to the kernel through setsockopt(2).

Example
-------

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <linux/if_ether.h>
#include <linux/filter.h>
/* ... */

/* ARRAY_SIZE() is a kernel macro; user space has to define it itself. */
#define ARRAY_SIZE(x)	(sizeof(x) / sizeof((x)[0]))

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  8, 0x000086dd },
	{ 0x30,  0,  0, 0x00000014 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0, 17, 0x00000011 },
	{ 0x28,  0,  0, 0x00000036 },
	{ 0x15, 14,  0, 0x00000016 },
	{ 0x28,  0,  0, 0x00000038 },
	{ 0x15, 12, 13, 0x00000016 },
	{ 0x15,  0, 12, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0,  8, 0x00000011 },
	{ 0x28,  0,  0, 0x00000014 },
	{ 0x45,  6,  0, 0x00001fff },
	{ 0xb1,  0,  0, 0x0000000e },
	{ 0x48,  0,  0, 0x0000000e },
	{ 0x15,  2,  0, 0x00000016 },
	{ 0x48,  0,  0, 0x00000010 },
	{ 0x15,  0,  1, 0x00000016 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
	.len = ARRAY_SIZE(code),
	.filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
	/* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
	/* ... bail out ... */

/* ... */
close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER does not take a meaningful
argument (the value passed is ignored), while SO_LOCK_FILTER, which
prevents the filter from being detached, takes an integer value of 0
or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative when i) using/linking
to libpcap is not an option, ii) the required BPF filters use Linux
extensions that are not supported by libpcap's compiler, iii) a filter
is more complex and not cleanly implementable with libpcap's compiler,
or iv) particular filter codes should be optimized differently than
libpcap's internal compiler does. For example, xt_bpf and cls_bpf users
might have requirements that could result in more complex filter code,
or code that cannot be expressed with libpcap (e.g. different return
codes for various code paths). Moreover, BPF JIT implementors may wish
to manually write test cases and thus need low-level access to BPF code
as well.

BPF engine and instruction set
------------------------------

Under tools/net/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for the example scenarios mentioned in
the previous section. The asm-like syntax mentioned here has been
implemented in bpf_asm and will be used for further explanations (instead
of dealing with less readable opcodes directly; the principles are the
same). The syntax is closely modelled after Steven McCanne's and Van
Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  Element          Description

  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned):

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.
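To make the encoding more concrete, the following sketch builds a small
filter that accepts ARP frames (EtherType 0x806) twice: once with the
BPF_STMT()/BPF_JUMP() helper macros and class/size/mode constants from
<linux/filter.h>, and once as raw { op, jt, jf, k } tuples. The array
and function names are made up for this illustration.

```c
#include <linux/filter.h>
#include <stddef.h>

/* Built from the uapi macros: class | size | mode make up the opcode. */
static const struct sock_filter arp_macros[] = {
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),			/* ldh [12] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x806, 0, 1),	/* jeq #0x806 */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),			/* ret #-1 */
	BPF_STMT(BPF_RET | BPF_K, 0),				/* ret #0 */
};

/* The same program written as plain { op, jt, jf, k } tuples. */
static const struct sock_filter arp_raw[] = {
	{ 0x28, 0, 0, 0x0000000c },
	{ 0x15, 0, 1, 0x00000806 },
	{ 0x06, 0, 0, 0xffffffff },
	{ 0x06, 0, 0, 0x00000000 },
};

/* Hypothetical helper: returns 1 if both encodings agree in every field. */
int arp_encodings_match(void)
{
	size_t i, n = sizeof(arp_raw) / sizeof(arp_raw[0]);

	if (n != sizeof(arp_macros) / sizeof(arp_macros[0]))
		return 0;

	for (i = 0; i < n; i++) {
		if (arp_macros[i].code != arp_raw[i].code ||
		    arp_macros[i].jt != arp_raw[i].jt ||
		    arp_macros[i].jf != arp_raw[i].jf ||
		    arp_macros[i].k != arp_raw[i].k)
			return 0;
	}
	return 1;
}
```

The jump offsets jt and jf count instructions relative to the one that
follows the jump, which is why "jf = 1" in the second tuple skips the
accepting return and lands on the rejecting one.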

The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:

  Instruction      Addressing mode      Description

  ld               1, 2, 3, 4, 10       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 10          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8                 Jump on A == k
  jneq             8                    Jump on A != k
  jne              8                    Jump on A != k
  jlt              8                    Jump on A < k
  jle              8                    Jump on A <= k
  jgt              7, 8                 Jump on A > k
  jge              7, 8                 Jump on A >= k
  jset             7, 8                 Jump on A & k

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 9                 Return

The next table shows addressing formats from the 2nd column:

  Addressing mode  Syntax               Description

   0               x/%x                 Register X
   1               [k]                  BHW at byte offset k in the packet
   2               [x + k]              BHW at the offset X + k in the packet
   3               M[k]                 Word at offset k in M[]
   4               #k                   Literal value stored in k
   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
   6               L                    Jump label L
   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
   8               #k,Lt                Jump to Lt if predicate is true
   9               a/%a                 Accumulator A
  10               extension            BPF extension

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a
BPF extension is loaded into A.

Possible BPF extensions are shown in the following table:

  Extension                             Description

  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->rxhash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              vlan_tx_tag_get(skb)
  vlan_pr                               vlan_tx_tag_present(skb)

These extensions can also be prefixed with '#'.
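The "overloading" of k can be sketched in C with the SKF_AD_* constants
from <linux/filter.h>: an extension load is an ordinary absolute word
load whose k holds the negative base SKF_AD_OFF plus the offset of the
extension. The helper name ext_load() is made up for this illustration.

```c
#include <linux/filter.h>

/*
 * Hypothetical helper: build the load instruction for one extension,
 * e.g. ext_load(SKF_AD_PKTTYPE) for what bpf_asm calls "type".
 */
struct sock_filter ext_load(int ext)
{
	struct sock_filter insn = {
		.code = BPF_LD | BPF_W | BPF_ABS,	/* plain "ld [k]" opcode */
		.jt   = 0,
		.jf   = 0,
		/* Negative base offset plus extension offset, stored as u32. */
		.k    = (__u32)(SKF_AD_OFF + ext),
	};

	return insn;
}
```

With SKF_AD_OFF being -0x1000 and SKF_AD_PKTTYPE being 4, the k field of
ext_load(SKF_AD_PKTTYPE) ends up as 0xfffff004, an offset no real packet
load would use.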

Examples for low-level BPF:

** ARP packets:

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

** IPv4 TCP packets:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

** (Accelerated) VLAN w/ id 10:

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

** SECCOMP filter example:

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that
xt_bpf and cls_bpf understand and can be loaded with directly. Example
with the above ARP code:

$ ./bpf_asm foo
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output:

$ ./bpf_asm -c foo
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  1, 0x00000806 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },
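To see what these four opcodes do, they can be walked through with a toy
evaluator. The sketch below is not the kernel's interpreter: it is a
hypothetical user-space helper (toy_bpf_run()) that only understands the
ldh/ldb absolute loads, jeq with a constant and ret with a constant, and
rejects the packet on anything else. The packet bytes are the Ethernet
header of the ARP frame shown in the bpf_dbg packet dump later in this
document.

```c
#include <linux/filter.h>
#include <stddef.h>
#include <stdint.h>

/* Toy evaluator for a tiny subset of classic BPF; returns the ret value. */
uint32_t toy_bpf_run(const struct sock_filter *prog, size_t len,
		     const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct sock_filter *i = &prog[pc];

		switch (i->code) {
		case BPF_LD | BPF_H | BPF_ABS:	/* ldh [k], network byte order */
			if (pkt_len < 2 || i->k > pkt_len - 2)
				return 0;
			A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
			break;
		case BPF_LD | BPF_B | BPF_ABS:	/* ldb [k] */
			if (i->k >= pkt_len)
				return 0;
			A = pkt[i->k];
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:	/* jeq #k, jt, jf */
			/* Offsets are relative to the next instruction. */
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case BPF_RET | BPF_K:		/* ret #k */
			return i->k;
		default:			/* unsupported: reject */
			return 0;
		}
	}
	return 0;
}

/* Hypothetical demo: run the ARP opcodes against an ARP Ethernet header. */
uint32_t toy_arp_demo(void)
{
	static const struct sock_filter arp[] = {
		{ 0x28, 0, 0, 0x0000000c },
		{ 0x15, 0, 1, 0x00000806 },
		{ 0x06, 0, 0, 0xffffffff },
		{ 0x06, 0, 0, 0x00000000 },
	};
	static const uint8_t eth_hdr[] = {
		0x00, 0x19, 0xcb, 0x55, 0x55, 0xa4,	/* destination MAC */
		0x00, 0x14, 0xa4, 0x43, 0x78, 0x69,	/* source MAC */
		0x08, 0x06,				/* EtherType: ARP */
	};

	return toy_bpf_run(arp, sizeof(arp) / sizeof(arp[0]),
			   eth_hdr, sizeof(eth_hdr));
}
```

For this header the half-word at offset 12 is 0x0806, so the jump falls
through to the first return and toy_arp_demo() yields 0xffffffff, i.e.
the whole packet is accepted.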

In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets, and dumping the BPF machine registers.

Starting bpf_dbg is trivial and just requires issuing:

# ./bpf_dbg

If input and output should not be stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

> load pcap foo.pcap
  Loads standard tcpdump pcap file.

> run [<n>]
bpf passes:1 fails:9
  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

> disassemble
l0:	ldh [12]
l1:	jeq #0x800, l2, l5
l2:	ldb [23]
l3:	jeq #0x1, l4, l5
l4:	ret #0xffff
l5:	ret #0
  Prints out BPF code disassembly.

> dump
/* { op, jt, jf, k }, */
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  3, 0x00000800 },
{ 0x30,  0,  0, 0x00000017 },
{ 0x15,  0,  1, 0x00000001 },
{ 0x06,  0,  0, 0x0000ffff },
{ 0x06,  0,  0, 0000000000 },
  Prints out C-style BPF code dump.

> breakpoint 0
breakpoint at: l0:	ldh [12]
> breakpoint 1
breakpoint at: l1:	jeq #0x800, l2, l5
  ...
  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  > run
  -- register dump --
  pc:       [0]                       <-- program counter
  code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
  curr:     l0:	ldh [12]              <-- disassembly of current instruction
  A:        [00000000][0]             <-- content of A (hex, decimal)
  X:        [00000000][0]             <-- content of X (hex, decimal)
  M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
  -- packet dump --                   <-- Current packet from pcap (hex)
  len: 42
    0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
   16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
   32: 00 00 00 00 00 00 0a 3b 01 01
  (breakpoint)
  >

> breakpoint
breakpoints: 0 1
  Prints currently set breakpoints.

> step [-<n>, +<n>]
  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

> select <n>
  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

> quit
#
  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
ARM and s390, which can be enabled through CONFIG_BPF_JIT. The JIT compiler
is transparently invoked for each filter attached from user space or for
internal kernel users if it has been previously enabled by root:

  echo 1 > /proc/sys/net/core/bpf_jit_enable
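A program that wants to know whether the JIT is currently enabled can
read the same file back. The following is a minimal sketch; the helper
name read_bpf_jit_enable() is made up, and the sketch assumes that a
missing or unreadable file simply means that no setting is available
(for example because the kernel was built without CONFIG_BPF_JIT).

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the current bpf_jit_enable setting,
 * or -1 if the sysctl file cannot be read or parsed.
 */
int read_bpf_jit_enable(void)
{
	FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
	int val = -1;

	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}
```

A return value of 0 means the JIT is disabled, while 1 and 2 correspond
to the two echo commands described in this section.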

For JIT developers, or when doing audits etc., each compile run can output
the generated opcode image into the kernel log via:

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg:

[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

In the kernel source tree under tools/net/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump:

# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push   %rbp
   1:	mov    %rsp,%rbp
   4:	sub    $0x60,%rsp
   8:	mov    %rbx,-0x8(%rbp)
   c:	mov    0x68(%rdi),%r9d
  10:	sub    0x6c(%rdi),%r9d
  14:	mov    0xd8(%rdi),%r8
  1b:	mov    $0xc,%esi
  20:	callq  0xffffffffe0ff9442
  25:	cmp    $0x800,%eax
  2a:	jne    0x0000000000000042
  2c:	mov    $0x17,%esi
  31:	callq  0xffffffffe0ff945e
  36:	cmp    $0x1,%eax
  39:	jne    0x0000000000000042
  3b:	mov    $0xffff,%eax
  40:	jmp    0x0000000000000044
  42:	xor    %eax,%eax
  44:	leaveq
  45:	retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers:

# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push   %rbp
	55
   1:	mov    %rsp,%rbp
	48 89 e5
   4:	sub    $0x60,%rsp
	48 83 ec 60
   8:	mov    %rbx,-0x8(%rbp)
	48 89 5d f8
   c:	mov    0x68(%rdi),%r9d
	44 8b 4f 68
  10:	sub    0x6c(%rdi),%r9d
	44 2b 4f 6c
  14:	mov    0xd8(%rdi),%r8
	4c 8b 87 d8 00 00 00
  1b:	mov    $0xc,%esi
	be 0c 00 00 00
  20:	callq  0xffffffffe0ff9442
	e8 1d 94 ff e0
  25:	cmp    $0x800,%eax
	3d 00 08 00 00
  2a:	jne    0x0000000000000042
	75 16
  2c:	mov    $0x17,%esi
	be 17 00 00 00
  31:	callq  0xffffffffe0ff945e
	e8 28 94 ff e0
  36:	cmp    $0x1,%eax
	83 f8 01
  39:	jne    0x0000000000000042
	75 07
  3b:	mov    $0xffff,%eax
	b8 ff ff 00 00
  40:	jmp    0x0000000000000044
	eb 02
  42:	xor    %eax,%eax
	31 c0
  44:	leaveq
	c9
  45:	retq
	c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>