# BPF Compiler Collection (BCC)

BCC is a toolkit for creating efficient kernel tracing and manipulation
programs, and includes several useful tools and examples. It makes use of eBPF
(Extended Berkeley Packet Filters), a new feature that was first added to
Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.

eBPF was [described by](https://lkml.org/lkml/2015/4/14/232) Ingo Molnár as:

> One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively.

BCC makes eBPF programs easier to write, with kernel instrumentation in C
and a front-end in Python. It is suited for many tasks, including performance
analysis and network traffic control.

## Screenshot

This example traces a disk I/O kernel function, and populates an in-kernel
power-of-2 histogram of the I/O size. For efficiency, only the histogram
summary is returned to user-level.

```Shell
# ./bitehist.py
Tracing... Hit Ctrl-C to end.
^C
     value           : count     distribution
       0 -> 1        : 3        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 211      |**********                            |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 800      |**************************************|
```

The above output shows a bimodal distribution, where the largest mode of
800 I/Os was between 128 and 255 Kbytes in size.

See the source: [bitehist.c](examples/bitehist.c) and
[bitehist.py](examples/bitehist.py). What this traces, what this stores, and how
the data is presented can be entirely customized. This shows only some of
the many possible capabilities.

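To make the bucketing concrete, here is a minimal userspace sketch of a power-of-2 histogram in plain Python. It is not the code from `bitehist.c` or `bitehist.py`; the function names (`log2_slot`, `slot_range`, `format_histogram`) and the 38-character bar width are assumptions chosen to resemble the output above.

```python
from collections import Counter


def log2_slot(value):
    """Return the power-of-2 bucket index for a non-negative integer."""
    # Slot 0 covers 0..1, slot 1 covers 2..3, slot 2 covers 4..7, and so on.
    return max(value, 1).bit_length() - 1


def slot_range(slot):
    """Return the inclusive (low, high) range covered by a bucket index."""
    low = 0 if slot == 0 else 1 << slot
    high = (1 << (slot + 1)) - 1
    return low, high


def format_histogram(values, width=38):
    """Render bucket counts as text rows, with a star bar scaled to `width`."""
    counts = Counter(log2_slot(v) for v in values)
    if not counts:
        return []
    peak = max(counts.values())
    rows = []
    for slot in range(max(counts) + 1):
        low, high = slot_range(slot)
        count = counts.get(slot, 0)
        stars = "*" * (count * width // peak)
        rows.append("%8d -> %-8d : %-8d |%-*s|" % (low, high, count, width, stars))
    return rows


# Hypothetical sample sizes, only to exercise the functions.
for row in format_histogram([4, 5, 6, 130, 200, 255, 255]):
    print(row)
```

Only the per-bucket counters need to be kept, which is why returning just the summary to user-level is cheap compared with copying every event.
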
## Installing

See [INSTALL.md](INSTALL.md) for installation steps on your platform.

## Contents

Some of these are single files that contain both C and Python, others have a
pair of .c and .py files, and some are directories of files.

### Tracing

Examples:

- examples/[bitehist.py](examples/bitehist.py) examples/[bitehist.c](examples/bitehist.c): Block I/O size histogram. [Examples](examples/bitehist_example.txt).
- examples/[disksnoop.py](examples/disksnoop.py) examples/[disksnoop.c](examples/disksnoop.c): Trace block device I/O latency. [Examples](examples/disksnoop_example.txt).
- examples/[hello_world.py](examples/hello_world.py): Prints "Hello, World!" for new processes.
- examples/[trace_fields.py](examples/trace_fields.py): Simple example of printing fields from traced events.
- examples/[vfsreadlat.py](examples/vfsreadlat.py) examples/[vfsreadlat.c](examples/vfsreadlat.c): VFS read latency distribution. [Examples](examples/vfsreadlat_example.txt).

Tools:

- tools/[biolatency](tools/biolatency): Summarize block device I/O latency as a histogram. [Examples](tools/biolatency_example.txt).
- tools/[biosnoop](tools/biosnoop): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[funccount](tools/funccount): Count kernel function calls. [Examples](tools/funccount_example.txt).
- tools/[funclatency](tools/funclatency): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[killsnoop](tools/killsnoop): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[opensnoop](tools/opensnoop): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[syncsnoop](tools/syncsnoop): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[vfscount](tools/vfscount) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat](tools/vfsstat) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).

### Networking

Examples:

- examples/[distributed_bridge/](examples/distributed_bridge): Distributed bridge example.
- examples/[simple_tc.py](examples/simple_tc.py): Simple traffic control example.
- examples/[simulation.py](examples/simulation.py): Simulation helper.
- examples/[tc_neighbor_sharing.py](examples/tc_neighbor_sharing.py) examples/[tc_neighbor_sharing.c](examples/tc_neighbor_sharing.c): Per-IP classification and rate limiting.
- examples/[tunnel_monitor/](examples/tunnel_monitor): Efficiently monitor traffic flows. [Example video](https://www.youtube.com/watch?v=yYy3Cwce02k).
- examples/[vlan_learning.py](examples/vlan_learning.py) examples/[vlan_learning.c](examples/vlan_learning.c): Demux Ethernet traffic into worker veth+namespaces.

## Motivation

BPF guarantees that the programs loaded into the kernel cannot crash and
cannot run forever, yet BPF is general purpose enough to perform many
arbitrary types of computation. Currently, it is possible to write a program in
C that will compile into a valid BPF program, yet it is vastly easier to
write a C program that will compile into invalid BPF (C is like that). The user
won't know whether the program is valid until trying to run it.

With a BPF-specific frontend, one should be able to write in a language and
receive feedback from the compiler on the validity as it pertains to a BPF
backend. This toolkit aims to provide a frontend that can only create valid BPF
programs while still harnessing its full flexibility.

Furthermore, current integrations with BPF have a kludgy workflow, sometimes
involving compiling directly in a Linux kernel source tree. This toolchain aims
to minimize the time that a developer spends getting BPF compiled, so that the
focus can instead be on the applications that can be written and the problems
that can be solved with BPF.

The features of this toolkit include:
* End-to-end BPF workflow in a shared library
* A modified C language for BPF backends
* Integration with llvm-bpf backend for JIT
* Dynamic (un)loading of JITed programs
* Support for BPF kernel hooks: socket filters, tc classifiers,
  tc actions, and kprobes
* Bindings for Python
* Examples for socket filters, tc classifiers, and kprobes
* Self-contained tools for tracing a running system

In the future, bindings for languages other than Python will likely be
supported. Feel free to add support for the language of your choice and send a
pull request!

## Tutorial

The BCC toolchain is currently composed of two parts: a C wrapper around LLVM,
and a Python API to interact with the running program. Later, we will go into
more detail of how this all works.

### Hello, World

First, we should import the BPF class from the `bcc` module:
```python
from bcc import BPF
```

Since the C code is so short, we will embed it inside the Python script.

The BPF program always takes at least one argument, which is a pointer to the
context for this type of program. Different program types have different calling
conventions, but for this one we don't care, so `void *` is fine.
```python
BPF(text='void kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); }').trace_print()
```

For this example, we will call the program every time `fork()` is called by a
userspace process. Under the hood, `fork()` translates to the `clone` syscall.
BCC recognizes the prefix `kprobe__` and will automatically attach our program
to the kernel symbol `sys_clone`.

The Python process will then print the trace printk circular buffer until Ctrl-C
is pressed. The BPF program is removed from the kernel when the userspace
process that loaded it closes the fd (or exits).

Output:
```
bcc/examples$ sudo python hello_world.py
python-7282 [002] d... 3757.488508: : Hello, World!
```

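A line like the one above can also be split into its parts programmatically. The following is a minimal sketch that assumes the layout of the sample line (task and pid, CPU in brackets, a flags column, a timestamp, then the message). The field names such as `task`, `flags`, and `message` are labels chosen here for illustration, and `parse_trace_line` is a hypothetical helper, not part of BCC.

```python
import re

# Pattern modelled on the sample output line only; other kernels or trace
# options may produce a different layout.
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<timestamp>\d+\.\d+):\s*(?P<function>[^:]*):\s*(?P<message>.*)$"
)


def parse_trace_line(line):
    """Split one trace line into a dict, or return None if it does not match."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["cpu"] = int(fields["cpu"])
    fields["timestamp"] = float(fields["timestamp"])
    return fields


sample = "python-7282 [002] d... 3757.488508: : Hello, World!"
print(parse_trace_line(sample))
```
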
For an explanation of the meaning of the printed fields, see the trace_pipe
section of the [kernel ftrace doc](https://www.kernel.org/doc/Documentation/trace/ftrace.txt).

[Source code listing](examples/hello_world.py)

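The `kprobe__` prefix is a naming shortcut. As a hedged sketch, the same program can presumably be attached explicitly with `attach_kprobe`, the call used in the tracing example of this tutorial. The function name `hello` and the `run` wrapper are hypothetical, and the symbol `sys_clone` is taken from the text above; it may be named differently on other kernels.

```python
# C source for the BPF program. The function is named `hello` (a hypothetical
# name) rather than `kprobe__sys_clone`, so nothing is attached automatically.
PROGRAM = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\n");
    return 0;
}
"""


def run():
    """Load the program, attach it to sys_clone, and print trace output."""
    # Imported here so the module can be read without BCC installed;
    # actually running this needs BCC and root privileges.
    from bcc import BPF

    b = BPF(text=PROGRAM)
    b.attach_kprobe(event="sys_clone", fn_name="hello")
    b.trace_print()
```

Calling `run()` as root should behave like the one-liner above; the explicit form is useful when the function name and the probed symbol need to differ.
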
### Networking

At Red Hat Summit 2015, BCC was presented as part of a [session on BPF](http://www.devnation.org/#7784f1f7513e8542e4db519e79ff5eec).
A multi-host VXLAN environment is simulated and a BPF program is used to monitor
one of the physical interfaces. The BPF program keeps statistics on the inner
and outer IP addresses traversing the interface, and the userspace component
turns those statistics into a graph showing the traffic distribution at
multiple granularities. See the code [here](examples/tunnel_monitor).

[Example video](https://youtu.be/yYy3Cwce02k)

### Tracing

Here is a slightly more complex tracing example than Hello World. This program
will be invoked for every task change in the kernel, and records the new and old
pids in a BPF map.

The C program below introduces two new concepts.
The first is the macro `BPF_TABLE`. This defines a table (type="hash"), with key
type `key_t` and leaf type `u64` (a single counter). The table name is `stats`,
containing 1024 entries maximum. One can `lookup`, `lookup_or_init`, `update`,
and `delete` entries from the table.

The second concept is the `prev` argument. This argument is treated specially by
the BCC frontend, such that accesses to this variable are read from the saved
context that is passed by the kprobe infrastructure. The prototype of the args
starting from position 1 should match the prototype of the kernel function being
kprobed. If so, the program will have seamless access to the function
parameters.
```c
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
  u32 prev_pid;
  u32 curr_pid;
};
// map_type, key_type, leaf_type, table_name, num_entry
BPF_TABLE("hash", struct key_t, u64, stats, 1024);
// attach to finish_task_switch in kernel/sched/core.c, which has the following
// prototype:
//   struct rq *finish_task_switch(struct task_struct *prev)
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
  struct key_t key = {};
  u64 zero = 0, *val;

  // the helper returns a u64; storing it in a u32 keeps the lower 32 bits
  key.curr_pid = bpf_get_current_pid_tgid();
  key.prev_pid = prev->pid;

  // fetch the counter for this (prev, curr) pair, creating it as 0 if absent
  val = stats.lookup_or_init(&key, &zero);
  (*val)++;
  return 0;
}
```
[Source code listing](examples/task_switch.c)

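For readers more at home in Python, the table operations named above can be pictured with a toy userspace model. This is only a sketch of the semantics as described in this section, not how BCC or the kernel implement `BPF_TABLE`. The class `HashTableModel` and the function `count_sched_model` are hypothetical, and the choice to refuse new keys once the table is full is an assumption of this model.

```python
class HashTableModel:
    """Toy stand-in for a bounded hash table with BPF_TABLE-like operations."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}

    def lookup(self, key):
        """Return the leaf for `key`, or None if it is absent."""
        return self.entries.get(key)

    def lookup_or_init(self, key, init):
        """Return the leaf for `key`, first storing `init` if it is absent."""
        if key not in self.entries:
            if len(self.entries) >= self.max_entries:
                return None  # assumption: a full table refuses new keys
            self.entries[key] = init
        return self.entries[key]

    def update(self, key, leaf):
        """Store `leaf` under `key`; return False if the table is full."""
        if key not in self.entries and len(self.entries) >= self.max_entries:
            return False
        self.entries[key] = leaf
        return True

    def delete(self, key):
        """Remove `key` if present."""
        self.entries.pop(key, None)


def count_sched_model(stats, prev_pid, curr_pid):
    """Mimic the counting logic of count_sched on the toy table."""
    key = (prev_pid, curr_pid)
    current = stats.lookup_or_init(key, 0)
    if current is not None:
        stats.update(key, current + 1)


stats = HashTableModel(max_entries=1024)
count_sched_model(stats, 0, 7282)
count_sched_model(stats, 0, 7282)
print(stats.lookup((0, 7282)))
```
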
The userspace component loads the C file shown above (`task_switch.c`) and
attaches it to the `finish_task_switch` kernel function.
The `[]` operator of the BPF object gives access to each `BPF_TABLE` in the
program, allowing pass-through access to the values residing in the kernel. Use
the object as you would any other Python dict object: reads, updates, and
deletes are all allowed.
```python
from bcc import BPF
from time import sleep

b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")

# generate many schedule events
for i in range(0, 100): sleep(0.01)

for k, v in b["stats"].items():
    print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
```
[Source code listing](examples/task_switch.py)

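Because the table behaves like a dict, its entries can be post-processed with ordinary Python. The sketch below ranks the busiest task switches. It assumes entries expose `prev_pid`, `curr_pid`, and `.value` as in the listing above, and uses `ctypes` stand-ins so it can run without BCC. `Key`, `top_switches`, and the sample data are hypothetical.

```python
import ctypes


class Key(ctypes.Structure):
    """Stand-in for struct key_t from the C program (illustration only)."""
    _fields_ = [("prev_pid", ctypes.c_uint32), ("curr_pid", ctypes.c_uint32)]


def top_switches(items, limit=5):
    """Return the `limit` most frequent (prev_pid, curr_pid, count) tuples."""
    ranked = sorted(items, key=lambda kv: kv[1].value, reverse=True)
    return [(k.prev_pid, k.curr_pid, v.value) for k, v in ranked[:limit]]


# Made-up entries standing in for b["stats"].items().
sample = [
    (Key(0, 7282), ctypes.c_uint64(3)),
    (Key(7282, 0), ctypes.c_uint64(9)),
    (Key(1, 2), ctypes.c_uint64(1)),
]

for prev_pid, curr_pid, count in top_switches(sample, limit=2):
    print("task_switch[%5d->%5d]=%u" % (prev_pid, curr_pid, count))
```

In the real script, `b["stats"].items()` would be passed in place of `sample`.
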
## Getting started

See [INSTALL.md](INSTALL.md) for installation steps on your platform.