funclatency: user functions support (#733)

* bcc: Allow custom bucket formatting for histogram keys

When histogram keys are complex structures (`ct.Structure`
created from C struct), they can't be used as dictionary keys
for counting purposes without a custom hashing function.
Allow the user to provide such hashing function when calling
`print_log_hist` to print the histogram.

* bcc: Allow regular expression in attach_uretprobe

Similarly to `attach_uprobe`, `attach_uretprobe` now supports
taking a regular expression.

* funclatency: Support user functions

funclatency now supports user functions (including regular
expressions) in addition to kernel functions. When multiple
processes are traced, the output is always per-function, per-
process. When a single process is traced, the output can be
combined for all traced functions (as with kernel functions).

Usage examples:

```
funclatency pthread:*mutex* -p 6449 -F
funclatency c:read
funclatency dd:* -p $(pidof dd) -uF
```
diff --git a/README.md b/README.md
index 29e3aad..07d4bf1 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,7 @@
 - tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
 - tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt).
 - tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
-- tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
+- tools/[funclatency](tools/funclatency.py): Time functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
 - tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
 - tools/[hardirqs](tools/hardirqs.py):  Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
 - tools/[killsnoop](tools/killsnoop.py): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
diff --git a/man/man8/funclatency.8 b/man/man8/funclatency.8
index 587546c..7b7771b 100644
--- a/man/man8/funclatency.8
+++ b/man/man8/funclatency.8
@@ -1,10 +1,10 @@
 .TH funclatency 8  "2015-08-18" "USER COMMANDS"
 .SH NAME
-funclatency \- Time kernel functions and print latency as a histogram.
+funclatency \- Time functions and print latency as a histogram.
 .SH SYNOPSIS
-.B funclatency [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-u] [\-m] [\-r] [\-F] pattern
+.B funclatency [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-u] [\-m] [\-F] [\-r] [\-v] pattern
 .SH DESCRIPTION
-This tool traces kernel function calls and times their duration (latency), and
+This tool traces function calls and times their duration (latency), and
 shows the latency distribution as a histogram. The time is measured from when
 the function is called to when it returns, and is inclusive of both on-CPU
 time and time spent blocked.
@@ -17,7 +17,7 @@
 functions, or groups of functions that run at the same stack layer, and
 don't ultimately call each other.
 
-WARNING: This uses dynamic tracing of (what can be many) kernel functions, an
+WARNING: This uses dynamic tracing of (what can be many) functions, an
 activity that has had issues on some kernel versions (risk of panics or
 freezes). Test, and know what you are doing, before use.
 
@@ -51,12 +51,19 @@
 .TP
 \-r
 Use regular expressions for the search pattern.
+.TP
+\-v
+Print the BPF program (for debugging purposes).
 .SH EXAMPLES
 .TP
 Time the do_sys_open() kernel function, and print the distribution as a histogram:
 #
 .B funclatency do_sys_open
 .TP
+Time the read() function in libc across all processes on the system:
+#
+.B funclatency c:read
+.TP
 Time vfs_read(), and print the histogram in units of microseconds:
 #
 .B funclatency \-u vfs_read
@@ -116,6 +123,6 @@
 .SH STABILITY
 Unstable - in development.
 .SH AUTHOR
-Brendan Gregg
+Brendan Gregg, Sasha Goldshtein
 .SH SEE ALSO
 funccount(8)
diff --git a/src/python/bcc/__init__.py b/src/python/bcc/__init__.py
index c6ae751..3d258a6 100644
--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -700,9 +700,9 @@
             raise Exception("Failed to detach BPF from uprobe")
         self._del_uprobe(ev_name)
 
-    def attach_uretprobe(self, name="", sym="", addr=None,
+    def attach_uretprobe(self, name="", sym="", sym_re="", addr=None,
             fn_name="", pid=-1, cpu=0, group_fd=-1):
-        """attach_uretprobe(name="", sym="", addr=None, fn_name=""
+        """attach_uretprobe(name="", sym="", sym_re="", addr=None, fn_name=""
                             pid=-1, cpu=0, group_fd=-1)
 
         Run the bpf function denoted by fn_name every time the symbol sym in
@@ -710,6 +710,13 @@
         meaning of additional parameters.
         """
 
+        if sym_re:
+            for sym_addr in self._get_user_functions(name, sym_re):
+                self.attach_uretprobe(name=name, addr=sym_addr,
+                                      fn_name=fn_name, pid=pid, cpu=cpu,
+                                      group_fd=group_fd)
+            return
+
         name = str(name)
         (path, addr) = BPF._check_path_symbol(name, sym, addr)
 
diff --git a/src/python/bcc/table.py b/src/python/bcc/table.py
index f34cb7b..d395fce 100644
--- a/src/python/bcc/table.py
+++ b/src/python/bcc/table.py
@@ -243,16 +243,17 @@
         return next_key
 
     def print_log2_hist(self, val_type="value", section_header="Bucket ptr",
-            section_print_fn=None):
+            section_print_fn=None, bucket_fn=None):
         """print_log2_hist(val_type="value", section_header="Bucket ptr",
-                           section_print_fn=None)
+                           section_print_fn=None, bucket_fn=None)
 
         Prints a table as a log2 histogram. The table must be stored as
         log2. The val_type argument is optional, and is a column header.
         If the histogram has a secondary key, multiple tables will print
         and section_header can be used as a header description for each.
         If section_print_fn is not None, it will be passed the bucket value
-        to format into a string as it sees fit.
+        to format into a string as it sees fit. If bucket_fn is not None,
+        it will be used to produce a bucket value for the histogram keys.
         """
         if isinstance(self.Key(), ct.Structure):
             tmp = {}
@@ -260,6 +261,8 @@
             f2 = self.Key._fields_[1][0]
             for k, v in self.items():
                 bucket = getattr(k, f1)
+                if bucket_fn:
+                    bucket = bucket_fn(bucket)
                 vals = tmp[bucket] = tmp.get(bucket, [0] * 65)
                 slot = getattr(k, f2)
                 vals[slot] = v.value
diff --git a/tools/funclatency.py b/tools/funclatency.py
index f5618e0..4e6407c 100755
--- a/tools/funclatency.py
+++ b/tools/funclatency.py
@@ -1,17 +1,16 @@
 #!/usr/bin/python
 # @lint-avoid-python-3-compatibility-imports
 #
-# funclatency   Time kernel funcitons and print latency as a histogram.
+# funclatency   Time functions and print latency as a histogram.
 #               For Linux, uses BCC, eBPF.
 #
-# USAGE: funclatency [-h] [-p PID] [-i INTERVAL] [-T] [-u] [-m] [-r] pattern
+# USAGE: funclatency [-h] [-p PID] [-i INTERVAL] [-T] [-u] [-m] [-F] [-r] [-v]
+#                    pattern
 #
 # Run "funclatency -h" for full usage.
 #
 # The pattern is a string with optional '*' wildcards, similar to file globbing.
-# If you'd prefer to use regular expressions, use the -r option. Matching
-# multiple functions is of limited use, since the output has one histogram for
-# everything. Future versions should split the output histogram by the function.
+# If you'd prefer to use regular expressions, use the -r option.
 #
 # Currently nested or recursive functions are not supported properly, and
 # timestamps will be overwritten, creating dubious output. Try to match single
@@ -21,7 +20,8 @@
 # Copyright (c) 2015 Brendan Gregg.
 # Licensed under the Apache License, Version 2.0 (the "License")
 #
-# 20-Sep-2015   Brendan Gregg   Created this.
+# 20-Sep-2015   Brendan Gregg       Created this.
+# 06-Oct-2016   Sasha Goldshtein    Added user function support.
 
 from __future__ import print_function
 from bcc import BPF
@@ -31,16 +31,18 @@
 
 # arguments
 examples = """examples:
-    ./funclatency do_sys_open       # time the do_sys_open() kenel function
+    ./funclatency do_sys_open       # time the do_sys_open() kernel function
+    ./funclatency c:read            # time the read() C library function
     ./funclatency -u vfs_read       # time vfs_read(), in microseconds
     ./funclatency -m do_nanosleep   # time do_nanosleep(), in milliseconds
     ./funclatency -mTi 5 vfs_read   # output every 5 seconds, with timestamps
     ./funclatency -p 181 vfs_read   # time process 181 only
     ./funclatency 'vfs_fstat*'      # time both vfs_fstat() and vfs_fstatat()
+    ./funclatency 'c:*printf'       # time the *printf family of functions
     ./funclatency -F 'vfs_r*'       # show one histogram per matched function
 """
 parser = argparse.ArgumentParser(
-    description="Time kernel funcitons and print latency as a histogram",
+    description="Time functions and print latency as a histogram",
     formatter_class=argparse.RawDescriptionHelpFormatter,
     epilog=examples)
 parser.add_argument("-p", "--pid",
@@ -57,31 +59,57 @@
     help="show a separate histogram per function")
 parser.add_argument("-r", "--regexp", action="store_true",
     help="use regular expressions. Default is \"*\" wildcards only.")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="print the BPF program (for debugging purposes)")
 parser.add_argument("pattern",
-    help="search expression for kernel functions")
+    help="search expression for functions")
 args = parser.parse_args()
-pattern = args.pattern
+
+def bail(error):
+    print("Error: " + error)
+    exit(1)
+
+parts = args.pattern.split(':')
+if len(parts) == 1:
+    library = None
+    pattern = args.pattern
+elif len(parts) == 2:
+    library = parts[0]
+    libpath = BPF.find_library(library) or BPF.find_exe(library)
+    if not libpath:
+        bail("can't resolve library %s" % library)
+    library = libpath
+    pattern = parts[1]
+else:
+    bail("unrecognized pattern format '%s'" % pattern)
+
 if not args.regexp:
     pattern = pattern.replace('*', '.*')
     pattern = '^' + pattern + '$'
-debug = 0
 
 # define BPF program
 bpf_text = """
 #include <uapi/linux/ptrace.h>
 #include <linux/blkdev.h>
 
-typedef struct ip_key {
+typedef struct ip_pid {
     u64 ip;
+    u64 pid;
+} ip_pid_t;
+
+typedef struct hist_key {
+    ip_pid_t key;
     u64 slot;
-} ip_key_t;
+} hist_key_t;
 
 BPF_HASH(start, u32);
 STORAGE
 
 int trace_func_entry(struct pt_regs *ctx)
 {
-    u32 pid = bpf_get_current_pid_tgid();
+    u64 pid_tgid = bpf_get_current_pid_tgid();
+    u32 pid = pid_tgid;
+    u32 tgid = pid_tgid >> 32;
     u64 ts = bpf_ktime_get_ns();
 
     FILTER
@@ -94,7 +122,9 @@
 int trace_func_return(struct pt_regs *ctx)
 {
     u64 *tsp, delta;
-    u32 pid = bpf_get_current_pid_tgid();
+    u64 pid_tgid = bpf_get_current_pid_tgid();
+    u32 pid = pid_tgid;
+    u32 tgid = pid_tgid >> 32;
 
     // calculate delta time
     tsp = start.lookup(&pid);
@@ -112,10 +142,13 @@
 }
 """
 
+# do we need to store the IP and pid for each invocation?
+need_key = args.function or (library and not args.pid)
+
 # code substitutions
 if args.pid:
     bpf_text = bpf_text.replace('FILTER',
-        'if (pid != %s) { return 0; }' % args.pid)
+        'if (tgid != %s) { return 0; }' % args.pid)
 else:
     bpf_text = bpf_text.replace('FILTER', '')
 if args.milliseconds:
@@ -127,22 +160,32 @@
 else:
     bpf_text = bpf_text.replace('FACTOR', '')
     label = "nsecs"
-if args.function:
+if need_key:
     bpf_text = bpf_text.replace('STORAGE', 'BPF_HASH(ipaddr, u32);\n' +
-        'BPF_HISTOGRAM(dist, ip_key_t);')
+        'BPF_HISTOGRAM(dist, hist_key_t);')
     # stash the IP on entry, as on return it's kretprobe_trampoline:
     bpf_text = bpf_text.replace('ENTRYSTORE',
         'u64 ip = PT_REGS_IP(ctx); ipaddr.update(&pid, &ip);')
+    pid = '-1' if not library else 'tgid'
     bpf_text = bpf_text.replace('STORE',
-        'u64 ip, *ipp = ipaddr.lookup(&pid); if (ipp) { ip = *ipp; ' +
-        'dist.increment((ip_key_t){ip, bpf_log2l(delta)}); ' +
-        'ipaddr.delete(&pid); }')
+        """
+    u64 ip, *ipp = ipaddr.lookup(&pid);
+    if (ipp) {
+        ip = *ipp;
+        hist_key_t key;
+        key.key.ip = ip;
+        key.key.pid = %s;
+        key.slot = bpf_log2l(delta);
+        dist.increment(key);
+        ipaddr.delete(&pid);
+    }
+        """ % pid)
 else:
     bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
     bpf_text = bpf_text.replace('ENTRYSTORE', '')
     bpf_text = bpf_text.replace('STORE',
         'dist.increment(bpf_log2l(delta));')
-if debug:
+if args.verbose:
     print(bpf_text)
 
 # signal handler
@@ -151,9 +194,17 @@
 
 # load BPF program
 b = BPF(text=bpf_text)
-b.attach_kprobe(event_re=pattern, fn_name="trace_func_entry")
-b.attach_kretprobe(event_re=pattern, fn_name="trace_func_return")
-matched = b.num_open_kprobes()
+
+# attach probes
+if not library:
+    b.attach_kprobe(event_re=pattern, fn_name="trace_func_entry")
+    b.attach_kretprobe(event_re=pattern, fn_name="trace_func_return")
+    matched = b.num_open_kprobes()
+else:
+    b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry")
+    b.attach_uretprobe(name=library, sym_re=pattern, fn_name="trace_func_return")
+    matched = b.num_open_uprobes()
+
 if matched == 0:
     print("0 functions matched by \"%s\". Exiting." % args.pattern)
     exit()
@@ -163,6 +214,12 @@
     (matched / 2, args.pattern))
 
 # output
+def print_section(key):
+    if not library:
+        return BPF.sym(key[0], -1)
+    else:
+        return "%s [%d]" % (BPF.sym(key[0], key[1]), key[1])
+
 exiting = 0 if args.interval else 1
 dist = b.get_table("dist")
 while (1):
@@ -177,8 +234,9 @@
     if args.timestamp:
         print("%-8s\n" % strftime("%H:%M:%S"), end="")
 
-    if args.function:
-        dist.print_log2_hist(label, "Function", BPF.ksym)
+    if need_key:
+        dist.print_log2_hist(label, "Function", section_print_fn=print_section,
+            bucket_fn=lambda k: (k.ip, k.pid))
     else:
         dist.print_log2_hist(label)
     dist.clear()
diff --git a/tools/funclatency_example.txt b/tools/funclatency_example.txt
index 2db8d2d..a4c6f88 100644
--- a/tools/funclatency_example.txt
+++ b/tools/funclatency_example.txt
@@ -47,6 +47,98 @@
 efficient way to time kernel functions and examine their latency distribution.
 
 
+Now trace a user function, pthread_mutex_lock in libpthread, to determine if
+there is considerable lock contention:
+
+# ./funclatency pthread:pthread_mutex_lock -p $(pidof contentions)
+Tracing 1 function for "pthread:pthread_mutex_lock"... Hit Ctrl-C to end.
+     nsecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 0        |                                        |
+       128 -> 255        : 0        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 0        |                                        |
+      1024 -> 2047       : 0        |                                        |
+      2048 -> 4095       : 508967   |****************************************|
+      4096 -> 8191       : 70072    |*****                                   |
+      8192 -> 16383      : 27686    |**                                      |
+     16384 -> 32767      : 5075     |                                        |
+     32768 -> 65535      : 2318     |                                        |
+     65536 -> 131071     : 581      |                                        |
+    131072 -> 262143     : 38       |                                        |
+    262144 -> 524287     : 5        |                                        |
+    524288 -> 1048575    : 1        |                                        |
+   1048576 -> 2097151    : 9        |                                        |
+Detaching...
+
+It seems that most calls to pthread_mutex_lock completed rather quickly (in 
+under 4us), but there were some cases of considerable contention, sometimes
+over a full millisecond.
+
+
+Run a quick-and-dirty profiler over all the functions in an executable:
+# ./funclatency /home/user/primes:* -p $(pidof primes) -F
+Tracing 15 functions for "/home/user/primes:*"... Hit Ctrl-C to end.
+^C
+
+Function = is_prime [6556]
+     nsecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 0        |                                        |
+       128 -> 255        : 0        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 0        |                                        |
+      1024 -> 2047       : 0        |                                        |
+      2048 -> 4095       : 1495322  |****************************************|
+      4096 -> 8191       : 95744    |**                                      |
+      8192 -> 16383      : 9926     |                                        |
+     16384 -> 32767      : 3070     |                                        |
+     32768 -> 65535      : 1415     |                                        |
+     65536 -> 131071     : 112      |                                        |
+    131072 -> 262143     : 9        |                                        |
+    262144 -> 524287     : 3        |                                        |
+    524288 -> 1048575    : 0        |                                        |
+   1048576 -> 2097151    : 8        |                                        |
+
+Function = insert_result [6556]
+     nsecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 0        |                                        |
+       128 -> 255        : 0        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 0        |                                        |
+      1024 -> 2047       : 0        |                                        |
+      2048 -> 4095       : 111047   |****************************************|
+      4096 -> 8191       : 3998     |*                                       |
+      8192 -> 16383      : 720      |                                        |
+     16384 -> 32767      : 238      |                                        |
+     32768 -> 65535      : 106      |                                        |
+     65536 -> 131071     : 5        |                                        |
+    131072 -> 262143     : 4        |                                        |
+Detaching...
+
+From the results, we can see that the is_prime function has something resembling
+an exponential distribution -- very few primes take a very long time to test,
+while most numbers are verified as prime or composite in less than 4us. The
+insert_result function exhibits a similar phenomenon, likely due to contention
+over a shared results container.
+
+
 Now vfs_read() is traced, and a microseconds histogram printed:
 
 # ./funclatency -u vfs_read
@@ -239,13 +331,13 @@
 USAGE message:
 
 # ./funclatency -h
-usage: funclatency [-h] [-p PID] [-i INTERVAL] [-T] [-u] [-m] [-F] [-r]
+usage: funclatency [-h] [-p PID] [-i INTERVAL] [-T] [-u] [-m] [-F] [-r] [-v]
                    pattern
 
-Time kernel functions and print latency as a histogram
+Time functions and print latency as a histogram
 
 positional arguments:
-  pattern               search expression for kernel functions
+  pattern               search expression for functions
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -258,12 +350,15 @@
   -F, --function        show a separate histogram per function
   -r, --regexp          use regular expressions. Default is "*" wildcards
                         only.
+  -v, --verbose         print the BPF program (for debugging purposes)
 
 examples:
     ./funclatency do_sys_open       # time the do_sys_open() kernel function
+    ./funclatency c:read            # time the read() C library function
     ./funclatency -u vfs_read       # time vfs_read(), in microseconds
     ./funclatency -m do_nanosleep   # time do_nanosleep(), in milliseconds
     ./funclatency -mTi 5 vfs_read   # output every 5 seconds, with timestamps
     ./funclatency -p 181 vfs_read   # time process 181 only
     ./funclatency 'vfs_fstat*'      # time both vfs_fstat() and vfs_fstatat()
+    ./funclatency 'c:*printf'       # time the *printf family of functions
     ./funclatency -F 'vfs_r*'       # show one histogram per matched function