argdist, trace, and tplist support for USDT probes

These tools now support USDT probes with the 'u:library:probe' syntax.
Probes in a library or process can be listed with 'tplist -l LIB' or 'tplist -p PID'.
Probe arguments are also parsed and available in both argdist and trace as arg1,
arg2, etc., regardless of the probe attach location.
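
As a rough illustration of the argument parsing, the sketch below decodes a few
of the argument descriptors that appear in the probe notes (for example -4@$0,
8@%rdi, 8@-8(%rbp)). The helper name parse_usdt_arg and the returned dict layout
are hypothetical and only meant to show the idea; the parsing in this commit is
done by USDTProbeLocation._parse_arg and covers more cases.

```python
import re

# Hypothetical helper, not part of bcc: decodes one USDT argument
# descriptor of the form [-]SIZE@OPERAND into a plain dict.
_ARG = re.compile(r'^(-?)(\d+)@(?:\$(-?\d+)|(-?\d*)\((%\w+)\)|(%\w+))$')

def parse_usdt_arg(spec):
    m = _ARG.match(spec)
    if m is None:
        raise ValueError("unrecognized argument format: '%s'" % spec)
    signed, size, const, offset, base, reg = m.groups()
    arg = {"size": int(size), "signed": signed == "-"}
    if const is not None:
        # Immediate operand, e.g. -4@$0
        arg["constant"] = int(const)
    elif base is not None:
        # Memory operand, e.g. 8@-8(%rbp); a missing offset means 0
        arg["register"] = base
        arg["deref_offset"] = int(offset) if offset not in ("", "-") else 0
    else:
        # Plain register operand, e.g. 8@%rdi
        arg["register"] = reg
    return arg
```

A leading '-' on the size marks the argument as signed; the operand after '@'
says whether the value is a constant, lives in a register, or must be read from
memory relative to a register.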

The same USDT probe can appear at multiple locations, so the attach
infrastructure must probe every one of them. argdist and trace register a thunk
probe at each location; each thunk calls a central, static inline probe function
and passes it the location id (__loc_id). The central function uses the location
id to decide how to retrieve the arguments, because this is location-dependent.
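
The thunk scheme can be sketched as follows. This is a simplified stand-in for
USDTProbe.generate_usdt_thunks, assuming a central probe function named after
name_prefix that takes (ctx, __loc_id); the function name generate_thunks and
the "probe_open" prefix used in the usage line are made up for illustration.

```python
def generate_thunks(name_prefix, num_locations):
    """Return (C source text, thunk names), one thunk per probe location."""
    names, text = [], ""
    for loc_id in range(num_locations):
        thunk = "%s_thunk_%d" % (name_prefix, loc_id)
        names.append(thunk)
        # Each thunk only forwards to the central function with its own id.
        text += ("int %s(struct pt_regs *ctx) {\n"
                 "        return %s(ctx, %d);\n"
                 "}\n") % (thunk, name_prefix, loc_id)
    return text, names

# Hypothetical usage: a probe with two locations.
source, thunk_names = generate_thunks("probe_open", 2)
```

Each generated thunk is then attached as a uprobe at its own location, so the
per-location argument retrieval lives in a single place, the central function.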

Finally, some USDT probes must be enabled first by incrementing a counter at a
memory location in the target process (this is called a "semaphore"). The value
is per-process, so we require a process id for this kind of probe.
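
A minimal sketch of the semaphore update, assuming the counter is an unsigned
16-bit value as in USDTProbe._add_to_semaphore. To keep the example runnable
anywhere, an in-memory buffer stands in for /proc/PID/mem; the function name
add_to_semaphore and the address 8 are illustrative only.

```python
import io
import struct

def add_to_semaphore(mem, sema_addr, delta):
    """Add delta to the unsigned 16-bit counter at sema_addr in mem."""
    mem.seek(sema_addr)
    (prev,) = struct.unpack("H", mem.read(2))
    mem.seek(sema_addr)
    mem.write(struct.pack("H", prev + delta))

# Stand-in for open("/proc/%d/mem" % pid, "r+b").
fake_mem = io.BytesIO(b"\x00" * 16)
add_to_semaphore(fake_mem, 8, +1)   # first tracer enables the probe
add_to_semaphore(fake_mem, 8, +1)   # second tracer enables it too
add_to_semaphore(fake_mem, 8, -1)   # one tracer detaches
```

Treating the semaphore as a counter rather than a flag lets several tracers
enable the same probe without one of them disabling it for the others.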

Along with trace and argdist tool support, this commit also introduces new classes
in the bcc module: ProcStat handles pid-wrap detection, whereas USDTReader,
USDTProbe, USDTProbeLocation, and USDTArgument are the shared USDT-related
infrastructure that enables enumeration, attachment, and argument retrieval for
USDT probes.
diff --git a/README.md b/README.md
index 244f33c..052d10e 100644
--- a/README.md
+++ b/README.md
@@ -109,7 +109,7 @@
 - tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
 - tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).
 - tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
-- tools/[tplist](tools/tplist.py): Display kernel tracepoints and their format.
+- tools/[tplist](tools/tplist.py): Display kernel tracepoints or USDT probes and their formats. [Examples](tools/tplist_example.txt).
 - tools/[trace](tools/trace.py): Trace arbitrary functions, with filters. [Examples](tools/trace_example.txt)
 - tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
 - tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).
diff --git a/man/man8/argdist.8 b/man/man8/argdist.8
index 60a970b..bf6a293 100644
--- a/man/man8/argdist.8
+++ b/man/man8/argdist.8
@@ -50,11 +50,11 @@
 .SH SPECIFIER SYNTAX
 The general specifier syntax is as follows:
 
-.B {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+.B {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 .TP
-.B {p,r,t}
+.B {p,r,t,u}
 Probe type \- "p" for function entry, "r" for function return, "t" for kernel
-tracepoint; \-H for histogram collection, \-C for frequency count.
+tracepoint, "u" for USDT probe; \-H for histogram collection, \-C for frequency count.
 Indicates where to place the probe and whether the probe should collect frequency
 count information, or aggregate the collected values into a histogram. Counting 
 probes will collect the number of times every parameter value was observed,
@@ -78,7 +78,9 @@
 based on that signature. For example, if you only want to collect the first
 parameter, you don't have to specify the rest of the parameters in the signature.
 When capturing kernel tracepoints, this should be the name of the event, e.g.
-net_dev_start_xmit. The signature for kernel tracepoints should be empty.
+net_dev_start_xmit. The signature for kernel tracepoints should be empty. When
+capturing USDT probes, this should be the name of the probe, e.g. reloc_complete.
+The signature for USDT probes should be empty.
 .TP
 .B [type[,type...]]
 The type(s) of the expression(s) to capture.
@@ -94,6 +96,8 @@
 to the tracepoint format (which you can obtain using tplist). For example, the
 block:block_rq_complete tracepoint can access tp.nr_sector. You may also use the
 members of the "tp" struct directly, e.g. "nr_sector" instead of "tp.nr_sector".
+USDT probes may access the arguments defined by the tracing program in the 
+special arg1, arg2, ... variables. To obtain their types, use the tplist tool.
 Return probes can use the argument values received by the
 function when it was entered, through the $entry(paramname) special variable.
 Return probes can also access the function's return value in $retval, and the
@@ -154,6 +158,10 @@
 #
 .B argdist -C 't:irq:irq_handler_entry():int:irq'
 .TP
+Print the functions used as thread entry points and how common they are:
+#
+.B argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
+.TP
 Print histograms of sleep() and nanosleep() parameter values:
 #
 .B argdist -H 'p:c:sleep(u32 seconds):u32:seconds' 'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
diff --git a/man/man8/tplist.8 b/man/man8/tplist.8
index 53f5f4a..474b6ad 100644
--- a/man/man8/tplist.8
+++ b/man/man8/tplist.8
@@ -1,23 +1,33 @@
 .TH tplist 8  "2016-03-20" "USER COMMANDS"
 .SH NAME
-tplist \- Display kernel tracepoints and their format.
+tplist \- Display kernel tracepoints or USDT probes and their formats.
 .SH SYNOPSIS
-.B tplist [-v] [tracepoint]
+.B tplist [-p PID] [-l LIB] [-v] [filter]
 .SH DESCRIPTION
 tplist lists all kernel tracepoints, and can optionally print out the tracepoint
-format; namely, the variables that you can trace when the tracepoint is hit. This
-is usually used in conjunction with the argdist and/or trace tools.
+format; namely, the variables that you can trace when the tracepoint is hit. 
+tplist can also list USDT probes embedded in a specific library or executable,
+and can list USDT probes for all the libraries loaded by a specific process.
+These features are usually used in conjunction with the argdist and/or trace tools.
 
 On a typical system, accessing the tracepoint list and format requires root.
+However, accessing USDT probes does not require root.
 .SH OPTIONS
 .TP
-\-v
-Display the variables associated with the tracepoint or tracepoints.
+\-p PID
+Display the USDT probes from all the libraries loaded by the specified process.
 .TP
-[tracepoint]
-A wildcard expression that specifies which tracepoints to print. For example,
-block:* will print all block tracepoints (block:block_rq_complete, etc.).
-Regular expressions are not supported.
+\-l LIB
+Display the USDT probes from the specified library or executable. If the library
+or executable can be found in the standard paths, a full path is not required.
+.TP
+\-v
+Display the variables associated with the tracepoint or USDT probe.
+.TP
+[filter]
+A wildcard expression that specifies which tracepoints or probes to print.
+For example, block:* will print all block tracepoints (block:block_rq_complete,
+etc.). Regular expressions are not supported.
 .SH EXAMPLES
 .TP
 Print all kernel tracepoints:
@@ -27,6 +37,14 @@
 Print all net tracepoints with their format:
 #
 .B tplist -v 'net:*'
+.TP
+Print all USDT probes in libpthread:
+$ 
+.B tplist -l pthread
+.TP
+Print all USDT probes in process 4717 from the libc provider:
+$
+.B tplist -p 4717 'libc:*'
 .SH SOURCE
 This is from bcc.
 .IP
diff --git a/man/man8/trace.8 b/man/man8/trace.8
index c4e3546..4c70bf6 100644
--- a/man/man8/trace.8
+++ b/man/man8/trace.8
@@ -46,11 +46,11 @@
 
 .B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
 
-.B t:category:event [(predicate)] ["format string"[, arguments]]
+.B {t:category:event,u:library:probe} [(predicate)] ["format string"[, arguments]]
 .TP
-.B {[{p,r}],t}
+.B {[{p,r}],t,u}
 Probe type \- "p" for function entry, "r" for function return, "t" for kernel
-tracepoint. The default probe type is "p".
+tracepoint, "u" for USDT probe. The default probe type is "p".
 .TP
 .B [library]
 Library containing the probe.
@@ -69,6 +69,9 @@
 .B event
 The tracepoint event. For example, "block_rq_complete".
 .TP
+.B probe
+The USDT probe name. For example, "pthread_create".
+.TP
 .B [(predicate)]
 The filter applied to the captured data. Only if the filter evaluates as true,
 the trace message will be printed. The filter can use any valid C expression
@@ -96,6 +99,9 @@
 also use the members of the "tp" struct directly, e.g "nr_sector" instead of
 "tp.nr_sector".
 
+In USDT probes, the arg1, ..., argN variables refer to the probe's arguments.
+To determine which arguments your probe has, use the tplist tool.
+
 The predicate expression and the format specifier replacements for printing
 may also use the following special keywords: $pid, $tgid to refer to the 
 current process' pid and tgid; $uid, $gid to refer to the current user's
@@ -121,6 +127,10 @@
 Trace the block:block_rq_complete tracepoint and print the number of sectors completed:
 #
 .B trace 't:block:block_rq_complete """%d sectors"", nr_sector'
+.TP
+Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
+#
+.B trace 'u:pthread:pthread_create """start addr = %llx"", arg3'
 .SH SOURCE
 This is from bcc.
 .IP
diff --git a/src/python/bcc/__init__.py b/src/python/bcc/__init__.py
index 6671f8a..28d61b7 100644
--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -26,6 +26,7 @@
 basestring = (unicode if sys.version_info[0] < 3 else str)
 
 from .libbcc import lib, _CB_TYPE
+from .procstat import ProcStat
 from .table import Table
 from .tracepoint import Perf, Tracepoint
 from .usyms import ProcessSymbols
@@ -341,7 +342,7 @@
                 desc.encode("ascii"), pid, cpu, group_fd,
                 self._reader_cb_impl, ct.cast(id(self), ct.py_object))
         res = ct.cast(res, ct.c_void_p)
-        if res.value is None:
+        if res == None:
             raise Exception("Failed to attach BPF to kprobe")
         open_kprobes[ev_name] = res
         return self
@@ -389,7 +390,7 @@
                 desc.encode("ascii"), pid, cpu, group_fd,
                 self._reader_cb_impl, ct.cast(id(self), ct.py_object))
         res = ct.cast(res, ct.c_void_p)
-        if res.value is None:
+        if res == None:
             raise Exception("Failed to attach BPF to kprobe")
         open_kprobes[ev_name] = res
         return self
@@ -513,7 +514,7 @@
                 desc.encode("ascii"), pid, cpu, group_fd,
                 self._reader_cb_impl, ct.cast(id(self), ct.py_object))
         res = ct.cast(res, ct.c_void_p)
-        if res.value is None:
+        if res == None:
             raise Exception("Failed to attach BPF to uprobe")
         open_uprobes[ev_name] = res
         return self
@@ -557,7 +558,7 @@
                 desc.encode("ascii"), pid, cpu, group_fd,
                 self._reader_cb_impl, ct.cast(id(self), ct.py_object))
         res = ct.cast(res, ct.c_void_p)
-        if res.value is None:
+        if res == None:
             raise Exception("Failed to attach BPF to uprobe")
         open_uprobes[ev_name] = res
         return self
@@ -793,3 +794,5 @@
         except KeyboardInterrupt:
             exit()
 
+from .usdt import USDTReader
+
diff --git a/src/python/bcc/procstat.py b/src/python/bcc/procstat.py
new file mode 100644
index 0000000..06cb677
--- /dev/null
+++ b/src/python/bcc/procstat.py
@@ -0,0 +1,33 @@
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+class ProcStat(object):
+        def __init__(self, pid):
+                self.pid = pid
+                self.exe = self._get_exe()
+                self.start_time = self._get_start_time()
+
+        def is_stale(self):
+                return self.exe != self._get_exe() or \
+                       self.start_time != self._get_start_time()
+
+        def _get_exe(self):
+                return os.popen("readlink -f /proc/%d/exe" % self.pid).read()
+
+        def _get_start_time(self):
+                return os.popen("cut -d' ' -f 22 /proc/%d/stat" %
+                                self.pid).read()
+
diff --git a/src/python/bcc/usdt.py b/src/python/bcc/usdt.py
new file mode 100644
index 0000000..9a1556a
--- /dev/null
+++ b/src/python/bcc/usdt.py
@@ -0,0 +1,433 @@
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import struct
+import re
+
+from . import BPF
+from . import ProcStat
+
+class USDTArgument(object):
+        def __init__(self, size, is_signed, register=None, constant=None,
+                     deref_offset=None, deref_name=None):
+                self.size = size
+                self.is_signed = is_signed
+                self.register = register
+                self.constant = constant
+                self.deref_offset = deref_offset
+                self.deref_name = deref_name
+
+        def _normalize_register(self):
+                normalized = self.register
+                if normalized is None:
+                        return None
+                if normalized.startswith('%'):
+                        normalized = normalized[1:]
+                if normalized in USDTArgument.translations:
+                        normalized = USDTArgument.translations[normalized]
+                return normalized
+
+        translations = {
+                "rax": "ax", "rbx": "bx", "rcx": "cx", "rdx": "dx",
+                "rdi": "di", "rsi": "si", "rbp": "bp", "rsp": "sp",
+                "rip": "ip", "eax": "ax", "ebx": "bx", "ecx": "cx",
+                "edx": "dx", "edi": "di", "esi": "si", "ebp": "bp",
+                "esp": "sp", "eip": "ip", "al": "ax", "bl": "bx",
+                "cl": "cx", "dl": "dx"
+                        }
+
+        def generate_assign_to_local(self, local_name):
+                """
+                generate_assign_to_local(local_name)
+
+                Generates an assignment statement that initializes a local
+                variable with the value of this argument. Assumes that the
+                struct pt_regs pointer is called 'ctx', and accesses registers
+                from that pointer. The local variable must already be declared
+                by the caller. Use get_type() to get the proper type for that
+                declaration.
+
+                Example output:
+                        local1 = (u64)ctx->di;
+                        {
+                                u64 __tmp;
+                                bpf_probe_read(&__tmp, sizeof(__tmp),
+                                               (void *)(ctx->bp - 8));
+                                bpf_probe_read(&local2, sizeof(local2),
+                                               (void *)__tmp);
+                        }
+                """
+                normalized_reg = self._normalize_register()
+                if self.constant is not None:
+                        # Simplest case, it's just a constant
+                        return "%s = %d;" % (local_name, self.constant)
+                if self.deref_offset is None:
+                        # Simple read from the specified register
+                        return "%s = (%s)ctx->%s;" % \
+                                (local_name, self.get_type(), normalized_reg)
+                        # Note that the cast to a smaller type should grab the
+                        # relevant part of the register anyway, if we're dealing
+                        # with 32/16/8-bit registers like ecx, dx, al, etc.
+
+                if self.deref_offset is not None and self.deref_name is None:
+                        # Add deref_offset to register value and bpf_probe_read
+                        # from the resulting address
+                        return \
+"""{
+        u64 __temp = ctx->%s + (%d);
+        bpf_probe_read(&%s, sizeof(%s), (void *)__temp);
+}                       """ % (normalized_reg, self.deref_offset,
+                               local_name, local_name)
+
+                # Final case: dereference global, need to find address of global
+                # with the provided name and then potentially add deref_offset
+                # and bpf_probe_read the result. None of this will work with BPF
+                # because we can't just access arbitrary addresses.
+                return "%s = 0;      /* UNSUPPORTED CASE, SEE SOURCE */" % \
+                        local_name
+
+        def get_type(self):
+                result_type = None
+                if self.size == 1:
+                        result_type = "char"
+                elif self.size == 2:
+                        result_type = "short"
+                elif self.size == 4:
+                        result_type = "int"
+                elif self.size == 8:
+                        result_type = "long"
+
+                if result_type is None:
+                        raise ValueError("arguments of size %d are not "
+                                         "currently supported" % self.size)
+
+                if not self.is_signed:
+                        result_type = "unsigned " + result_type
+
+                return result_type
+
+        def __str__(self):
+                prefix = "%d %s bytes @ " % (self.size,
+                        "  signed" if self.is_signed else "unsigned")
+                if self.constant is not None:
+                        return prefix + "constant %d" % self.constant
+                if self.deref_offset is None:
+                        return prefix + "register " + self.register
+                if self.deref_offset is not None and self.deref_name is None:
+                        return prefix + "%d(%s)" % (self.deref_offset,
+                                                    self.register)
+                return prefix + "%d from %s global" % (self.deref_offset,
+                                                       self.deref_name)
+
+class USDTProbeLocation(object):
+        def __init__(self, address, args):
+                self.address = address
+                self.raw_args = args
+                self.args = []
+                self._parse_args()
+
+        def generate_usdt_assignments(self, prefix="arg"):
+                text = ""
+                for i, arg in enumerate(self.args, 1):
+                        text += (" "*16) + \
+                                arg.generate_assign_to_local(
+                                                "%s%d" % (prefix, i)) + "\n"
+                return text
+
+        def _parse_args(self):
+                for arg in self.raw_args.split():
+                        self._parse_arg(arg.strip())
+
+        def _parse_arg(self, arg):
+                qregs = ["%rax", "%rbx", "%rcx", "%rdx", "%rdi", "%rsi",
+                         "%rbp", "%rsp", "%rip", "%r8", "%r9", "%r10", "%r11",
+                         "%r12", "%r13", "%r14", "%r15"]
+                dregs = ["%eax", "%ebx", "%ecx", "%edx", "%edi", "%esi",
+                         "%ebp", "%esp", "%eip"]
+                wregs = ["%ax",  "%bx",  "%cx",  "%dx",  "%di",  "%si",
+                         "%bp",  "%sp",  "%ip"]
+                bregs = ["%al", "%bl", "%cl", "%dl"]
+
+                any_reg = "(" + "|".join(qregs + dregs + wregs + bregs) + ")"
+
+                # -4@$0, 8@$1234
+                m = re.match(r'(\-?)(\d+)@\$(\d+)', arg)
+                if m is not None:
+                        self.args.append(USDTArgument(
+                                int(m.group(2)),
+                                m.group(1) == '-',
+                                constant=int(m.group(3))
+                                ))
+                        return
+
+                # %rdi, %rax, %rsi
+                m = re.match(any_reg, arg)
+                if m is not None:
+                        if arg in qregs:
+                                size = 8
+                        elif arg in dregs:
+                                size = 4
+                        elif arg in wregs:
+                                size = 2
+                        elif arg in bregs:
+                                size = 1
+                        self.args.append(USDTArgument(
+                                size, False, register=arg
+                                ))
+                        return
+
+                # -8@%rbx, 4@%r12
+                m = re.match(r'(\-?)(\d+)@' + any_reg, arg)
+                if m is not None:
+                        self.args.append(USDTArgument(
+                                int(m.group(2)),       # Size (in bytes)
+                                m.group(1) == '-',     # Signed
+                                register=m.group(3)
+                                ))
+                        return
+
+                # 8@-8(%rbp), 4@(%rax)
+                m = re.match(r'(\-?)(\d+)@(\-?)(\d*)\(' + any_reg + r'\)', arg)
+                if m is not None:
+                        deref_offset = int(m.group(4) or 0)
+                        if m.group(3) == '-':
+                                deref_offset = -deref_offset
+                        self.args.append(USDTArgument(
+                                int(m.group(2)), m.group(1) == '-',
+                                register=m.group(5), deref_offset=deref_offset
+                                ))
+                        return
+
+                # -4@global_max_action(%rip)
+                m = re.match(r'(\-?)(\d+)@(\w+)\(%rip\)', arg)
+                if m is not None:
+                        self.args.append(USDTArgument(
+                                int(m.group(2)), m.group(1) == '-',
+                                register="%rip", deref_name=m.group(3),
+                                deref_offset=0
+                                ))
+                        return
+
+                # 8@24+mp_(%rip)
+                m = re.match(r'(\-?)(\d+)@(\-?)(\d+)\+(\w+)\(%rip\)', arg)
+                if m is not None:
+                        deref_offset = int(m.group(4))
+                        if m.group(3) == '-':
+                                deref_offset = -deref_offset
+                        self.args.append(USDTArgument(
+                                int(m.group(2)), m.group(1) == '-',
+                                register="%rip", deref_offset=deref_offset,
+                                deref_name=m.group(5)
+                                ))
+                        return
+
+                raise ValueError("unrecognized argument format: '%s'" % arg)
+
+
+class USDTProbe(object):
+        def __init__(self, bin_path, provider, name, semaphore):
+                self.bin_path = bin_path
+                self.provider = provider
+                self.name = name
+                self.semaphore = semaphore
+                self.enabled_procs = {}
+                self.proc_semas = {}
+                self.locations = []
+
+        def add_location(self, location, arguments):
+                self.locations.append(USDTProbeLocation(location, arguments))
+
+        def need_enable(self):
+                """
+                Returns whether this probe needs to be enabled in each
+                process that uses it. Probes that must be enabled can't be
+                traced without specifying a specific pid.
+                """
+                return self.semaphore != 0
+
+        def enable(self, pid):
+                """Enables this probe in the specified process."""
+                self._add_to_semaphore(pid, +1)
+                self.enabled_procs[pid] = ProcStat(pid)
+
+        def disable(self, pid):
+                """Disables the probe in the specified process."""
+                if pid not in self.enabled_procs:
+                        raise ValueError("probe wasn't enabled in this process")
+                # Because of the possibility of pid wrap, it's extremely
+                # important to verify that we are still dealing with the same
+                # process. Otherwise, we are overwriting random memory in some
+                # other process :-)
+                if not self.enabled_procs[pid].is_stale():
+                        self._add_to_semaphore(pid, -1)
+                del(self.enabled_procs[pid])
+
+        def get_arg_types(self):
+                """
+                Returns the argument types used by this probe. Different probe
+                locations might use different argument types, e.g. signed i32
+                vs. unsigned i64. We should take the largest type, and the
+                sign really doesn't matter that much.
+                """
+                arg_types = []
+                for i in range(len(self.locations[0].args)):
+                        max_size_loc = max(self.locations, key=lambda loc:
+                                loc.args[i].size)
+                        arg_types.append(max_size_loc.args[i].get_type())
+                return arg_types
+
+        def generate_usdt_thunks(self, name_prefix, thunk_names):
+                text = ""
+                for i in range(len(self.locations)):
+                        thunk_name = "%s_thunk_%d" % (name_prefix, i)
+                        thunk_names.append(thunk_name)
+                        text += """
+int %s(struct pt_regs *ctx) {
+        return %s(ctx, %d);
+}                       """ % (thunk_name, name_prefix, i)
+                return text
+
+        def generate_usdt_cases(self):
+                text = ""
+                for i, arg_type in enumerate(self.get_arg_types(), 1):
+                        text += "        %s arg%d = 0;\n" % (arg_type, i)
+                for i, location in enumerate(self.locations):
+                        assignments = location.generate_usdt_assignments()
+                        text += \
+"""
+        if (__loc_id == %d) {
+%s
+        }               \n""" % (i, assignments)
+                return text
+
+        def _ensure_proc_sema(self, pid):
+                if pid in self.proc_semas:
+                        return self.proc_semas[pid]
+
+                if self.bin_path.endswith(".so"):
+                        # Semaphores declared in shared objects are relative
+                        # to that shared object's load address
+                        with open("/proc/%d/maps" % pid) as m:
+                                maps = m.readlines()
+                        # Build a list (not a lazy map) so len() also works on Python 3
+                        addrs = [l.split('-')[0] for l in maps
+                                 if self.bin_path in l]
+                        if len(addrs) == 0:
+                                raise ValueError("lib %s not loaded in pid %d"
+                                                % (self.bin_path, pid))
+                        sema_addr = int(addrs[0], 16) + self.semaphore
+                else:
+                        sema_addr = self.semaphore      # executable, absolute
+                self.proc_semas[pid] = sema_addr
+                return sema_addr
+
+        def _add_to_semaphore(self, pid, val):
+                sema_addr = self._ensure_proc_sema(pid)
+                with open("/proc/%d/mem" % pid, "r+b") as fd:
+                        fd.seek(sema_addr, 0)
+                        prev = struct.unpack("H", fd.read(2))[0]
+                        fd.seek(sema_addr, 0)
+                        fd.write(struct.pack("H", prev + val))
+
+        def __str__(self):
+                return "%s %s:%s" % (self.bin_path, self.provider, self.name)
+
+        def display_verbose(self):
+                text = str(self) + " [sema 0x%x]\n" % self.semaphore
+                for location in self.locations:
+                        text += "  location 0x%x raw args: %s\n" % \
+                                        (location.address, location.raw_args)
+                        for arg in location.args:
+                                text += "    %s\n" % str(arg)
+                return text
+
+class USDTReader(object):
+        def __init__(self, bin_path="", pid=-1):
+                """
+                __init__(bin_path="", pid=-1)
+
+                Reads all the probes from the specified library, executable,
+                or process. If a pid is specified, all the libraries (including
+                the executable) are searched for probes. After initialization
+                completes, the found probes are in the 'probes' property.
+                """
+                self.probes = []
+                if pid != -1:
+                        for mod in USDTReader._get_modules(pid):
+                                self._add_probes(mod)
+                elif len(bin_path) != 0:
+                        self._add_probes(bin_path)
+                else:
+                        raise ValueError("pid or bin_path is required")
+
+        @staticmethod
+        def _get_modules(pid):
+                with open("/proc/%d/maps" % pid) as f:
+                        maps = f.readlines()
+                modules = []
+                for line in maps:
+                        parts = line.strip().split()
+                        if len(parts) < 6:
+                                continue
+                        if parts[5][0] == '[' or not 'x' in parts[1]:
+                                continue
+                        modules.append(parts[5])
+                return modules
+
+        def _add_probes(self, bin_path):
+                if not os.path.isfile(bin_path):
+                        attempt1 = os.popen(
+                                "which --skip-alias %s 2>/dev/null"
+                                % bin_path).read().strip()
+                        if attempt1 is None or not os.path.isfile(attempt1):
+                                attempt2 = BPF.find_library(bin_path)
+                                if attempt2 is None or \
+                                   not os.path.isfile(attempt2):
+                                        raise ValueError("can't find %s"
+                                                         % bin_path)
+                                else:
+                                        bin_path = attempt2
+                        else:
+                                bin_path = attempt1
+
+                with os.popen("readelf -n %s 2>/dev/null" % bin_path) as child:
+                        notes = child.read()
+                for match in re.finditer(r'stapsdt.*?NT_STAPSDT.*?Provider: ' +
+                        r'(\w+).*?Name: (\w+).*?Location: (\w+), Base: ' +
+                        r'(\w+), Semaphore: (\w+).*?Arguments: ([^\n]*)',
+                        notes, re.DOTALL):
+                        self._add_or_merge_probe(
+                                bin_path, match.group(1), match.group(2),
+                                int(match.group(3), 16),
+                                int(match.group(5), 16), match.group(6)
+                                )
+                # Note that BPF.attach_uprobe takes care of subtracting
+                # the load address for that bin, so we can report the actual
+                # address that appears in the note
+
+        def _add_or_merge_probe(self, bin_path, provider, name, location,
+                                semaphore, arguments):
+                matches = filter(lambda p: p.provider == provider and \
+                                           p.name == name, self.probes)
+                if len(matches) > 0:
+                        probe = matches[0]
+                else:
+                        probe = USDTProbe(bin_path, provider, name, semaphore)
+                        self.probes.append(probe)
+                probe.add_location(location, arguments)
+
+        def __str__(self):
+                return "\n".join(map(USDTProbe.display_verbose, self.probes))
+
diff --git a/src/python/bcc/usyms.py b/src/python/bcc/usyms.py
index d5fe8d3..6e6372c 100644
--- a/src/python/bcc/usyms.py
+++ b/src/python/bcc/usyms.py
@@ -27,16 +27,7 @@
     def refresh_code_ranges(self):
         self.code_ranges = self._get_code_ranges()
         self.ranges_cache = {}
-        self.exe = self._get_exe()
-        self.start_time = self._get_start_time()
-
-    def _get_exe(self):
-        return ProcessSymbols._run_command_get_output(
-                "readlink -f /proc/%d/exe" % self.pid)
-
-    def _get_start_time(self):
-        return ProcessSymbols._run_command_get_output(
-                "cut -d' ' -f 22 /proc/%d/stat" % self.pid)
+        self.procstat = ProcStat(self.pid)
 
     @staticmethod
     def _is_binary_segment(parts):
@@ -101,10 +92,7 @@
         return "%x" % offset
 
     def _check_pid_wrap(self):
-        # If the pid wrapped, our exe name and start time must have changed.
-        # Detect this and get rid of the cached ranges.
-        if self.exe != self._get_exe() or \
-           self.start_time != self._get_start_time():
+        if self.procstat.is_stale():
             self.refresh_code_ranges()
 
     def decode_addr(self, addr):
@@ -127,3 +115,4 @@
                                     binary)
         return "%x" % addr
 
+from . import ProcStat
diff --git a/tools/argdist.py b/tools/argdist.py
index 8f8327d..d3c5239 100755
--- a/tools/argdist.py
+++ b/tools/argdist.py
@@ -12,7 +12,7 @@
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
 
-from bcc import BPF, Tracepoint, Perf
+from bcc import BPF, Tracepoint, Perf, USDTReader
 from time import sleep, strftime
 import argparse
 import re
@@ -20,27 +20,14 @@
 import os
 import sys
 
-class Specifier(object):
-        probe_text = """
-DATA_DECL
-
-int PROBENAME(struct pt_regs *ctx SIGNATURE)
-{
-        PREFIX
-        PID_FILTER
-        if (!(FILTER)) return 0;
-        KEY_EXPR
-        COLLECT
-        return 0;
-}
-"""
+class Probe(object):
         next_probe_index = 0
         aliases = { "$PID": "bpf_get_current_pid_tgid()" }
 
         def _substitute_aliases(self, expr):
                 if expr is None:
                         return expr
-                for alias, subst in Specifier.aliases.items():
+                for alias, subst in Probe.aliases.items():
                         expr = expr.replace(alias, subst)
                 return expr
 
@@ -57,7 +44,9 @@
                         param_name = param[index+1:].strip()
                         self.param_types[param_name] = param_type
 
-        entry_probe_text = """
+        def _generate_entry(self):
+                self.entry_probe_func = self.probe_func_name + "_entry"
+                text = """
 int PROBENAME(struct pt_regs *ctx SIGNATURE)
 {
         u32 pid = bpf_get_current_pid_tgid();
@@ -66,10 +55,6 @@
         return 0;
 }
 """
-
-        def _generate_entry(self):
-                self.entry_probe_func = self.probe_func_name + "_entry"
-                text = self.entry_probe_text
                 text = text.replace("PROBENAME", self.entry_probe_func)
                 text = text.replace("SIGNATURE",
                      "" if len(self.signature) == 0 else ", " + self.signature)
@@ -173,8 +158,8 @@
                                    "function signature must be specified")
                 if len(parts) > 6:
                         self._bail("extraneous ':'-separated parts detected")
-                if parts[0] not in ["r", "p", "t"]:
-                        self._bail("probe type must be 'p', 'r', or 't', " +
+                if parts[0] not in ["r", "p", "t", "u"]:
+                        self._bail("probe type must be 'p', 'r', 't', or 'u' " +
                                    "but got '%s'" % parts[0])
                 if re.match(r"\w+\(.*\)", parts[2]) is None:
                         self._bail(("function signature '%s' has an invalid " +
@@ -191,6 +176,7 @@
                 self.exprs = exprs.split(',')
 
         def __init__(self, type, specifier, pid):
+                self.pid = pid
                 self.raw_spec = specifier
                 self._validate_specifier()
 
@@ -210,6 +196,10 @@
                         self.tp = Tracepoint.enable_tracepoint(
                                         self.tp_category, self.tp_event)
                         self.function = "perf_trace_" + self.function
+                elif self.probe_type == "u":
+                        self.library = parts[1]
+                        self._find_usdt_probe()
+                        self._enable_usdt_probe()
                 else:
                         self.library = parts[1]
                 self.is_user = len(self.library) > 0
@@ -244,12 +234,32 @@
                 self.entry_probe_required = self.probe_type == "r" and \
                         (any(map(check, self.exprs)) or check(self.filter))
 
-                self.pid = pid
                 self.probe_func_name = "%s_probe%d" % \
-                        (self.function, Specifier.next_probe_index)
+                        (self.function, Probe.next_probe_index)
                 self.probe_hash_name = "%s_hash%d" % \
-                        (self.function, Specifier.next_probe_index)
-                Specifier.next_probe_index += 1
+                        (self.function, Probe.next_probe_index)
+                Probe.next_probe_index += 1
+
+        def _enable_usdt_probe(self):
+                if self.usdt.need_enable():
+                        if self.pid is None:
+                                self._bail("probe needs pid to enable")
+                        self.usdt.enable(self.pid)
+
+        def _disable_usdt_probe(self):
+                if self.probe_type == "u" and self.usdt.need_enable():
+                        self.usdt.disable(self.pid)
+
+        def close(self):
+                self._disable_usdt_probe()
+
+        def _find_usdt_probe(self):
+                reader = USDTReader(bin_path=self.library)
+                for probe in reader.probes:
+                        if probe.name == self.function:
+                                self.usdt = probe
+                                return
+                self._bail("unrecognized USDT probe %s" % self.function)
 
         def _substitute_exprs(self):
                 def repl(expr):
@@ -270,8 +280,8 @@
 
         def _generate_field_assignment(self, i):
                 if self._is_string(self.expr_types[i]):
-                        return "        bpf_probe_read(" + \
-                               "&__key.v%d.s, sizeof(__key.v%d.s), %s);\n" % \
+                        return ("        bpf_probe_read(&__key.v%d.s," +
+                                " sizeof(__key.v%d.s), (void *)%s);\n") % \
                                 (i, i, self.exprs[i])
                 else:
                         return "        __key.v%d = %s;\n" % (i, self.exprs[i])
@@ -318,10 +328,25 @@
 
         def generate_text(self):
                 program = ""
+                probe_text = """
+DATA_DECL
+
+QUALIFIER int PROBENAME(struct pt_regs *ctx SIGNATURE)
+{
+        PID_FILTER
+        PREFIX
+        if (!(FILTER)) return 0;
+        KEY_EXPR
+        COLLECT
+        return 0;
+}
+"""
+                prefix = ""
+                qualifier = ""
+                signature = ""
 
                 # If any entry arguments are probed in a ret probe, we need
                 # to generate an entry probe to collect them
-                prefix = ""
                 if self.entry_probe_required:
                         program += self._generate_entry_probe()
                         prefix += self._generate_retprobe_prefix()
@@ -329,18 +354,19 @@
                         # value we collected when entering the function:
                         self._replace_entry_exprs()
 
-                # If this is a tracepoint probe, generate a local variable
-                # that enables access to the tracepoint structure and also
-                # the structure definition itself
                 if self.probe_type == "t":
                         program += self.tp.generate_struct()
                         prefix += self.tp.generate_get_struct()
+                elif self.probe_type == "u":
+                        qualifier = "static inline"
+                        signature = ", int __loc_id"
+                        prefix += self.usdt.generate_usdt_cases()
+                elif self.probe_type == "p" and len(self.signature) > 0:
+                        # Only entry uprobes/kprobes can have user-specified
+                        # signatures. Other probes force it to ().
+                        signature = ", " + self.signature
 
-                program += self.probe_text.replace("PROBENAME",
-                                                   self.probe_func_name)
-                signature = "" if len(self.signature) == 0 \
-                                  or self.probe_type == "r" \
-                               else ", " + self.signature
+                program += probe_text.replace("PROBENAME", self.probe_func_name)
                 program = program.replace("SIGNATURE", signature)
                 program = program.replace("PID_FILTER",
                                           self._generate_pid_filter())
@@ -354,34 +380,56 @@
                         "1" if len(self.filter) == 0 else self.filter)
                 program = program.replace("COLLECT", collect)
                 program = program.replace("PREFIX", prefix)
+                program = program.replace("QUALIFIER", qualifier)
+
+                if self.probe_type == "u":
+                        self.usdt_thunk_names = []
+                        program += self.usdt.generate_usdt_thunks(
+                                self.probe_func_name, self.usdt_thunk_names)
+
                 return program
 
+        def _attach_u(self):
+                libpath = BPF.find_library(self.library)
+                if libpath is None:
+                        with os.popen(("which --skip-alias %s " +
+                                "2>/dev/null") % self.library) as w:
+                                libpath = w.read().strip()
+                if libpath is None or len(libpath) == 0:
+                        self._bail("unable to find library %s" %
+                                   self.library)
+
+                if self.probe_type == "u":
+                        for i, location in enumerate(self.usdt.locations):
+                                self.bpf.attach_uprobe(name=libpath,
+                                        addr=location.address,
+                                        fn_name=self.usdt_thunk_names[i],
+                                        pid=self.pid or -1)
+                elif self.probe_type == "r":
+                        self.bpf.attach_uretprobe(name=libpath,
+                                                  sym=self.function,
+                                                  fn_name=self.probe_func_name,
+                                                  pid=self.pid or -1)
+                else:
+                        self.bpf.attach_uprobe(name=libpath,
+                                               sym=self.function,
+                                               fn_name=self.probe_func_name,
+                                               pid=self.pid or -1)
+
+        def _attach_k(self):
+                if self.probe_type == "r" or self.probe_type == "t":
+                        self.bpf.attach_kretprobe(event=self.function,
+                                             fn_name=self.probe_func_name)
+                else:
+                        self.bpf.attach_kprobe(event=self.function,
+                                          fn_name=self.probe_func_name)
+
         def attach(self, bpf):
                 self.bpf = bpf
-                uprobes_start = len(BPF.open_uprobes())
-                kprobes_start = len(BPF.open_kprobes())
                 if self.is_user:
-                        if self.probe_type == "r":
-                                bpf.attach_uretprobe(name=self.library,
-                                                  sym=self.function,
-                                                  fn_name=self.probe_func_name,
-                                                  pid=self.pid or -1)
-                        else:
-                                bpf.attach_uprobe(name=self.library,
-                                                  sym=self.function,
-                                                  fn_name=self.probe_func_name,
-                                                  pid=self.pid or -1)
-                        if len(BPF.open_uprobes()) != uprobes_start + 1:
-                                self._bail("error attaching probe")
+                        self._attach_u()
                 else:
-                        if self.probe_type == "r" or self.probe_type == "t":
-                                bpf.attach_kretprobe(event=self.function,
-                                                  fn_name=self.probe_func_name)
-                        else:
-                                bpf.attach_kprobe(event=self.function,
-                                                  fn_name=self.probe_func_name)
-                        if len(BPF.open_kprobes()) != kprobes_start + 1:
-                                self._bail("error attaching probe")
+                        self._attach_k()
                 if self.entry_probe_required:
                         self._attach_entry_probe()
 
@@ -397,7 +445,7 @@
                 expr = self.exprs[i].replace(
                         "(bpf_ktime_get_ns() - *____latency_val)", "$latency")
                 # Replace alias values back with the alias name
-                for alias, subst in Specifier.aliases.items():
+                for alias, subst in Probe.aliases.items():
                         expr = expr.replace(subst, alias)
                 # Replace retval expression with $retval
                 expr = expr.replace("ctx->ax", "$retval")
@@ -445,12 +493,16 @@
                                 if not self.is_default_expr  else "retval")
                         data.print_log2_hist(val_type=label)
 
+        def __str__(self):
+                return self.label or self.raw_spec
+
 class Tool(object):
         examples = """
 Probe specifier syntax:
-        {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+        {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
-        p,r,t      -- probe at function entry, function exit, or kernel tracepoint
+        p,r,t,u    -- probe at function entry, function exit, kernel tracepoint,
+                      or USDT probe
                       in exit probes: can use $retval, $entry(param), $latency
         library    -- the library that contains the function
                       (leave empty for kernel functions)
@@ -509,6 +561,10 @@
 argdist -C 't:irq:irq_handler_entry():int:tp.irq'
         Aggregate interrupts by interrupt request (IRQ)
 
+argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
+        Print frequency of function addresses used as a pthread start function,
+        relying on the USDT pthread_start probe in process 1337
+
 argdist  -H \\
         'p:c:sleep(u32 seconds):u32:seconds' \\
         'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
@@ -552,15 +608,15 @@
                   help="additional header files to include in the BPF program")
                 self.args = parser.parse_args()
 
-        def _create_specifiers(self):
-                self.specifiers = []
+        def _create_probes(self):
+                self.probes = []
                 for specifier in (self.args.countspecifier or []):
-                        self.specifiers.append(Specifier(
+                        self.probes.append(Probe(
                                 "freq", specifier, self.args.pid))
                 for histspecifier in (self.args.histspecifier or []):
-                        self.specifiers.append(
-                                Specifier("hist", histspecifier, self.args.pid))
-                if len(self.specifiers) == 0:
+                        self.probes.append(
+                                Probe("hist", histspecifier, self.args.pid))
+                if len(self.probes) == 0:
                         print("at least one specifier is required")
                         exit()
 
@@ -573,19 +629,19 @@
                 for include in (self.args.include or []):
                         bpf_source += "#include <%s>\n" % include
                 bpf_source += BPF.generate_auto_includes(
-                                map(lambda s: s.raw_spec, self.specifiers))
+                                map(lambda p: p.raw_spec, self.probes))
                 bpf_source += Tracepoint.generate_decl()
                 bpf_source += Tracepoint.generate_entry_probe()
-                for specifier in self.specifiers:
-                        bpf_source += specifier.generate_text()
+                for probe in self.probes:
+                        bpf_source += probe.generate_text()
                 if self.args.verbose:
                         print(bpf_source)
                 self.bpf = BPF(text=bpf_source)
 
         def _attach(self):
                 Tracepoint.attach(self.bpf)
-                for specifier in self.specifiers:
-                        specifier.attach(self.bpf)
+                for probe in self.probes:
+                        probe.attach(self.bpf)
                 if self.args.verbose:
                         print("open uprobes: %s" % BPF.open_uprobes())
                         print("open kprobes: %s" % BPF.open_kprobes())
@@ -598,16 +654,22 @@
                         except KeyboardInterrupt:
                                 exit()
                         print("[%s]" % strftime("%H:%M:%S"))
-                        for specifier in self.specifiers:
-                                specifier.display(self.args.top)
+                        for probe in self.probes:
+                                probe.display(self.args.top)
                         count_so_far += 1
                         if self.args.count is not None and \
                            count_so_far >= self.args.count:
                                 exit()
 
+        def _close_probes(self):
+                for probe in self.probes:
+                        probe.close()
+                        if self.args.verbose:
+                                print("closed probe: " + str(probe))
+
         def run(self):
                 try:
-                        self._create_specifiers()
+                        self._create_probes()
                         self._generate_program()
                         self._attach()
                         self._main_loop()
@@ -616,6 +678,7 @@
                                 traceback.print_exc()
                         elif sys.exc_type is not SystemExit:
                                 print(sys.exc_value)
+                self._close_probes()
 
 if __name__ == "__main__":
         Tool().run()
diff --git a/tools/argdist_example.txt b/tools/argdist_example.txt
index acd4549..d851337 100644
--- a/tools/argdist_example.txt
+++ b/tools/argdist_example.txt
@@ -332,9 +332,10 @@
                         additional header files to include in the BPF program
 
 Probe specifier syntax:
-        {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+        {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
-        p,r,t      -- probe at function entry, function exit, or kernel tracepoint
+        p,r,t,u    -- probe at function entry, function exit, kernel tracepoint,
+                      or USDT probe
                       in exit probes: can use $retval, $entry(param), $latency
         library    -- the library that contains the function
                       (leave empty for kernel functions)
@@ -392,6 +393,10 @@
 argdist -C 't:irq:irq_handler_entry():int:tp.irq'
         Aggregate interrupts by interrupt request (IRQ)
 
+argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
+        Print frequency of function addresses used as a pthread start function,
+        relying on the USDT pthread_start probe in process 1337
+
 argdist -H \
         'p:c:sleep(u32 seconds):u32:seconds' \
         'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
diff --git a/tools/tplist.py b/tools/tplist.py
index fead0a1..abb011d 100755
--- a/tools/tplist.py
+++ b/tools/tplist.py
@@ -1,27 +1,34 @@
 #!/usr/bin/env python
 #
-# tplist    Display kernel tracepoints and their formats.
+# tplist    Display kernel tracepoints or USDT probes and their formats.
 #
-# USAGE:    tplist [-v] [tracepoint]
+# USAGE:    tplist [-p PID] [-l LIB] [-v] [filter]
 #
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
 
 import argparse
 import fnmatch
-import re
 import os
+import re
+import sys
+
+from bcc import USDTReader
 
 trace_root = "/sys/kernel/debug/tracing"
 event_root = os.path.join(trace_root, "events")
 
 parser = argparse.ArgumentParser(description=
-                "Display kernel tracepoints and their formats.",
+                "Display kernel tracepoints or USDT probes and their formats.",
                 formatter_class=argparse.RawDescriptionHelpFormatter)
+parser.add_argument("-p", "--pid", type=int, default=-1, help=
+                "List USDT probes in the specified process")
+parser.add_argument("-l", "--lib", default="", help=
+                "List USDT probes in the specified library or executable")
 parser.add_argument("-v", dest="variables", action="store_true", help=
-                "Print the format (available variables) for each tracepoint")
-parser.add_argument(dest="tracepoint", nargs="?",
-                help="The tracepoint name to print (wildcards allowed)")
+                "Print the format (available variables)")
+parser.add_argument(dest="filter", nargs="?", help=
+                "A filter that specifies which probes/tracepoints to print")
 args = parser.parse_args()
 
 def print_tpoint_format(category, event):
@@ -42,12 +49,12 @@
 
 def print_tpoint(category, event):
         tpoint = "%s:%s" % (category, event)
-        if not args.tracepoint or fnmatch.fnmatch(tpoint, args.tracepoint):
+        if not args.filter or fnmatch.fnmatch(tpoint, args.filter):
                 print(tpoint)
                 if args.variables:
                         print_tpoint_format(category, event)
 
-def print_all():
+def print_tracepoints():
         for category in os.listdir(event_root):
                 cat_dir = os.path.join(event_root, category)
                 if not os.path.isdir(cat_dir):
@@ -57,5 +64,28 @@
                         if os.path.isdir(evt_dir):
                                 print_tpoint(category, event)
 
+def print_usdt(pid, lib):
+        reader = USDTReader(bin_path=lib, pid=pid)
+        probes_seen = []
+        for probe in reader.probes:
+                probe_name = "%s:%s" % (probe.provider, probe.name)
+                if not args.filter or fnmatch.fnmatch(probe_name, args.filter):
+                        if probe_name in probes_seen:
+                                continue
+                        probes_seen.append(probe_name)
+                        if args.variables:
+                                print(probe.display_verbose())
+                        else:
+                                print("%s %s:%s" % (probe.bin_path,
+                                        probe.provider, probe.name))
+
 if __name__ == "__main__":
-        print_all()
+        try:
+                if args.pid != -1 or args.lib != "":
+                        print_usdt(args.pid, args.lib)
+                else:
+                        print_tracepoints()
+        except:
+                if sys.exc_type is not SystemExit:
+                        print(sys.exc_value)
+
diff --git a/tools/tplist_example.txt b/tools/tplist_example.txt
new file mode 100644
index 0000000..dfa13e2
--- /dev/null
+++ b/tools/tplist_example.txt
@@ -0,0 +1,113 @@
+Demonstrations of tplist.
+
+
+tplist displays kernel tracepoints and USDT probes, including their
+format. It can be used to discover probe points for use with the trace
+and argdist tools. Kernel tracepoints are scattered around the kernel
+and provide valuable static tracing on block and network I/O, scheduling,
+power events, and many other subjects. USDT probes are placed in libraries
+(such as libc) and executables (such as node) and provide static tracing
+information that can (optionally) be turned on and off at runtime.
+
+For example, suppose you want to discover which USDT probes a particular
+executable contains. Just run tplist on that executable (or library):
+
+$ tplist -l basic_usdt
+/home/vagrant/basic_usdt basic_usdt:start_main
+/home/vagrant/basic_usdt basic_usdt:loop_iter
+/home/vagrant/basic_usdt basic_usdt:end_main
+
+The loop_iter probe sounds interesting. What are the locations of that
+probe, and which variables are available?
+
+$ tplist '*loop_iter' -l basic_usdt -v
+/home/vagrant/basic_usdt basic_usdt:loop_iter [sema 0x601036]
+  location 0x400550 raw args: -4@$42 8@%rax
+    4   signed bytes @ constant 42
+    8 unsigned bytes @ register %rax
+  location 0x40056f raw args: 8@-8(%rbp) 8@%rax
+    8 unsigned bytes @ -8(%rbp)
+    8 unsigned bytes @ register %rax
+
+This output indicates that the loop_iter probe is used in two locations
+in the basic_usdt executable. The first location passes a constant value,
+42, to the probe. The second location passes a variable value located at
+an offset from the %rbp register. Don't worry -- you don't have to trace
+the register values yourself. The argdist and trace tools understand the
+probe format and can print out the arguments automatically -- you can
+refer to them as arg1, arg2, and so on.
+
+Try exploring some common libraries on your system to see whether they
+contain USDT probes. Here are two examples you might find interesting:
+
+$ tplist -l pthread     # list probes in libpthread
+/lib64/libpthread.so.0 libpthread:pthread_start
+/lib64/libpthread.so.0 libpthread:pthread_create
+/lib64/libpthread.so.0 libpthread:pthread_join
+/lib64/libpthread.so.0 libpthread:pthread_join_ret
+/lib64/libpthread.so.0 libpthread:mutex_init
+... more output truncated
+
+$ tplist -l c           # list probes in libc
+/lib64/libc.so.6 libc:setjmp
+/lib64/libc.so.6 libc:longjmp
+/lib64/libc.so.6 libc:longjmp_target
+/lib64/libc.so.6 libc:memory_arena_reuse_free_list
+/lib64/libc.so.6 libc:memory_heap_new
+... more output truncated
+
+tplist also understands kernel tracepoints, and can list their format
+as well. For example, let's look for all block I/O-related tracepoints:
+
+# tplist 'block*'
+block:block_touch_buffer
+block:block_dirty_buffer
+block:block_rq_abort
+block:block_rq_requeue
+block:block_rq_complete
+block:block_rq_insert
+block:block_rq_issue
+block:block_bio_bounce
+block:block_bio_complete
+block:block_bio_backmerge
+block:block_bio_frontmerge
+block:block_bio_queue
+block:block_getrq
+block:block_sleeprq
+block:block_plug
+block:block_unplug
+block:block_split
+block:block_bio_remap
+block:block_rq_remap
+
+The block:block_rq_complete tracepoint sounds interesting. Let's print
+its format to see what we can trace with argdist and trace:
+
+$ tplist -v block:block_rq_complete
+block:block_rq_complete
+    dev_t dev;
+    sector_t sector;
+    unsigned int nr_sector;
+    int errors;
+    char rwbs[8];
+
+The dev, sector, nr_sector, etc. variables can now all be used in probes
+you specify with argdist or trace.
+
+
+USAGE message:
+
+$ tplist -h
+usage: tplist.py [-h] [-p PID] [-l LIB] [-v] [filter]
+
+Display kernel tracepoints or USDT probes and their formats.
+
+positional arguments:
+  filter             A filter that specifies which probes/tracepoints to print
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  List USDT probes in the specified process
+  -l LIB, --lib LIB  List USDT probes in the specified library or executable
+  -v                 Print the format (available variables)
+
diff --git a/tools/trace.py b/tools/trace.py
index 4aac067..c5ec39c 100755
--- a/tools/trace.py
+++ b/tools/trace.py
@@ -9,7 +9,7 @@
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
 
-from bcc import BPF, Tracepoint, Perf
+from bcc import BPF, Tracepoint, Perf, USDTReader
 from time import sleep, strftime
 import argparse
 import re
@@ -49,12 +49,14 @@
         event_count = 0
         first_ts = 0
         use_localtime = True
+        pid = -1
 
         @classmethod
         def configure(cls, args):
                 cls.max_events = args.max_events
                 cls.use_localtime = not args.offset
                 cls.first_ts = Time.monotonic_time()
+                cls.pid = args.pid or -1
 
         def __init__(self, probe, string_size):
                 self.raw_probe = probe
@@ -63,18 +65,18 @@
                 self._parse_probe()
                 self.probe_num = Probe.probe_count
                 self.probe_name = "probe_%s_%d" % \
-                                (self.function, self.probe_num)
+                                (self._display_function(), self.probe_num)
 
         def __str__(self):
-                return "%s:%s`%s FLT=%s ACT=%s/%s" % (self.probe_type,
-                        self.library, self.function, self.filter,
+                return "%s:%s:%s FLT=%s ACT=%s/%s" % (self.probe_type,
+                        self.library, self._display_function(), self.filter,
                         self.types, self.values)
 
         def is_default_action(self):
                 return self.python_format == ""
 
         def _bail(self, error):
-                raise ValueError("error parsing probe '%s': %s" %
+                raise ValueError("error in probe '%s': %s" %
                                  (self.raw_probe, error))
 
         def _parse_probe(self):
@@ -124,11 +126,11 @@
                         parts = ["p", parts[0], parts[1]]
                 if len(parts[0]) == 0:
                         self.probe_type = "p"
-                elif parts[0] in ["p", "r", "t"]:
+                elif parts[0] in ["p", "r", "t", "u"]:
                         self.probe_type = parts[0]
                 else:
-                        self._bail("expected '', 'p', 't', or 'r', got '%s'" %
-                                   parts[0])
+                        self._bail("probe type must be '', 'p', 't', 'r', " +
+                                   "or 'u', but got '%s'" % parts[0])
                 if self.probe_type == "t":
                         self.tp_category = parts[1]
                         self.tp_event = parts[2]
@@ -136,10 +138,39 @@
                                         self.tp_category, self.tp_event)
                         self.library = ""       # kernel
                         self.function = "perf_trace_%s" % self.tp_event
+                elif self.probe_type == "u":
+                        self.library = parts[1]
+                        self.usdt_name = parts[2]
+                        self.function = ""      # no function, just address
+                        # We will discover the USDT provider by matching on
+                        # the USDT name in the specified library
+                        self._find_usdt_probe()
+                        self._enable_usdt_probe()
                 else:
                         self.library = parts[1]
                         self.function = parts[2]
 
+        def _enable_usdt_probe(self):
+                if self.usdt.need_enable():
+                        if Probe.pid == -1:
+                                self._bail("probe needs pid to enable")
+                        self.usdt.enable(Probe.pid)
+
+        def _disable_usdt_probe(self):
+                if self.probe_type == "u" and self.usdt.need_enable():
+                        self.usdt.disable(Probe.pid)
+
+        def close(self):
+                self._disable_usdt_probe()
+
+        def _find_usdt_probe(self):
+                reader = USDTReader(bin_path=self.library)
+                for probe in reader.probes:
+                        if probe.name == self.usdt_name:
+                                self.usdt = probe
+                                return
+                self._bail("unrecognized USDT probe %s" % self.usdt_name)
+
         def _parse_filter(self, filt):
                 self.filter = self._replace_args(filt)
 
@@ -187,6 +218,10 @@
 
         def _replace_args(self, expr):
                 for alias, replacement in Probe.aliases.items():
+                        # For USDT probes, leave argN untouched here: the generated
+                        # USDT cases supply the actual arguments per location.
+                        if alias.startswith("arg") and self.probe_type == "u":
+                                continue
                         expr = expr.replace(alias, replacement)
                 return expr
 
@@ -206,7 +241,7 @@
 
         def _generate_python_data_decl(self):
                 self.python_struct_name = "%s_%d_Data" % \
-                                (self.function, self.probe_num)
+                                (self._display_function(), self.probe_num)
                 fields = [
                         ("timestamp_ns", ct.c_ulonglong),
                         ("pid", ct.c_uint),
@@ -266,21 +301,16 @@
                 bpf_probe_read(&__data.v%d, sizeof(__data.v%d), (void *)%s);
         }
 """                     % (expr, idx, idx, expr)
-                        # return ("bpf_probe_read(&__data.v%d, " + \
-                        # "sizeof(__data.v%d), (char*)%s);\n") % (idx, idx, expr)
-                        # return ("__builtin_memcpy(&__data.v%d, (void *)%s, " + \
-                        #        "sizeof(__data.v%d));\n") % (idx, expr, idx)
                 if field_type in Probe.fmt_types:
                         return "        __data.v%d = (%s)%s;\n" % \
                                         (idx, Probe.c_type[field_type], expr)
                 self._bail("unrecognized field type %s" % field_type)
 
-        def generate_program(self, pid, include_self):
+        def generate_program(self, include_self):
                 data_decl = self._generate_data_decl()
-                self.pid = pid
                 # kprobes don't have built-in pid filters, so we have to add
                 # it to the function body:
-                if len(self.library) == 0 and pid != -1:
+                if len(self.library) == 0 and Probe.pid != -1:
                         pid_filter = """
         u32 __pid = bpf_get_current_pid_tgid();
         if (__pid != %d) { return 0; }
@@ -293,17 +323,23 @@
                 else:
                         pid_filter = ""
 
+                prefix = ""
+                qualifier = ""
+                signature = "struct pt_regs *ctx"
+                if self.probe_type == "t":
+                        data_decl += self.tp.generate_struct()
+                        prefix = self.tp.generate_get_struct()
+                elif self.probe_type == "u":
+                        signature += ", int __loc_id"
+                        prefix = self.usdt.generate_usdt_cases()
+                        qualifier = "static inline"
+
                 data_fields = ""
                 for i, expr in enumerate(self.values):
                         data_fields += self._generate_field_assign(i)
 
-                prefix = ""
-                if self.probe_type == "t":
-                        data_decl += self.tp.generate_struct()
-                        prefix = self.tp.generate_get_struct()
-
                 text = """
-int %s(struct pt_regs *ctx)
+%s int %s(%s)
 {
         %s
         %s
@@ -318,9 +354,14 @@
         return 0;
 }
 """
-                text = text % (self.probe_name, pid_filter, prefix,
-                               self.filter, self.struct_name,
-                               data_fields, self.events_name)
+                text = text % (qualifier, self.probe_name, signature,
+                               pid_filter, prefix, self.filter,
+                               self.struct_name, data_fields, self.events_name)
+
+                if self.probe_type == "u":
+                        self.usdt_thunk_names = []
+                        text += self.usdt.generate_usdt_thunks(
+                                        self.probe_name, self.usdt_thunk_names)
 
                 return data_decl + "\n" + text
 
@@ -329,10 +370,12 @@
                 return "%.6f" % (1e-9 * (timestamp_ns - cls.first_ts))
 
         def _display_function(self):
-                if self.probe_type != 't':
+                if self.probe_type == 'p' or self.probe_type == 'r':
                         return self.function
-                else:
-                        return self.function.replace("perf_trace_", "")
+                elif self.probe_type == 'u':
+                        return self.usdt_name
+                else:   # self.probe_type == 't'
+                        return self.tp_event
 
         def print_event(self, cpu, data, size):
                 # Cast as the generated structure type and display
@@ -361,39 +404,40 @@
                 bpf[self.events_name].open_perf_buffer(self.print_event)
 
         def _attach_k(self, bpf):
-                kprobes_start = len(BPF.open_kprobes())
                 if self.probe_type == "r":
                         bpf.attach_kretprobe(event=self.function,
                                              fn_name=self.probe_name)
                 elif self.probe_type == "p" or self.probe_type == "t":
                         bpf.attach_kprobe(event=self.function,
                                           fn_name=self.probe_name)
-                if len(BPF.open_kprobes()) != kprobes_start + 1:
-                        self._bail("error attaching probe")
 
         def _attach_u(self, bpf):
                 libpath = BPF.find_library(self.library)
                 if libpath is None:
                         # This might be an executable (e.g. 'bash')
-                        with os.popen("/usr/bin/which %s 2>/dev/null" %
-                                      self.library) as w:
+                        with os.popen(
+                                "/usr/bin/which --skip-alias %s 2>/dev/null" %
+                                self.library) as w:
                                 libpath = w.read().strip()
                 if libpath is None or len(libpath) == 0:
                         self._bail("unable to find library %s" % self.library)
 
-                uprobes_start = len(BPF.open_uprobes())
-                if self.probe_type == "r":
+                if self.probe_type == "u":
+                        for i, location in enumerate(self.usdt.locations):
+                                bpf.attach_uprobe(name=libpath,
+                                        addr=location.address,
+                                        fn_name=self.usdt_thunk_names[i],
+                                        pid=Probe.pid)
+                elif self.probe_type == "r":
                         bpf.attach_uretprobe(name=libpath,
                                              sym=self.function,
                                              fn_name=self.probe_name,
-                                             pid=self.pid)
+                                             pid=Probe.pid)
                 else:
                         bpf.attach_uprobe(name=libpath,
                                           sym=self.function,
                                           fn_name=self.probe_name,
-                                          pid=self.pid)
-                if len(BPF.open_uprobes()) != uprobes_start + 1:
-                        self._bail("error attaching probe")
+                                          pid=Probe.pid)
 
 class Tool(object):
         examples = """
@@ -419,6 +463,8 @@
         Trace returns from malloc and print non-NULL allocated buffers
 trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
         Trace the block_rq_complete kernel tracepoint and print # of tx sectors
+trace 'u:pthread:pthread_create (arg4 != 0)'
+        Trace the USDT probe pthread_create when its 4th argument is non-zero
 """
 
         def __init__(self):
@@ -461,7 +507,7 @@
                 self.program += Tracepoint.generate_entry_probe()
                 for probe in self.probes:
                         self.program += probe.generate_program(
-                                self.args.pid or -1, self.args.include_self)
+                                        self.args.include_self)
 
                 if self.args.verbose:
                         print(self.program)
@@ -486,6 +532,12 @@
                 while True:
                         self.bpf.kprobe_poll()
 
+        def _close_probes(self):
+                for probe in self.probes:
+                        probe.close()
+                        if self.args.verbose:
+                                print("closed probe: " + str(probe))
+
         def run(self):
                 try:
                         self._create_probes()
@@ -497,6 +549,7 @@
                                 traceback.print_exc()
                         elif sys.exc_type is not SystemExit:
                                 print(sys.exc_value)
+                self._close_probes()
 
 if __name__ == "__main__":
        Tool().run()
diff --git a/tools/trace_example.txt b/tools/trace_example.txt
index 98831ab..dce72b9 100644
--- a/tools/trace_example.txt
+++ b/tools/trace_example.txt
@@ -171,4 +171,6 @@
         Trace returns from malloc and print non-NULL allocated buffers
 trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
         Trace the block_rq_complete kernel tracepoint and print # of tx sectors
+trace 'u:pthread:pthread_create (arg4 != 0)'
+        Trace the USDT probe pthread_create when its 4th argument is non-zero