<?xml version="1.0" encoding='ISO-8859-1'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

<book id="oprofile-internals">
<bookinfo>
	<title>OProfile Internals</title>

	<authorgroup>
		<author>
			<firstname>John</firstname>
			<surname>Levon</surname>
			<affiliation>
				<address><email>levon@movementarian.org</email></address>
			</affiliation>
		</author>
	</authorgroup>

	<copyright>
		<year>2003</year>
		<holder>John Levon</holder>
	</copyright>
</bookinfo>

<toc></toc>

<chapter id="introduction">
<title>Introduction</title>

<para>
This document is current for OProfile version <oprofileversion />.
It describes the internal workings of OProfile for the interested
hacker, and assumes strong C, working C++, plus some knowledge of
kernel internals and CPU hardware.
</para>
<note>
<para>
Only the "new" implementation associated with kernel 2.6 and above is covered here. Kernel
2.4 uses a very different kernel module implementation and daemon to produce the sample files.
</para>
</note>

<sect1 id="overview">
<title>Overview</title>
<para>
OProfile is a statistical continuous profiler. In other words, profiles are generated by
regularly sampling the current registers on each CPU (from an interrupt handler, the
saved PC value at the time of the interrupt is stored), and converting that runtime PC
value into something meaningful to the programmer.
</para>
<para>
OProfile achieves this by taking the stream of sampled PC values, along with the detail
of which task was running at the time of the interrupt, and converting it into a file offset
against a particular binary file. Because applications <function>mmap()</function>
the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename>
or whatever), it's possible to find the relevant binary file and offset by walking
the task's list of mapped memory areas. Each PC value is thus converted into a tuple
of binary-image,offset. This is something that the userspace tools can use directly
to reconstruct where the code came from, including the particular assembly instructions,
symbol, and source line (via the binary's debug information if present).
</para>
<para>
Regularly sampling the PC value like this approximates what actually was executed and
how often - more often than not, this statistical approximation is good enough to
reflect reality. In common operation, the time between each sample interrupt is regulated
by a fixed number of clock cycles. This implies that the results will reflect where
the CPU is spending the most time; this is obviously a very useful information source
for performance analysis.
</para>
<para>
Sometimes though, an application programmer needs different kinds of information: for example,
"which of the source routines cause the most cache misses?". The rise in importance of
such metrics in recent years has led many CPU manufacturers to provide hardware performance
counters capable of measuring these events on the hardware level. Typically, these counters
increment once per event, and generate an interrupt on reaching some pre-defined
number of events. OProfile can use these interrupts to generate samples: then, the
profile results are a statistical approximation of which code caused how many of the
given event.
</para>
<para>
Consider a simplified system that only executes two functions A and B. A
takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at
100 cycles a second, and we've set the performance counter to create an
interrupt after a set number of "events" (in this case an event is one
clock cycle). It should be clear that the chances of the interrupt
occurring in function A are 1/100, and 99/100 for function B. Thus, we
statistically approximate the actual relative performance features of
the two functions over time. This same analysis works for other types of
events, providing that the interrupt is tied to the number of events
occurring (that is, after N events, an interrupt is generated).
</para>
<para>
There is typically more than one of these counters, so it's possible to set up profiling
for several different event types. Using these counters gives us a powerful, low-overhead
way of gaining performance metrics. If OProfile, or the CPU, does not support performance
counters, then a simpler method is used: the kernel timer interrupt feeds samples
into OProfile itself.
</para>
<para>
The rest of this document concerns itself with how we get from receiving samples at
interrupt time to producing user-readable profile information.
</para>
</sect1>

<sect1 id="components">
<title>Components of the OProfile system</title>

<sect2 id="arch-specific-components">
<title>Architecture-specific components</title>
<para>
If OProfile supports the hardware performance counters found on
a particular architecture, the code for setting up and managing
these counters can be found in the kernel source
tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename>
directory. The architecture-specific implementation works by
filling in the <varname>oprofile_operations</varname> structure at init time. This
provides a set of operations such as <function>setup()</function>,
<function>start()</function>, <function>stop()</function>, etc.
that manage the hardware-specific details of fiddling with the
performance counter registers.
</para>
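<para>
As a rough sketch (a paraphrase of the 2.6 driver interface; see
<filename>include/linux/oprofile.h</filename> for the authoritative
definition, whose exact signatures vary between kernel versions), the
structure looks something like this:
</para>
<screen>
struct oprofile_operations {
	/* create any architecture-specific configuration files */
	int (*create_files)(struct super_block *sb, struct dentry *root);
	/* program the counter registers from the configured values */
	int (*setup)(void);
	void (*shutdown)(void);
	/* enable and disable the configured counters */
	int (*start)(void);
	void (*stop)(void);
	/* CPU identification string, e.g. "i386/athlon" */
	char *cpu_type;
};
</screen>
<para>
The architecture's init routine fills this in and hands it back to the
generic driver, which then invokes the operations at the appropriate
points in the profiling session.
</para>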
<para>
The other important facility available to the architecture code is
<function>oprofile_add_sample()</function>. This is where a particular sample
taken at interrupt time is fed into the generic OProfile driver code.
</para>
</sect2>

<sect2 id="filesystem">
<title>oprofilefs</title>
<para>
OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from
userspace at <filename>/dev/oprofile</filename>. This consists of small
files for reporting and receiving configuration from userspace, as well
as the actual character device that the OProfile userspace receives samples
from. At <function>setup()</function> time, the architecture-specific code may
add further configuration files related to the details of the performance
counters. For example, on x86, one numbered directory for each hardware
performance counter is added, with files in each for the event type,
reset value, etc.
</para>
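<para>
For illustration, the counter directories are populated along these
lines (a condensed paraphrase of the x86 NMI driver's
<function>create_files</function> callback; error handling omitted):
</para>
<screen>
/* sketch: one numbered directory per counter, each holding the
 * configuration files for that counter */
static int nmi_create_files(struct super_block *sb, struct dentry *root)
{
	unsigned int i;

	for (i = 0; i &lt; model-&gt;num_counters; ++i) {
		struct dentry *dir;
		char buf[4];

		snprintf(buf, sizeof(buf), "%d", i);
		dir = oprofilefs_mkdir(sb, root, buf);
		oprofilefs_create_ulong(sb, dir, "enabled", &amp;counter_config[i].enabled);
		oprofilefs_create_ulong(sb, dir, "event", &amp;counter_config[i].event);
		oprofilefs_create_ulong(sb, dir, "count", &amp;counter_config[i].count);
		oprofilefs_create_ulong(sb, dir, "unit_mask", &amp;counter_config[i].unit_mask);
		oprofilefs_create_ulong(sb, dir, "kernel", &amp;counter_config[i].kernel);
		oprofilefs_create_ulong(sb, dir, "user", &amp;counter_config[i].user);
	}
	return 0;
}
</screen>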
<para>
The filesystem also contains a <filename>stats</filename> directory with
a number of useful counters for various OProfile events.
</para>
</sect2>

<sect2 id="driver">
<title>Generic kernel driver</title>
<para>
This lives in <filename>drivers/oprofile/</filename>, and forms the core of
how OProfile works in the kernel. Its job is to take samples delivered
from the architecture-specific code (via <function>oprofile_add_sample()</function>),
and buffer this data, in a transformed form as described later, until releasing
the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename>
character device.
</para>
</sect2>

<sect2 id="daemon">
<title>The OProfile daemon</title>
<para>
The OProfile userspace daemon's job is to take the raw data provided by the
kernel and write it to the disk. It takes the single data stream from the
kernel and logs sample data against a number of sample files (found in
<filename>$SESSION_DIR/samples/current/</filename>, by default located at
<filename>/var/lib/oprofile/samples/current/</filename>). For the benefit
of the "separate" functionality, the names/paths of these sample files
are mangled to reflect where the samples were from: this can include
thread IDs, the binary file path, the event type used, and more.
</para>
<para>
After this final step from interrupt to disk file, the data is now
persistent (that is, changes in the running of the system do not invalidate
stored data). So the post-profiling tools can run on this data at any
time (assuming the original binary files are still available and unchanged,
naturally).
</para>
</sect2>

<sect2 id="post-profiling">
<title>Post-profiling tools</title>
<para>
So far, we've collected data, but we've yet to present it in a useful form
to the user. This is the job of the post-profiling tools. In general form,
they collate a subset of the available sample files, load and process each one
correlated against the relevant binary file, and finally produce user-readable
information.
</para>
</sect2>

</sect1>

</chapter>

<chapter id="performance-counters">
<title>Performance counter management</title>

<sect1 id="performance-counters-ui">
<title>Providing a user interface</title>

<para>
The performance counter registers need programming in order to set the
type of event to count, etc. OProfile uses a standard model across all
CPUs for defining these events as follows:
</para>
<informaltable frame="all">
<tgroup cols='2'>
<tbody>
<row><entry><option>event</option></entry><entry>The event type (e.g. DATA_MEM_REFS)</entry></row>
<row><entry><option>unit mask</option></entry><entry>The sub-events to count (more detailed specification)</entry></row>
<row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row>
<row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row>
<row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row>
<row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row>
</tbody>
</tgroup>
</informaltable>
<para>
The term "unit mask" is borrowed from the Intel architectures, and can
further specify exactly when a counter is incremented (for example,
cache-related events can be restricted to particular state transitions
of the cache lines).
</para>
<para>
All of the available hardware events and their details are specified in
the textual files in the <filename>events</filename> directory. The
syntax of these files should be fairly obvious. The user specifies the
names and configuration details of the chosen counters via
<command>opcontrol</command>. These are then written to the kernel
module (in numerical form) via <filename>/dev/oprofile/N/</filename>
where N is the physical hardware counter (some events can only be used
on specific counters; OProfile hides these details from the user when
possible). On IA64, the perfmon-based interface behaves somewhat
differently, as described later.
</para>
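<para>
For example, an entry in the i386 event files takes roughly this form
(illustrative only; consult the real files for the precise fields and
values):
</para>
<screen>
# event : hardware event number        counters : counters that can count it
# um : unit mask set                   minimum : smallest sensible reset count
event:0x43 counters:0,1,2,3 um:zero minimum:500 name:DATA_CACHE_ACCESSES : data cache accesses
</screen>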

</sect1>

<sect1 id="performance-counters-programming">
<title>Programming the performance counter registers</title>

<para>
We have described how the user interface fills in the desired
configuration of the counters and transmits the information to the
kernel. It is the job of the <function>-&gt;setup()</function> method
to actually program the performance counter registers. Clearly, the
details of how this is done are architecture-specific; they are also
model-specific on many architectures. For example, i386 provides methods
for each model type that program the counter registers correctly
(see the <filename>op_model_*</filename> files in
<filename>arch/i386/oprofile</filename> for the details). The method
reads the values stored in the virtual oprofilefs files and programs
the registers appropriately, ready for starting the actual profiling
session.
</para>
<para>
The architecture-specific drivers make sure to save the old register
settings before doing OProfile setup. They are restored when OProfile
shuts down. This is useful, for example, on i386, where the NMI watchdog
uses the same performance counter registers as OProfile; they cannot
run concurrently, but OProfile makes sure to restore the setup it found
before it was running.
</para>
<para>
In addition to programming the counter registers themselves, other setup
is often necessary. For example, on i386, the local APIC needs
programming in order to make the counter's overflow interrupt appear as
an NMI (non-maskable interrupt). This allows sampling (and therefore
profiling) of regions where "normal" interrupts are masked, enabling
more reliable profiles.
</para>

<sect2 id="performance-counters-start">
<title>Starting and stopping the counters</title>
<para>
Initiating a profiling session is done by writing an ASCII '1'
to the file <filename>/dev/oprofile/enable</filename>. This sets up the
core, and calls into the architecture-specific driver to actually
enable each configured counter. Again, the details of how this is
done are model-specific (for example, the Athlon models can disable
or enable on a per-counter basis, unlike the PPro models).
</para>
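<para>
In practice, starting and stopping a session thus amounts to:
</para>
<screen>
# echo 1 &gt; /dev/oprofile/enable    (start the counters)
# echo 0 &gt; /dev/oprofile/enable    (stop them again)
</screen>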
</sect2>

<sect2>
<title>IA64 and perfmon</title>
<para>
The IA64 architecture provides a different interface from the other
architectures, using the existing perfmon driver. Register programming
is handled entirely in user-space (see
<filename>daemon/opd_perfmon.c</filename> for the details). A process
is forked for each CPU, which creates a perfmon context and sets the
counter registers appropriately via the
<function>sys_perfmonctl</function> interface. In addition, the actual
initiation and termination of the profiling session is handled via the
same interface using <constant>PFM_START</constant> and
<constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs
files for the performance counters, as the kernel driver does not
program the registers itself.
</para>
<para>
Instead, the perfmon driver for OProfile simply registers with the
perfmon core, using an OProfile-specific UUID. During a profiling
session, the perfmon core calls into the OProfile perfmon driver and
samples are registered with the OProfile core itself as usual (with
<function>oprofile_add_sample()</function>).
</para>
</sect2>

</sect1>

</chapter>

<chapter id="collecting-samples">
<title>Collecting and processing samples</title>

<sect1 id="receiving-interrupts">
<title>Receiving interrupts</title>
<para>
Naturally, how the overflow interrupts are received is specific
to the hardware architecture, unless we are in "timer" mode, where the
logging routine is called directly from the standard kernel timer
interrupt handler.
</para>
<para>
On the i386 architecture, the local APIC is programmed such that when a
counter overflows (that is, it receives an event that causes an integer
overflow of the register value to zero), an NMI is generated. This calls
into the general handler <function>do_nmi()</function>; because OProfile
has registered itself as capable of handling NMI interrupts, this will
call into the OProfile driver code in
<filename>arch/i386/oprofile</filename>. Here, the saved PC value (the
CPU saves the register set at the time of the interrupt on the stack,
where it is available for inspection) is extracted, and the counters are examined to
find out which one generated the interrupt. Also determined is whether
the system was inside kernel or user space at the time of the interrupt.
These three pieces of information are then forwarded onto the OProfile
core via <function>oprofile_add_sample()</function>. Finally, the
counter values are reset to the chosen count value, to ensure another
interrupt happens after another N events have occurred. Other
architectures behave in a similar manner.
</para>
</sect1>

<sect1 id="core-structure">
<title>Core data structures</title>
<para>
Before considering what happens when we log a sample, we shall digress
for a moment and look at the general structure of the data collection
system.
</para>
<para>
OProfile maintains a small buffer for storing the logged samples for
each CPU on the system. Only this buffer is altered when we actually log
a sample (remember, we may still be in an NMI context, so no locking is
possible). The buffer is managed by a two-handed system; the "head"
iterator dictates where the next sample data should be placed in the
buffer. Of course, overflow of the buffer is possible, in which case
the sample is discarded.
</para>
<para>
It is critical to remember that at this point, the PC value is an
absolute value, and is therefore only meaningful in the context of which
task it was logged against. Thus, these per-CPU buffers also maintain
details of which task each logged sample is for, as described in the
next section. In addition, we store whether the sample was in kernel
space or user space (on some architectures and configurations, the address
space is not sub-divided neatly at a specific PC value, so we must store
this information).
</para>
<para>
As well as these small per-CPU buffers, we have a considerably larger
single buffer. This holds the data that is eventually copied out into
the OProfile daemon. On certain system events, the per-CPU buffers are
processed and entered (in mutated form) into the main buffer, known in
the source as the "event buffer". The "tail" iterator indicates the
point from which the CPU buffer may be read, up to the position of the "head"
iterator. This provides an entirely lock-free method for extracting data
from the CPU buffers. This process is described in detail later in this chapter.
</para>
<figure><title>The OProfile buffers</title>
<graphic fileref="buffers.png" />
</figure>
</sect1>

<sect1 id="logging-sample">
<title>Logging a sample</title>
<para>
As mentioned, the sample is logged into the buffer specific to the
current CPU. The CPU buffer is a simple array of pairs of unsigned long
values; for a sample, they hold the PC value and the counter for the
sample. (The counter value is later used to translate back into the relevant
event type the counter was programmed to count.)
</para>
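<para>
A sketch of the entry format (paraphrased from
<filename>drivers/oprofile/cpu_buffer.h</filename>):
</para>
<screen>
/* one logged entry: the saved PC and the counter that overflowed */
struct op_sample {
	unsigned long eip;
	unsigned long event;
};
</screen>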
<para>
In addition to logging the sample itself, we also log task switches.
This is simply done by storing the address of the last task to log a
sample on that CPU in a data structure, and writing a task switch entry
into the buffer if the new value of <function>current()</function> has
changed. Note that later we will directly dereference this pointer;
this imposes certain restrictions on when and how the CPU buffers need
to be processed.
</para>
<para>
Finally, as mentioned, we log whether we have changed between kernel and
userspace using a similar method. Both of these variables
(<varname>last_task</varname> and <varname>last_is_kernel</varname>) are
reset when the CPU buffer is read.
</para>
</sect1>

<sect1 id="logging-stack">
<title>Logging stack traces</title>
<para>
OProfile can also provide statistical samples of call chains (on x86). To
do this, at sample time, the frame pointer chain is traversed, recording
the return address for each stack frame. This will only work if the code
was compiled with frame pointers, but we're careful to abort the
traversal if the frame pointer appears bad. We store the set of return
addresses straight into the CPU buffer. Note that, since this traversal
is keyed off the standard sample interrupt, the number of times a
function appears in a stack trace is not an indicator of how many times
the call site was executed: rather, it's related to the number of
samples we took where that call site was involved. Thus, the results for
stack traces are not necessarily proportional to the call counts:
typical programs will have many <function>main()</function> samples.
</para>
</sect1>

<sect1 id="synchronising-buffers">
<title>Synchronising the CPU buffers to the event buffer</title>
<!-- FIXME: update when percpu patch goes in -->
<para>
At some point, we have to process the data in each CPU buffer and enter
it into the main (event) buffer. The file
<filename>buffer_sync.c</filename> contains the relevant code. We
periodically (currently every <constant>HZ</constant>/4 jiffies) start
the synchronisation process. In addition, we process the buffers on
certain events, such as an application calling
<function>munmap()</function>. This is particularly important for
<function>exit()</function> - because the CPU buffers contain pointers
to the task structure, if we don't process all the buffers before the
task is actually destroyed and the task structure freed, then we could
end up trying to dereference a bogus pointer in one of the CPU buffers.
</para>
<para>
We also add a notification when a kernel module is loaded; this is so
that user-space can re-read <filename>/proc/modules</filename> to
determine the load addresses of kernel module text sections. Without
this notification, samples for a newly-loaded module could get lost or
be attributed to the wrong module.
</para>
<para>
The synchronisation itself works in the following manner: first, mutual
exclusion on the event buffer is taken. Remember, we do not need to do
that for each CPU buffer, as we only read from the tail iterator
(interrupts might still be arriving at the same buffer, but they will
write to the position of the head iterator, leaving previously written
entries intact). Then, we process each CPU buffer in turn. A CPU switch
notification is added to the buffer first (for
<option>--separate=cpu</option> support). Then the processing of the
actual data starts.
</para>
<para>
As mentioned, the CPU buffer consists of task switch entries and the
actual samples. When the routine <function>sync_buffer()</function> sees
a task switch, the process ID and process group ID are recorded into the
event buffer, along with a dcookie (see below) identifying the
application binary (e.g. <filename>/bin/bash</filename>). The
<varname>mmap_sem</varname> for the task is then taken, to allow safe
iteration across the task's list of mapped areas. Each sample is then
processed as described in the next section.
</para>
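<para>
In outline, the draining loop looks like this (a much-simplified sketch
of <function>sync_buffer()</function>, with names paraphrased and the
kernel/user transitions omitted):
</para>
<screen>
/* sketch: drain one CPU buffer into the event buffer */
for (i = 0; i &lt; available; ++i) {
	struct op_sample *s = &amp;cpu_buf-&gt;buffer[cpu_buf-&gt;tail_pos];

	if (is_code(s-&gt;eip)) {
		/* a context entry: e.g. note the new task and the
		 * dcookie of its application binary */
		struct task_struct *new = (struct task_struct *)s-&gt;event;
		cookie = get_exec_dcookie(new-&gt;mm);
		add_user_ctx_switch(new, cookie);
	} else {
		/* an ordinary sample: convert the PC into an
		 * (image, offset) pair and log it */
		add_sample(mm, s, in_kernel);
	}
	increment_tail(cpu_buf);
}
</screen>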
<para>
After a buffer has been read, the tail iterator is updated to reflect
how much of the buffer was processed. Note that when we determined how
much data there was to read in the CPU buffer, we also called
<function>cpu_buffer_reset()</function> to reset
<varname>last_task</varname> and <varname>last_is_kernel</varname>, as
we've already mentioned. During the processing, more samples may have
been arriving in the CPU buffer; this is OK because we are careful to
only update the tail iterator to how much we actually read - on the next
buffer synchronisation, we will start again from that point.
</para>
</sect1>

<sect1 id="dentry-cookies">
<title>Identifying binary images</title>
<para>
In order to produce useful profiles, we need to be able to associate a
particular PC value sample with an actual ELF binary on the disk. This
leaves us with the problem of how to export this information to
user-space. We create unique IDs that identify a particular directory
entry (dentry), and write those IDs into the event buffer. Later on,
the user-space daemon can call the <function>lookup_dcookie</function>
system call, which looks up the ID and fills in the full path of
the binary image in the buffer user-space passes in. These IDs are
maintained by the code in <filename>fs/dcookies.c</filename>; the
cache lasts for as long as the daemon has the event buffer open.
</para>
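<para>
From user-space the lookup is a plain system call; a minimal sketch
(most C libraries provide no wrapper, so the daemon's own wrapper in
<filename>daemon/opd_cookie.c</filename> is the reference):
</para>
<screen>
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

/* sketch: resolve a dcookie to a fully-qualified path in buf; note
 * that on some 32-bit ABIs the 64-bit cookie is passed as two
 * arguments, which the daemon's wrapper takes care of */
static int resolve_cookie(unsigned long long cookie, char *buf, size_t size)
{
	return syscall(SYS_lookup_dcookie, cookie, buf, size);
}
</screen>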
</sect1>

<sect1 id="finding-dentry">
<title>Finding a sample's binary image and offset</title>
<para>
We haven't yet described how we process the absolute PC value into
something usable by the user-space daemon. When we find a sample entered
into the CPU buffer, we traverse the list of mappings for the task
(remember, we will have seen a task switch earlier, so we know which
task's lists to look at). When a mapping is found that contains the PC
value, we look up the mapped file's dentry in the dcookie cache. This
gives the dcookie ID that will uniquely identify the mapped file. Then
we alter the absolute value such that it is an offset from the start of
the file being mapped (the mapping need not start at the start of the
actual file, so we have to consider the offset value of the mapping). We
store this dcookie ID into the event buffer; this identifies which
binary the samples following it are against.
In this manner, we have converted a PC value, which has transitory
meaning only, into a static offset value for later processing by the
daemon.
</para>
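<para>
The conversion itself is small; a paraphrased sketch of the relevant
logic in <filename>buffer_sync.c</filename>:
</para>
<screen>
/* sketch: convert an absolute PC into a (dcookie, offset) pair; the
 * mapping need not begin at file offset 0, hence vm_pgoff */
for (vma = find_vma(mm, addr); vma; vma = vma-&gt;vm_next) {
	if (addr &lt; vma-&gt;vm_start || addr &gt;= vma-&gt;vm_end)
		continue;
	if (!vma-&gt;vm_file)
		continue;
	cookie = fast_get_dcookie(vma-&gt;vm_file-&gt;f_dentry,
	                          vma-&gt;vm_file-&gt;f_vfsmnt);
	offset = (vma-&gt;vm_pgoff &lt;&lt; PAGE_SHIFT) + addr - vma-&gt;vm_start;
	break;
}
</screen>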
<para>
We also attempt to avoid the relatively expensive lookup of the dentry
cookie value by storing the cookie value directly into the dentry
itself; then we can simply derive the cookie value immediately when we
find the correct mapping.
</para>
</sect1>

</chapter>

<chapter id="sample-files">
<title>Generating sample files</title>

<sect1 id="processing-buffer">
<title>Processing the buffer</title>

<para>
Now we can move onto user-space in our description of how raw interrupt
samples are processed into useful information. As we described in
previous sections, the kernel OProfile driver creates a large buffer of
sample data consisting of offset values, interspersed with
notification of changes in context. These context changes indicate how
following samples should be attributed, and include task switches, CPU
changes, and which dcookie the sample value is against. By processing
this buffer entry-by-entry, we can determine where the samples should
be attributed to. This is particularly important when using the
<option>--separate</option> option.
</para>
<para>
The file <filename>daemon/opd_trans.c</filename> contains the basic routine
for the buffer processing. The <varname>struct transient</varname>
structure is used to hold changes in context. Its members are modified
as we process each entry; it is passed into the routines in
<filename>daemon/opd_sfile.c</filename> for actually logging the sample
to a particular sample file (which will be held in
<filename>$SESSION_DIR/samples/current</filename>).
</para>
<para>
The buffer format is designed for conciseness, as high sampling rates
can easily generate a lot of data. Thus, context changes are prefixed
by an escape code, identified by <function>is_escape_code()</function>.
If an escape code is found, the next entry in the buffer identifies
what type of context change is being read. These are handed off to
various handlers (see the <varname>handlers</varname> array), which
modify the transient structure as appropriate. If it's not an escape
code, then it must be a PC offset value, and the very next entry will
be the numeric hardware counter. These values are read and recorded
in the transient structure; we then do a lookup to find the correct
sample file, and log the sample, as described in the next section.
</para>
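<para>
Schematically, the loop looks like this (a condensed paraphrase of the
processing in <filename>daemon/opd_trans.c</filename>; the real code
also copes with truncated buffers and unknown escape codes):
</para>
<screen>
/* sketch: walk the raw event buffer entry by entry, keeping the
 * current context in the struct transient */
while (trans.remaining) {
	unsigned long code = pop_buffer_value(&amp;trans);

	if (is_escape_code(code)) {
		/* the next word says which context change this is */
		handlers[pop_buffer_value(&amp;trans)](&amp;trans);
	} else {
		/* a sample: the PC offset, then the counter number */
		trans.pc = code;
		trans.event = pop_buffer_value(&amp;trans);
		sfile_log_sample(&amp;trans);
	}
}
</screen>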

<sect2 id="handling-kernel-samples">
<title>Handling kernel samples</title>

<para>
Samples from kernel code require a little special handling. Because
the binary text which the sample is against does not correspond to
any file that the kernel directly knows about, the OProfile driver
stores the absolute PC value in the buffer, instead of the file offset.
Of course, we need an offset against some particular binary. To handle
this, we keep a list of loaded modules by parsing
<filename>/proc/modules</filename> as needed. When a module is loaded,
a notification is placed in the OProfile buffer, and this triggers a
re-read. We store the module name, and the loading address and size.
This is also done for the main kernel image, as specified by the user.
The absolute PC value is matched against each address range, and
modified into an offset when the matching module is found. See
<filename>daemon/opd_kernel.c</filename> for the details.
</para>
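<para>
The matching step is straightforward; a sketch under the assumption of
a simple linked list of ranges (<filename>daemon/opd_kernel.c</filename>
keeps rather more state than this):
</para>
<screen>
#include &lt;stddef.h&gt;

/* sketch: one entry per kernel image or module text range */
struct kernel_image {
	char const *name;
	unsigned long start;	/* loading address */
	unsigned long end;	/* start + size */
	struct kernel_image *next;
};

/* match an absolute kernel PC against each range; the sample's
 * offset is then pc - img-&gt;start */
static struct kernel_image *find_kernel_image(struct kernel_image *list,
                                              unsigned long pc)
{
	struct kernel_image *img;

	for (img = list; img; img = img-&gt;next)
		if (pc &gt;= img-&gt;start &amp;&amp; pc &lt; img-&gt;end)
			return img;
	return NULL;
}
</screen>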

</sect2>

</sect1>

<sect1 id="sample-file-generation">
<title>Locating and creating sample files</title>

<para>
We have a sample value and its satellite data stored in a
<varname>struct transient</varname>, and we must locate an
actual sample file to store the sample in, using the context
information in the transient structure as a key. The transient data to
sample file lookup is handled in
<filename>daemon/opd_sfile.c</filename>. A hash is taken of the
transient values that are relevant (depending upon the setting of
<option>--separate</option>, some values might be irrelevant), and the
hash value is used to look up the list of currently open sample files.
Of course, the sample file might not be found, in which case we need
to create and open it.
</para>
<para>
OProfile uses a rather complex scheme for naming sample files, in order
to make selecting relevant sample files easier for the post-profiling
utilities. The exact details of the scheme are given in
<filename>oprofile-tests/pp_interface</filename>, but for now it will
suffice to remember that the filename will include only relevant
information for the current settings, taken from the transient data. A
fully-specified filename looks something like:
</para>
<computeroutput>
/var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0
</computeroutput>
<para>
It should be clear that this identifies such information as the
application binary, the dependent (library) binary, the hardware event,
and the process and thread ID. Typically, not all this information is
needed, in which case some values may be replaced with the token
<filename>all</filename>.
</para>
<para>
The code that generates this filename and opens the file is found in
<filename>daemon/opd_mangling.c</filename>. You may have realised that
at this point, we do not have the binary image file names, only the
dcookie values. In order to determine a file name, a dcookie value is
looked up in the dcookie cache. This is to be found in
<filename>daemon/opd_cookie.c</filename>. Since dcookies are both
persistent and unique during a sampling session, we can cache the
values. If the value is not found in the cache, then we ask the kernel
to do the lookup from value to file name for us by calling
<function>lookup_dcookie()</function>. This looks up the value in a
kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns
the fully-qualified file name to userspace.
</para>

</sect1>

<sect1 id="sample-file-writing">
<title>Writing data to a sample file</title>

<para>
Each specific sample file is a hashed collection, where the key is
the PC offset from the transient data, and the value is the number of
samples recorded against that offset. The files are
<function>mmap()</function>ed into the daemon's memory space. The code
that actually logs a sample against the sample file can be found in
<filename>libdb/</filename>.
</para>
<para>
For recording stack traces, we have a more complicated sample filename
mangling scheme that allows us to identify cross-binary calls. We use
the same sample file format, where the key is a 64-bit value composed
from the from,to pair of offsets.
</para>
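<para>
That is, roughly (a sketch; the daemon packs the key along these lines):
</para>
<screen>
#include &lt;stdint.h&gt;

/* sketch: pack a from,to arc into the single 64-bit sample-file key */
static uint64_t arc_key(uint64_t from, uint64_t to)
{
	return (from &lt;&lt; 32) | (to &amp; 0xffffffff);
}
</screen>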

</sect1>

</chapter>

<chapter id="output">
<title>Generating useful output</title>

<para>
All of the tools used to generate human-readable output have to take
roughly the same steps to collect the data for processing. First, the
profile specification given by the user has to be parsed. Next, a list
of sample files matching the specification has to be obtained. Using this
list, we need to locate the binary file for each sample file, and then
use them to extract meaningful data, before a final collation and
presentation to the user.
</para>

<sect1 id="profile-specification">
<title>Handling the profile specification</title>

<para>
The profile specification presented by the user is parsed in
the function <function>profile_spec::create()</function>. This
creates an object representing the specification. Then we
use <function>profile_spec::generate_file_list()</function>
to search for all sample files and match them against the
<varname>profile_spec</varname>.
</para>

<para>
To enable this matching process to work, the attributes of
each sample file are encoded in its filename. This is a low-tech
approach to matching specifications against candidate sample
files, but it works reasonably well. Typical sample files
might look like these:
</para>
<screen>
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
</screen>
<para>
This looks unnecessarily complex, but it's actually fairly simple. First
we have the session of the sample, by default located at
<filename>/var/lib/oprofile/samples/current</filename>. This location
can be changed by specifying the <option>--session-dir</option> option
on the command line.
This session could equally well be inside an archive from <command>oparchive</command>.
Next we have one of the tokens <filename>{root}</filename> or
<filename>{kern}</filename>. <filename>{root}</filename> indicates
that the binary is found on a file system, and we will encode its path
in the next section (e.g. <filename>/bin/ls</filename>).
<filename>{kern}</filename> indicates a kernel module - on 2.6 kernels
the path information is not available from the kernel, so we have to
special-case kernel modules like this; we encode merely the name of the
module as loaded.
</para>
<para>
Next there is a <filename>{dep}</filename> token, indicating another
token/path which identifies the dependent binary image. This is used even for
the "primary" binary (i.e. the one that was
<function>execve()</function>d), as it simplifies processing. Finally,
if this sample file is a normal flat profile, the actual file is next in
the path. If it's a call-graph sample file, we need one further
specification, to allow us to identify cross-binary arcs in the call
graph.
</para>
<para>
The actual sample file name is dot-separated, where the fields are, in
order: event name, event count, unit mask, task group ID, task ID, and
CPU number.
</para>
<para>
This sample file name can be reliably parsed (with
<function>parse_filename()</function>) into a
<varname>filename_spec</varname>. Finally, we can check whether to
include the sample file in the final results by comparing this
<varname>filename_spec</varname> against the
<varname>profile_spec</varname> the user specified (for the interested,
see <function>valid_candidate()</function> and
<function>profile_spec::match</function>). Then comes the really
complicated bit...
</para>

</sect1>

<sect1 id="sample-file-collating">
<title>Collating the candidate sample files</title>

<para>
At this point we have a duplicate-free list of sample files we need
to process. But first we need to do some further arrangement: we
need to classify each sample file, and we may also need to "invert"
the profiles.
</para>

<sect2 id="sample-file-classifying">
<title>Classifying sample files</title>

<para>
It's possible for utilities like <command>opreport</command> to show
data in columnar format: for example, we might want to show the results
of two threads within a process side-by-side. To do this, we need
to classify each sample file into classes - the classes correspond
to each <command>opreport</command> column. The function that handles
this is <function>arrange_profiles()</function>. Each sample file
is added to a particular class. If the sample file is the first in
its class, a template is generated from the sample file. Each template
describes a particular class (thus, in our example above, each template
will have a different thread ID, and this uniquely identifies each
class).
</para>

<para>
Each class has a list of "profile sets" matching that class's template.
A profile set is either a profile of the primary binary image, or any of
its dependent images. After all sample files have been listed in one of
the profile sets belonging to the classes, we have to name each class and
perform error-checking. This is done by
<function>identify_classes()</function>; each class is checked to ensure
that its "axis" is the same as all the others. This is needed because
<command>opreport</command> can't produce results in 3D format: we can
only differ in one aspect, such as thread ID or event name.
</para>

</sect2>

<sect2 id="sample-file-inverting">
<title>Creating inverted profile lists</title>

<para>
Remember that if we're using certain profile separation options, such as
"--separate=lib", a single binary could be a dependent image to many
different binaries. For example, the C library image would be a
dependent image for most programs that have been profiled. As it
happens, this can cause severe performance problems: without some
re-arrangement, these dependent binary images would be opened each
time we need to process sample files for each program.
</para>

<para>
The solution is to "invert" the profiles via
<function>invert_profiles()</function>. We create a new data structure
where the dependent binary is first, and the primary binary images using
that dependent binary are listed as sub-images. This helps our
performance problem, as now we only need to open each dependent image
once, when we process the list of inverted profiles.
</para>

</sect2>

</sect1>

<sect1 id="generating-profile-data">
<title>Generating profile data</title>

<para>
Things don't get any simpler, unfortunately. At this point
we've collected and classified the sample files into the set of inverted
profiles, as described in the previous section. Now we need to process
each inverted profile and make something of the data. The entry point
for this is <function>populate_for_image()</function>.
</para>

<sect2 id="bfd">
<title>Processing the binary image</title>
<para>
The first thing we do with an inverted profile is attempt to open the
binary image (remember each inverted profile set is only for one binary
image, but may have many sample files to process). The
<varname>op_bfd</varname> class provides an abstracted interface to
this; internally it uses <filename>libbfd</filename>. The main purpose
of this class is to process the symbols for the binary image; this is
also where symbol filtering happens. This is actually quite tricky, but
should be clear from the source.
</para>
</sect2>

<sect2 id="processing-sample-files">
<title>Processing the sample files</title>
<para>
The class <varname>profile_container</varname> is a hold-all that
contains all the processed results. It is a container of
<varname>profile_t</varname> objects. The
<function>add_sample_files()</function> method uses
<filename>libdb</filename> to open the given sample file and add the
key/value types to the <varname>profile_t</varname>. Once this has been
done, <function>profile_container::add()</function> is passed the
<varname>profile_t</varname> plus the <varname>op_bfd</varname> for
processing.
</para>
<para>
<function>profile_container::add()</function> walks through the symbols
collected in the <varname>op_bfd</varname>.
<function>op_bfd::get_symbol_range()</function> gives us the start and
end of the symbol as an offset from the start of the binary image,
then we interrogate the <varname>profile_t</varname> for the relevant samples
for that offset range. We create a <varname>symbol_entry</varname>
object for this symbol and fill it in. If needed, here we also collect
debug information from the <varname>op_bfd</varname>, and possibly
record the detailed sample information (as used by <command>opreport
-d</command> and <command>opannotate</command>).
Finally the <varname>symbol_entry</varname> is added to
a private container of <varname>profile_container</varname> - this
<varname>symbol_container</varname> holds all such processed symbols.
</para>
</sect2>

</sect1>


<sect1 id="generating-output">
<title>Generating output</title>

<para>
After the processing described in the previous section, we've now got
full details of what we need to output stored in the
<varname>profile_container</varname> on a symbol-by-symbol basis. To
produce output, we need to replay that data and format it suitably.
</para>
<para>
<command>opreport</command> first asks the
<varname>profile_container</varname> for a
<varname>symbol_collection</varname> (this is also where thresholding
happens).
This is sorted, then an
<varname>opreport_formatter</varname> is initialised.
This object initialises a set of field formatters as requested. Then
<function>opreport_formatter::output()</function> is called. This
iterates through the (sorted) <varname>symbol_collection</varname>;
for each entry, the selected fields (as set by the
<varname>format_flags</varname> options) are output by calling the
field formatters, with the <varname>symbol_entry</varname> passed in.
</para>

</sect1>

</chapter>

<chapter id="ext">
<title>Extended Feature Interface</title>

<sect1 id="ext-intro">
<title>Introduction</title>

<para>
The Extended Feature Interface is a standard callback interface
designed to allow extension of the OProfile daemon's sample processing.
Each feature defines a set of callback handlers which can be enabled or
disabled through the OProfile daemon's command-line option.
This interface can be used to implement support for architecture-specific
features or features not commonly used by general OProfile users.
</para>

</sect1>


<sect1 id="ext-name-and-handlers">
<title>Feature Name and Handlers</title>

<para>
Each extended feature has an entry in the <varname>ext_feature_table</varname>
in <filename>opd_extended.cpp</filename>. Each entry contains a feature name
and a corresponding set of handlers. The feature name is a unique string, which is
used to identify a feature in the table. Each feature provides a set
of handlers, which will be executed by the OProfile daemon from pre-determined
locations to perform certain tasks. At runtime, the OProfile daemon calls a feature
handler wrapper from one of the pre-determined locations to check whether
an extended feature is enabled, and whether a particular handler exists.
Only the handlers of the enabled feature will be executed.
</para>

</sect1>


<sect1 id="ext-enable">
<title>Enabling Features</title>

<para>
Each feature is enabled using the OProfile daemon (oprofiled) command-line
option "--ext-feature=&lt;extended-feature-name&gt;:[args]". The
"extended-feature-name" is used to determine the feature to be enabled.
The optional "args" is passed into the feature-specific initialization handler
(<function>ext_init</function>). Currently, only one extended feature can be
enabled at a time.
</para>

</sect1>

<sect1 id="ext-types-of-handlers">
<title>Types of Handlers</title>

<para>
Each feature is responsible for providing its own set of handlers.
The types of handler are:
</para>

<sect2 id="ext_init">
<title>ext_init Handler</title>

<para>
"ext_init" handles initialization of an extended feature. It takes an
"args" parameter, which is passed in through "oprofiled --ext-feature=&lt;
extended-feature-name&gt;:[args]". This handler is executed in the function
<function>opd_options()</function> in the file <filename>daemon/oprofiled.c
</filename>.
</para>

<note>
<para>
The ext_init handler is required for all features.
</para>
</note>

</sect2>

<sect2 id="ext_print_stats">
<title>ext_print_stats Handler</title>

<para>
"ext_print_stats" handles the extended feature statistics report. It adds
a new section in the OProfile daemon statistics report, which is normally
output to the file
<filename>/var/lib/oprofile/samples/oprofiled.log</filename>.
This handler is executed in the function <function>opd_print_stats()</function>
in the file <filename>daemon/opd_stats.c</filename>.
</para>

</sect2>

<sect2 id="ext_sfile_handlers">
<title>ext_sfile Handler</title>

<para>
"ext_sfile" contains a set of handlers related to operations on the extended
sample files (sample files for events related to the extended feature).
These operations include <function>create_sfile()</function>,
<function>sfile_dup()</function>, <function>close_sfile()</function>,
<function>sync_sfile()</function>, and <function>get_file()</function>
as defined in <filename>daemon/opd_sfile.c</filename>.
An additional field, <varname>odb_t * ext_file</varname>, is added to
<varname>struct sfile</varname> for storing extended sample file
information.
</para>

</sect2>

</sect1>

<sect1 id="ext-implementation">
<title>Extended Feature Reference Implementation</title>

<sect2 id="ext-ibs">
<title>Instruction-Based Sampling (IBS)</title>

<para>
An example of extended feature implementation can be seen by
examining the AMD Instruction-Based Sampling support.
</para>

<sect3 id="ibs-init">
<title>IBS Initialization</title>

<para>
Instruction-Based Sampling (IBS) is a new performance measurement technique
available on AMD Family 10h processors. Enabling IBS profiling is done simply
by specifying IBS performance events through the "--event=" option.
</para>

<screen>
opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;

Note: * Count and unitmask for all IBS fetch events must be the same,
        as must those for all IBS op events.
</screen>

<para>
IBS performance events are listed by <command>opcontrol --list-events</command>.
When users specify these events, opcontrol verifies them using ophelp, which
checks for the <varname>ext:ibs_fetch</varname> or <varname>ext:ibs_op</varname>
tag in the <filename>events/x86-64/family10/events</filename> file.
Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and
/dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows.
</para>

<screen>
oprofiled \
    --ext-feature=ibs:\
    fetch:&lt;IBS_FETCH_EVENT1&gt;,&lt;IBS_FETCH_EVENT2&gt;,...,:&lt;IBS fetch count&gt;:&lt;IBS Fetch um&gt;|\
    op:&lt;IBS_OP_EVENT1&gt;,&lt;IBS_OP_EVENT2&gt;,...,:&lt;IBS op count&gt;:&lt;IBS op um&gt;
</screen>

<para>
Here, the OProfile daemon parses the <varname>--ext-feature</varname>
option and checks the feature name ("ibs") before calling
the initialization function to handle the string
containing IBS events, counts, and unitmasks.
Then, it stores each event in the IBS virtual-counter table
(<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and
stores the event index in the IBS Virtual Counter Index (VCI) map
(<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with the IBS event value
as the map key.
</para>
</sect3>

<sect3 id="ibs-data-processing">
<title>IBS Data Processing</title>

<para>
During a profile session, the OProfile daemon identifies IBS samples in the
event buffer using the <varname>"IBS_FETCH_CODE"</varname> or
<varname>"IBS_OP_CODE"</varname>. These codes trigger the handlers
<function>code_ibs_fetch_sample()</function> or
<function>code_ibs_op_sample()</function> listed in the
<varname>handler_t handlers[]</varname> vector in
<filename>daemon/opd_trans.c</filename>. These handlers are responsible for
processing IBS samples and translating them into IBS performance events.
</para>

<para>
Unlike traditional performance events, each IBS sample can be derived into
multiple IBS performance events. For each event that the user specifies,
a combination of bits from the Model-Specific Registers (MSRs) is checked
against the bitmask defining the event. If the condition is met, the event
will then be recorded. The derivation logic is in the files
<filename>daemon/opd_ibs_macro.h</filename> and
<filename>daemon/opd_ibs_trans.[h,c]</filename>.
</para>

</sect3>

<sect3 id="ibs-sample-file">
<title>IBS Sample File</title>

<para>
Traditionally, sample file information (<varname>odb_t</varname>) is stored
in <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>.
Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha and 20 on
alpha-based systems. The event index (the counter number on which the event
is configured) is used to access the corresponding entry in the array.
Unlike the traditional performance events, IBS does not use the actual
counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>).
Also, the number of performance events generated by IBS could be larger than
<varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op
events). Therefore IBS requires a special data structure and sfile
handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing
IBS sample files. IBS sample file information is stored in memory
allocated by the handler <function>ibs_sfile_create()</function>, which can
be accessed through <varname>struct sfile::odb_t * ext_files</varname>.
</para>

</sect3>

</sect2>

</sect1>

</chapter>

<glossary id="glossary">
<title>Glossary of OProfile source concepts and types</title>

<glossentry><glossterm>application image</glossterm>
<glossdef><para>
The primary binary image used by an application. This is derived
from the kernel and corresponds to the binary started upon running
an application: for example, <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>binary image</glossterm>
<glossdef><para>
An ELF file containing executable code: this includes kernel modules,
the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries,
and application binaries.
</para></glossdef></glossentry>

<glossentry><glossterm>dcookie</glossterm>
<glossdef><para>
Short for "dentry cookie". A unique ID that can be looked up to provide
the full path name of a binary image.
</para></glossdef></glossentry>

<glossentry><glossterm>dependent image</glossterm>
<glossdef><para>
A binary image that is dependent upon an application, used with
per-application separation. Most commonly, shared libraries. For example,
if <filename>/bin/bash</filename> is running and we take
some samples inside the C library itself due to <command>bash</command>
calling library code, then the image <filename>/lib/libc.so</filename>
would be dependent upon <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>merging</glossterm>
<glossdef><para>
This refers to the ability to merge several distinct sample files
into one set of data at runtime, in the post-profiling tools. For example,
per-thread sample files can be merged into one set of data, because
they are compatible (i.e. the aggregation of the data is meaningful),
but it's not possible to merge sample files for two different events,
because there would be no useful meaning to the results.
</para></glossdef></glossentry>

<glossentry><glossterm>profile class</glossterm>
<glossdef><para>
A collection of profile data that has been collected under the same
class template. For example, if we're using <command>opreport</command>
to show results after profiling with two performance counters enabled,
counting <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>,
there would be two profile classes, one for each event. Or if we're on
an SMP system and doing per-cpu profiling, and we request
<command>opreport</command> to show results for each CPU side-by-side,
there would be a profile class for each CPU.
</para></glossdef></glossentry>

<glossentry><glossterm>profile specification</glossterm>
<glossdef><para>
The parameters the user passes to the post-profiling tools that limit
what sample files are used. This specification is matched against
the available sample files to generate a selection of profile data.
</para></glossdef></glossentry>

<glossentry><glossterm>profile template</glossterm>
<glossdef><para>
The parameters that define what goes in a particular profile class.
This includes a symbolic name (e.g. "cpu:1") and the code-usable
equivalent.
</para></glossdef></glossentry>

</glossary>

</book>