blob: 230516877c9ee7ed930336d2b27851c818511961 [file] [log] [blame]
Upstreamcc2ee171970-01-12 13:46:40 +00001<?xml version="1.0" encoding="ISO-8859-1"?>
2<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3<html xmlns="http://www.w3.org/1999/xhtml">
4 <head>
5 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
6 <title>OProfile Internals</title>
7 <meta name="generator" content="DocBook XSL Stylesheets V1.68.1" />
8 </head>
9 <body>
10 <div class="book" lang="en" xml:lang="en">
11 <div class="titlepage">
12 <div>
13 <div>
14 <h1 class="title"><a id="oprofile-internals"></a>OProfile Internals</h1>
15 </div>
16 <div>
17 <div class="authorgroup">
18 <div class="author">
19 <h3 class="author"><span class="firstname">John</span> <span class="surname">Levon</span></h3>
20 <div class="affiliation">
21 <div class="address">
22 <p>
23 <code class="email">&lt;<a href="mailto:levon@movementarian.org">levon@movementarian.org</a>&gt;</code>
24 </p>
25 </div>
26 </div>
27 </div>
28 </div>
29 </div>
30 <div>
31 <p class="copyright">Copyright © 2003 John Levon</p>
32 </div>
33 </div>
34 <hr />
35 </div>
36 <div class="toc">
37 <p>
38 <b>Table of Contents</b>
39 </p>
40 <dl>
41 <dt>
42 <span class="chapter">
43 <a href="#introduction">1. Introduction</a>
44 </span>
45 </dt>
46 <dd>
47 <dl>
48 <dt>
49 <span class="sect1">
50 <a href="#overview">1. Overview</a>
51 </span>
52 </dt>
53 <dt>
54 <span class="sect1">
55 <a href="#components">2. Components of the OProfile system</a>
56 </span>
57 </dt>
58 <dd>
59 <dl>
60 <dt>
61 <span class="sect2">
62 <a href="#arch-specific-components">2.1. Architecture-specific components</a>
63 </span>
64 </dt>
65 <dt>
66 <span class="sect2">
67 <a href="#filesystem">2.2. oprofilefs</a>
68 </span>
69 </dt>
70 <dt>
71 <span class="sect2">
72 <a href="#driver">2.3. Generic kernel driver</a>
73 </span>
74 </dt>
75 <dt>
76 <span class="sect2">
77 <a href="#daemon">2.4. The OProfile daemon</a>
78 </span>
79 </dt>
80 <dt>
81 <span class="sect2">
82 <a href="#post-profiling">2.5. Post-profiling tools</a>
83 </span>
84 </dt>
85 </dl>
86 </dd>
87 </dl>
88 </dd>
89 <dt>
90 <span class="chapter">
91 <a href="#performance-counters">2. Performance counter management</a>
92 </span>
93 </dt>
94 <dd>
95 <dl>
96 <dt>
97 <span class="sect1">
98 <a href="#performance-counters-ui">1. Providing a user interface</a>
99 </span>
100 </dt>
101 <dt>
102 <span class="sect1">
103 <a href="#performance-counters-programming">2. Programming the performance counter registers</a>
104 </span>
105 </dt>
106 <dd>
107 <dl>
108 <dt>
109 <span class="sect2">
110 <a href="#performance-counters-start">2.1. Starting and stopping the counters</a>
111 </span>
112 </dt>
113 <dt>
114 <span class="sect2">
115 <a href="#id2495021">2.2. IA64 and perfmon</a>
116 </span>
117 </dt>
118 </dl>
119 </dd>
120 </dl>
121 </dd>
122 <dt>
123 <span class="chapter">
124 <a href="#collecting-samples">3. Collecting and processing samples</a>
125 </span>
126 </dt>
127 <dd>
128 <dl>
129 <dt>
130 <span class="sect1">
131 <a href="#receiving-interrupts">1. Receiving interrupts</a>
132 </span>
133 </dt>
134 <dt>
135 <span class="sect1">
136 <a href="#core-structure">2. Core data structures</a>
137 </span>
138 </dt>
139 <dt>
140 <span class="sect1">
141 <a href="#logging-sample">3. Logging a sample</a>
142 </span>
143 </dt>
144 <dt>
145 <span class="sect1">
146 <a href="#logging-stack">4. Logging stack traces</a>
147 </span>
148 </dt>
149 <dt>
150 <span class="sect1">
151 <a href="#synchronising-buffers">5. Synchronising the CPU buffers to the event buffer</a>
152 </span>
153 </dt>
154 <dt>
155 <span class="sect1">
156 <a href="#dentry-cookies">6. Identifying binary images</a>
157 </span>
158 </dt>
159 <dt>
160 <span class="sect1">
161 <a href="#finding-dentry">7. Finding a sample's binary image and offset</a>
162 </span>
163 </dt>
164 </dl>
165 </dd>
166 <dt>
167 <span class="chapter">
168 <a href="#sample-files">4. Generating sample files</a>
169 </span>
170 </dt>
171 <dd>
172 <dl>
173 <dt>
174 <span class="sect1">
175 <a href="#processing-buffer">1. Processing the buffer</a>
176 </span>
177 </dt>
178 <dd>
179 <dl>
180 <dt>
181 <span class="sect2">
182 <a href="#handling-kernel-samples">1.1. Handling kernel samples</a>
183 </span>
184 </dt>
185 </dl>
186 </dd>
187 <dt>
188 <span class="sect1">
189 <a href="#sample-file-generation">2. Locating and creating sample files</a>
190 </span>
191 </dt>
192 <dt>
193 <span class="sect1">
194 <a href="#sample-file-writing">3. Writing data to a sample file</a>
195 </span>
196 </dt>
197 </dl>
198 </dd>
199 <dt>
200 <span class="chapter">
201 <a href="#output">5. Generating useful output</a>
202 </span>
203 </dt>
204 <dd>
205 <dl>
206 <dt>
207 <span class="sect1">
208 <a href="#profile-specification">1. Handling the profile specification</a>
209 </span>
210 </dt>
211 <dt>
212 <span class="sect1">
213 <a href="#sample-file-collating">2. Collating the candidate sample files</a>
214 </span>
215 </dt>
216 <dd>
217 <dl>
218 <dt>
219 <span class="sect2">
220 <a href="#sample-file-classifying">2.1. Classifying sample files</a>
221 </span>
222 </dt>
223 <dt>
224 <span class="sect2">
225 <a href="#sample-file-inverting">2.2. Creating inverted profile lists</a>
226 </span>
227 </dt>
228 </dl>
229 </dd>
230 <dt>
231 <span class="sect1">
232 <a href="#generating-profile-data">3. Generating profile data</a>
233 </span>
234 </dt>
235 <dd>
236 <dl>
237 <dt>
238 <span class="sect2">
239 <a href="#bfd">3.1. Processing the binary image</a>
240 </span>
241 </dt>
242 <dt>
243 <span class="sect2">
244 <a href="#processing-sample-files">3.2. Processing the sample files</a>
245 </span>
246 </dt>
247 </dl>
248 </dd>
249 <dt>
250 <span class="sect1">
251 <a href="#generating-output">4. Generating output</a>
252 </span>
253 </dt>
254 </dl>
255 </dd>
256 <dt>
257 <span class="glossary">
258 <a href="#glossary">Glossary of OProfile source concepts and types</a>
259 </span>
260 </dt>
261 </dl>
262 </div>
263 <div class="list-of-figures">
264 <p>
265 <b>List of Figures</b>
266 </p>
267 <dl>
268 <dt>3.1. <a href="#id2495193">The OProfile buffers</a></dt>
269 </dl>
270 </div>
271 <div class="chapter" lang="en" xml:lang="en">
272 <div class="titlepage">
273 <div>
274 <div>
275 <h2 class="title"><a id="introduction"></a>Chapter 1. Introduction</h2>
276 </div>
277 </div>
278 </div>
279 <div class="toc">
280 <p>
281 <b>Table of Contents</b>
282 </p>
283 <dl>
284 <dt>
285 <span class="sect1">
286 <a href="#overview">1. Overview</a>
287 </span>
288 </dt>
289 <dt>
290 <span class="sect1">
291 <a href="#components">2. Components of the OProfile system</a>
292 </span>
293 </dt>
294 <dd>
295 <dl>
296 <dt>
297 <span class="sect2">
298 <a href="#arch-specific-components">2.1. Architecture-specific components</a>
299 </span>
300 </dt>
301 <dt>
302 <span class="sect2">
303 <a href="#filesystem">2.2. oprofilefs</a>
304 </span>
305 </dt>
306 <dt>
307 <span class="sect2">
308 <a href="#driver">2.3. Generic kernel driver</a>
309 </span>
310 </dt>
311 <dt>
312 <span class="sect2">
313 <a href="#daemon">2.4. The OProfile daemon</a>
314 </span>
315 </dt>
316 <dt>
317 <span class="sect2">
318 <a href="#post-profiling">2.5. Post-profiling tools</a>
319 </span>
320 </dt>
321 </dl>
322 </dd>
323 </dl>
324 </div>
325 <p>
326This document is current for OProfile version 0.9.1cvs.
327This document provides some details on the internal workings of OProfile for the
328interested hacker. This document assumes strong C, working C++, plus some knowledge of
329kernel internals and CPU hardware.
330</p>
331 <div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
332 <h3 class="title">Note</h3>
333 <p>
334Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4
335uses a very different kernel module implementation and daemon to produce the sample files.
336</p>
337 </div>
338 <div class="sect1" lang="en" xml:lang="en">
339 <div class="titlepage">
340 <div>
341 <div>
342 <h2 class="title" style="clear: both"><a id="overview"></a>1. Overview</h2>
343 </div>
344 </div>
345 </div>
346 <p>
347OProfile is a statistical continuous profiler. In other words, profiles are generated by
348regularly sampling the current registers on each CPU (from an interrupt handler, the
349saved PC value at the time of interrupt is stored), and converting that runtime PC
350value into something meaningful to the programmer.
351</p>
352 <p>
353OProfile achieves this by taking the stream of sampled PC values, along with the detail
354of which task was running at the time of the interrupt, and converting into a file offset
355against a particular binary file. Because applications <code class="function">mmap()</code>
356the code they run (be it <code class="filename">/bin/bash</code>, <code class="filename">/lib/libfoo.so</code>
357or whatever), it's possible to find the relevant binary file and offset by walking
358the task's list of mapped memory areas. Each PC value is thus converted into a tuple
359of binary-image,offset. This is something that the userspace tools can use directly
360to reconstruct where the code came from, including the particular assembly instructions,
361symbol, and source line (via the binary's debug information if present).
362</p>
363 <p>
364Regularly sampling the PC value like this approximates what actually was executed and
365how often - more often than not, this statistical approximation is good enough to
366reflect reality. In common operation, the time between each sample interrupt is regulated
367by a fixed number of clock cycles. This implies that the results will reflect where
368the CPU is spending the most time; this is obviously a very useful information source
369for performance analysis.
370</p>
371 <p>
372Sometimes though, an application programmer needs different kinds of information: for example,
373"which of the source routines cause the most cache misses ?". The rise in importance of
374such metrics in recent years has led many CPU manufacturers to provide hardware performance
375counters capable of measuring these events on the hardware level. Typically, these counters
376increment once per each event, and generate an interrupt on reaching some pre-defined
377number of events. OProfile can use these interrupts to generate samples: then, the
378profile results are a statistical approximation of which code caused how many of the
379given event.
380</p>
381 <p>
382Consider a simplified system that only executes two functions A and B. A
383takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at
384100 cycles a second, and we've set the performance counter to create an
385interrupt after a set number of "events" (in this case an event is one
386clock cycle). It should be clear that the chances of the interrupt
387occurring in function A is 1/100, and 99/100 for function B. Thus, we
388statistically approximate the actual relative performance features of
389the two functions over time. This same analysis works for other types of
390events, providing that the interrupt is tied to the number of events
391occurring (that is, after N events, an interrupt is generated).
392</p>
393 <p>
394There are typically more than one of these counters, so it's possible to set up profiling
395for several different event types. Using these counters gives us a powerful, low-overhead
396way of gaining performance metrics. If OProfile, or the CPU, does not support performance
397counters, then a simpler method is used: the kernel timer interrupt feeds samples
398into OProfile itself.
399</p>
400 <p>
401The rest of this document concerns itself with how we get from receiving samples at
402interrupt time to producing user-readable profile information.
403</p>
404 </div>
405 <div class="sect1" lang="en" xml:lang="en">
406 <div class="titlepage">
407 <div>
408 <div>
409 <h2 class="title" style="clear: both"><a id="components"></a>2. Components of the OProfile system</h2>
410 </div>
411 </div>
412 </div>
413 <div class="sect2" lang="en" xml:lang="en">
414 <div class="titlepage">
415 <div>
416 <div>
417 <h3 class="title"><a id="arch-specific-components"></a>2.1. Architecture-specific components</h3>
418 </div>
419 </div>
420 </div>
421 <p>
422If OProfile supports the hardware performance counters found on
423a particular architecture, code for managing the details of setting
424up and managing these counters can be found in the kernel source
425tree in the relevant <code class="filename">arch/<span class="emphasis"><em>arch</em></span>/oprofile/</code>
426directory. The architecture-specific implementation works via
427filling in the oprofile_operations structure at init time. This
428provides a set of operations such as <code class="function">setup()</code>,
429<code class="function">start()</code>, <code class="function">stop()</code>, etc.
430that manage the hardware-specific details of fiddling with the
431performance counter registers.
432</p>
433 <p>
434The other important facility available to the architecture code is
435<code class="function">oprofile_add_sample()</code>. This is where a particular sample
436taken at interrupt time is fed into the generic OProfile driver code.
437</p>
438 </div>
439 <div class="sect2" lang="en" xml:lang="en">
440 <div class="titlepage">
441 <div>
442 <div>
443 <h3 class="title"><a id="filesystem"></a>2.2. oprofilefs</h3>
444 </div>
445 </div>
446 </div>
447 <p>
448OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from
449userspace at <code class="filename">/dev/oprofile</code>. This consists of small
450files for reporting and receiving configuration from userspace, as well
451as the actual character device that the OProfile userspace receives samples
452from. At <code class="function">setup()</code> time, the architecture-specific may
453add further configuration files related to the details of the performance
454counters. For example, on x86, one numbered directory for each hardware
455performance counter is added, with files in each for the event type,
456reset value, etc.
457</p>
458 <p>
459The filesystem also contains a <code class="filename">stats</code> directory with
460a number of useful counters for various OProfile events.
461</p>
462 </div>
463 <div class="sect2" lang="en" xml:lang="en">
464 <div class="titlepage">
465 <div>
466 <div>
467 <h3 class="title"><a id="driver"></a>2.3. Generic kernel driver</h3>
468 </div>
469 </div>
470 </div>
471 <p>
472This lives in <code class="filename">drivers/oprofile/</code>, and forms the core of
473how OProfile works in the kernel. Its job is to take samples delivered
474from the architecture-specific code (via <code class="function">oprofile_add_sample()</code>),
475and buffer this data, in a transformed form as described later, until releasing
476the data to the userspace daemon via the <code class="filename">/dev/oprofile/buffer</code>
477character device.
478</p>
479 </div>
480 <div class="sect2" lang="en" xml:lang="en">
481 <div class="titlepage">
482 <div>
483 <div>
484 <h3 class="title"><a id="daemon"></a>2.4. The OProfile daemon</h3>
485 </div>
486 </div>
487 </div>
488 <p>
489The OProfile userspace daemon's job is to take the raw data provided by the
490kernel and write it to the disk. It takes the single data stream from the
491kernel and logs sample data against a number of sample files (found in
492<code class="filename">/var/lib/oprofile/samples/current/</code>. For the benefit
493of the "separate" functionality, the names/paths of these sample files
494are mangled to reflect where the samples were from: this can include
495thread IDs, the binary file path, the event type used, and more.
496</p>
497 <p>
498After this final step from interrupt to disk file, the data is now
499persistent (that is, changes in the running of the system do not invalidate
500stored data). So the post-profiling tools can run on this data at any
501time (assuming the original binary files are still available and unchanged,
502naturally).
503</p>
504 </div>
505 <div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="post-profiling"></a>2.5. Post-profiling tools</h3></div></div></div>
506So far, we've collected data, but we've yet to present it in a useful form
507to the user. This is the job of the post-profiling tools. In general form,
508they collate a subset of the available sample files, load and process each one
509correlated against the relevant binary file, and finally produce user-readable
510information.
511</div>
512 </div>
513 </div>
514 <div class="chapter" lang="en" xml:lang="en">
515 <div class="titlepage">
516 <div>
517 <div>
518 <h2 class="title"><a id="performance-counters"></a>Chapter 2. Performance counter management</h2>
519 </div>
520 </div>
521 </div>
522 <div class="toc">
523 <p>
524 <b>Table of Contents</b>
525 </p>
526 <dl>
527 <dt>
528 <span class="sect1">
529 <a href="#performance-counters-ui">1. Providing a user interface</a>
530 </span>
531 </dt>
532 <dt>
533 <span class="sect1">
534 <a href="#performance-counters-programming">2. Programming the performance counter registers</a>
535 </span>
536 </dt>
537 <dd>
538 <dl>
539 <dt>
540 <span class="sect2">
541 <a href="#performance-counters-start">2.1. Starting and stopping the counters</a>
542 </span>
543 </dt>
544 <dt>
545 <span class="sect2">
546 <a href="#id2495021">2.2. IA64 and perfmon</a>
547 </span>
548 </dt>
549 </dl>
550 </dd>
551 </dl>
552 </div>
553 <div class="sect1" lang="en" xml:lang="en">
554 <div class="titlepage">
555 <div>
556 <div>
557 <h2 class="title" style="clear: both"><a id="performance-counters-ui"></a>1. Providing a user interface</h2>
558 </div>
559 </div>
560 </div>
561 <p>
562The performance counter registers need programming in order to set the
563type of event to count, etc. OProfile uses a standard model across all
564CPUs for defining these events as follows :
565</p>
566 <div class="informaltable">
567 <table border="1">
568 <colgroup>
569 <col />
570 <col />
571 </colgroup>
572 <tbody>
573 <tr>
574 <td>
575 <code class="option">event</code>
576 </td>
577 <td>The event type e.g. DATA_MEM_REFS</td>
578 </tr>
579 <tr>
580 <td>
581 <code class="option">unit mask</code>
582 </td>
583 <td>The sub-events to count (more detailed specification)</td>
584 </tr>
585 <tr>
586 <td>
587 <code class="option">counter</code>
588 </td>
589 <td>The hardware counter(s) that can count this event</td>
590 </tr>
591 <tr>
592 <td>
593 <code class="option">count</code>
594 </td>
595 <td>The reset value (how many events before an interrupt)</td>
596 </tr>
597 <tr>
598 <td>
599 <code class="option">kernel</code>
600 </td>
601 <td>Whether the counter should increment when in kernel space</td>
602 </tr>
603 <tr>
604 <td>
605 <code class="option">user</code>
606 </td>
607 <td>Whether the counter should increment when in user space</td>
608 </tr>
609 </tbody>
610 </table>
611 </div>
612 <p>
613The term "unit mask" is borrowed from the Intel architectures, and can
614further specify exactly when a counter is incremented (for example,
615cache-related events can be restricted to particular state transitions
616of the cache lines).
617</p>
618 <p>
619All of the available hardware events and their details are specified in
620the textual files in the <code class="filename">events</code> directory. The
621syntax of these files should be fairly obvious. The user specifies the
622names and configuration details of the chosen counters via
623<span><strong class="command">opcontrol</strong></span>. These are then written to the kernel
624module (in numerical form) via <code class="filename">/dev/oprofile/N/</code>
625where N is the physical hardware counter (some events can only be used
626on specific counters; OProfile hides these details from the user when
627possible). On IA64, the perfmon-based interface behaves somewhat
628differently, as described later.
629</p>
630 </div>
631 <div class="sect1" lang="en" xml:lang="en">
632 <div class="titlepage">
633 <div>
634 <div>
635 <h2 class="title" style="clear: both"><a id="performance-counters-programming"></a>2. Programming the performance counter registers</h2>
636 </div>
637 </div>
638 </div>
639 <p>
640We have described how the user interface fills in the desired
641configuration of the counters and transmits the information to the
642kernel. It is the job of the <code class="function">-&gt;setup()</code> method
643to actually program the performance counter registers. Clearly, the
644details of how this is done is architecture-specific; it is also
645model-specific on many architectures. For example, i386 provides methods
646for each model type that programs the counter registers correctly
647(see the <code class="filename">op_model_*</code> files in
648<code class="filename">arch/i386/oprofile</code> for the details). The method
649reads the values stored in the virtual oprofilefs files and programs
650the registers appropriately, ready for starting the actual profiling
651session.
652</p>
653 <p>
654The architecture-specific drivers make sure to save the old register
655settings before doing OProfile setup. They are restored when OProfile
656shuts down. This is useful, for example, on i386, where the NMI watchdog
657uses the same performance counter registers as OProfile; they cannot
658run concurrently, but OProfile makes sure to restore the setup it found
659before it was running.
660</p>
661 <p>
662In addition to programming the counter registers themselves, other setup
663is often necessary. For example, on i386, the local APIC needs
664programming in order to make the counter's overflow interrupt appear as
665an NMI (non-maskable interrupt). This allows sampling (and therefore
666profiling) of regions where "normal" interrupts are masked, enabling
667more reliable profiles.
668</p>
669 <div class="sect2" lang="en" xml:lang="en">
670 <div class="titlepage">
671 <div>
672 <div>
673 <h3 class="title"><a id="performance-counters-start"></a>2.1. Starting and stopping the counters</h3>
674 </div>
675 </div>
676 </div>
677 <p>
678Initiating a profiling session is done via writing an ASCII '1'
679to the file <code class="filename">/dev/oprofile/enable</code>. This sets up the
680core, and calls into the architecture-specific driver to actually
681enable each configured counter. Again, the details of how this is
682done is model-specific (for example, the Athlon models can disable
683or enable on a per-counter basis, unlike the PPro models).
684</p>
685 </div>
686 <div class="sect2" lang="en" xml:lang="en">
687 <div class="titlepage">
688 <div>
689 <div>
690 <h3 class="title"><a id="id2495021"></a>2.2. IA64 and perfmon</h3>
691 </div>
692 </div>
693 </div>
694 <p>
695The IA64 architecture provides a different interface from the other
696architectures, using the existing perfmon driver. Register programming
697is handled entirely in user-space (see
698<code class="filename">daemon/opd_perfmon.c</code> for the details). A process
699is forked for each CPU, which creates a perfmon context and sets the
700counter registers appropriately via the
701<code class="function">sys_perfmonctl</code> interface. In addition, the actual
702initiation and termination of the profiling session is handled via the
703same interface using <code class="constant">PFM_START</code> and
704<code class="constant">PFM_STOP</code>. On IA64, then, there are no oprofilefs
705files for the performance counters, as the kernel driver does not
706program the registers itself.
707</p>
708 <p>
709Instead, the perfmon driver for OProfile simply registers with the
710OProfile core with an OProfile-specific UUID. During a profiling
711session, the perfmon core calls into the OProfile perfmon driver and
712samples are registered with the OProfile core itself as usual (with
713<code class="function">oprofile_add_sample()</code>).
714</p>
715 </div>
716 </div>
717 </div>
718 <div class="chapter" lang="en" xml:lang="en">
719 <div class="titlepage">
720 <div>
721 <div>
722 <h2 class="title"><a id="collecting-samples"></a>Chapter 3. Collecting and processing samples</h2>
723 </div>
724 </div>
725 </div>
726 <div class="toc">
727 <p>
728 <b>Table of Contents</b>
729 </p>
730 <dl>
731 <dt>
732 <span class="sect1">
733 <a href="#receiving-interrupts">1. Receiving interrupts</a>
734 </span>
735 </dt>
736 <dt>
737 <span class="sect1">
738 <a href="#core-structure">2. Core data structures</a>
739 </span>
740 </dt>
741 <dt>
742 <span class="sect1">
743 <a href="#logging-sample">3. Logging a sample</a>
744 </span>
745 </dt>
746 <dt>
747 <span class="sect1">
748 <a href="#logging-stack">4. Logging stack traces</a>
749 </span>
750 </dt>
751 <dt>
752 <span class="sect1">
753 <a href="#synchronising-buffers">5. Synchronising the CPU buffers to the event buffer</a>
754 </span>
755 </dt>
756 <dt>
757 <span class="sect1">
758 <a href="#dentry-cookies">6. Identifying binary images</a>
759 </span>
760 </dt>
761 <dt>
762 <span class="sect1">
763 <a href="#finding-dentry">7. Finding a sample's binary image and offset</a>
764 </span>
765 </dt>
766 </dl>
767 </div>
768 <div class="sect1" lang="en" xml:lang="en">
769 <div class="titlepage">
770 <div>
771 <div>
772 <h2 class="title" style="clear: both"><a id="receiving-interrupts"></a>1. Receiving interrupts</h2>
773 </div>
774 </div>
775 </div>
776 <p>
777Naturally, how the overflow interrupts are received is specific
778to the hardware architecture, unless we are in "timer" mode, where the
779logging routine is called directly from the standard kernel timer
780interrupt handler.
781</p>
782 <p>
783On the i386 architecture, the local APIC is programmed such that when a
784counter overflows (that is, it receives an event that causes an integer
785overflow of the register value to zero), an NMI is generated. This calls
786into the general handler <code class="function">do_nmi()</code>; because OProfile
787has registered itself as capable of handling NMI interrupts, this will
788call into the OProfile driver code in
789<code class="filename">arch/i386/oprofile</code>. Here, the saved PC value (the
790CPU saves the register set at the time of interrupt on the stack
791available for inspection) is extracted, and the counters are examined to
792find out which one generated the interrupt. Also determined is whether
793the system was inside kernel or user space at the time of the interrupt.
794These three pieces of information are then forwarded onto the OProfile
795core via <code class="function">oprofile_add_sample()</code>. Finally, the
796counter values are reset to the chosen count value, to ensure another
797interrupt happens after another N events have occurred. Other
798architectures behave in a similar manner.
799</p>
800 </div>
801 <div class="sect1" lang="en" xml:lang="en">
802 <div class="titlepage">
803 <div>
804 <div>
805 <h2 class="title" style="clear: both"><a id="core-structure"></a>2. Core data structures</h2>
806 </div>
807 </div>
808 </div>
809 <p>
810Before considering what happens when we log a sample, we shall digress
811for a moment and look at the general structure of the data collection
812system.
813</p>
814 <p>
815OProfile maintains a small buffer for storing the logged samples for
816each CPU on the system. Only this buffer is altered when we actually log
817a sample (remember, we may still be in an NMI context, so no locking is
818possible). The buffer is managed by a two-handed system; the "head"
819iterator dictates where the next sample data should be placed in the
820buffer. Of course, overflow of the buffer is possible, in which case
821the sample is discarded.
822</p>
823 <p>
824It is critical to remember that at this point, the PC value is an
825absolute value, and is therefore only meaningful in the context of which
826task it was logged against. Thus, these per-CPU buffers also maintain
827details of which task each logged sample is for, as described in the
828next section. In addition, we store whether the sample was in kernel
829space or user space (on some architectures and configurations, the address
830space is not sub-divided neatly at a specific PC value, so we must store
831this information).
832</p>
833 <p>
834As well as these small per-CPU buffers, we have a considerably larger
835single buffer. This holds the data that is eventually copied out into
836the OProfile daemon. On certain system events, the per-CPU buffers are
837processed and entered (in mutated form) into the main buffer, known in
838the source as the "event buffer". The "tail" iterator indicates the
839point from which the CPU may be read, up to the position of the "head"
840iterator. This provides an entirely lock-free method for extracting data
841from the CPU buffers. This process is described in detail later in this chapter.
842</p>
843 <div class="figure">
844 <a id="id2495193"></a>
845 <p class="title">
846 <b>Figure 3.1. The OProfile buffers</b>
847 </p>
848 <div>
849 <img src="buffers.png" alt="The OProfile buffers" />
850 </div>
851 </div>
852 </div>
853 <div class="sect1" lang="en" xml:lang="en">
854 <div class="titlepage">
855 <div>
856 <div>
857 <h2 class="title" style="clear: both"><a id="logging-sample"></a>3. Logging a sample</h2>
858 </div>
859 </div>
860 </div>
861 <p>
862As mentioned, the sample is logged into the buffer specific to the
863current CPU. The CPU buffer is a simple array of pairs of unsigned long
864values; for a sample, they hold the PC value and the counter for the
865sample. (The counter value is later used to translate back into the relevant
866event type the counter was programmed to).
867</p>
868 <p>
869In addition to logging the sample itself, we also log task switches.
870This is simply done by storing the address of the last task to log a
871sample on that CPU in a data structure, and writing a task switch entry
872into the buffer if the new value of <code class="function">current()</code> has
873changed. Note that later we will directly de-reference this pointer;
874this imposes certain restrictions on when and how the CPU buffers need
875to be processed.
876</p>
877 <p>
878Finally, as mentioned, we log whether we have changed between kernel and
879userspace using a similar method. Both of these variables
880(<code class="varname">last_task</code> and <code class="varname">last_is_kernel</code>) are
881reset when the CPU buffer is read.
882</p>
883 </div>
884 <div class="sect1" lang="en" xml:lang="en">
885 <div class="titlepage">
886 <div>
887 <div>
888 <h2 class="title" style="clear: both"><a id="logging-stack"></a>4. Logging stack traces</h2>
889 </div>
890 </div>
891 </div>
892 <p>
893OProfile can also provide statistical samples of call chains (on x86). To
894do this, at sample time, the frame pointer chain is traversed, recording
895the return address for each stack frame. This will only work if the code
896was compiled with frame pointers, but we're careful to abort the
897traversal if the frame pointer appears bad. We store the set of return
898addresses straight into the CPU buffer. Note that, since this traversal
899is keyed off the standard sample interrupt, the number of times a
900function appears in a stack trace is not an indicator of how many times
901the call site was executed: rather, it's related to the number of
902samples we took where that call site was involved. Thus, the results for
903stack traces are not necessarily proportional to the call counts:
904typical programs will have many <code class="function">main()</code> samples.
905</p>
906 </div>
907 <div class="sect1" lang="en" xml:lang="en">
908 <div class="titlepage">
909 <div>
910 <div>
911 <h2 class="title" style="clear: both"><a id="synchronising-buffers"></a>5. Synchronising the CPU buffers to the event buffer</h2>
912 </div>
913 </div>
914 </div>
915 <p>
916At some point, we have to process the data in each CPU buffer and enter
917it into the main (event) buffer. The file
918<code class="filename">buffer_sync.c</code> contains the relevant code. We
919periodically (currently every <code class="constant">HZ</code>/4 jiffies) start
920the synchronisation process. In addition, we process the buffers on
921certain events, such as an application calling
922<code class="function">munmap()</code>. This is particularly important for
923<code class="function">exit()</code> - because the CPU buffers contain pointers
924to the task structure, if we don't process all the buffers before the
925task is actually destroyed and the task structure freed, then we could
926end up trying to dereference a bogus pointer in one of the CPU buffers.
927</p>
928 <p>
929We also add a notification when a kernel module is loaded; this is so
930that user-space can re-read <code class="filename">/proc/modules</code> to
931determine the load addresses of kernel module text sections. Without
932this notification, samples for a newly-loaded module could get lost or
933be attributed to the wrong module.
934</p>
935 <p>
936The synchronisation itself works in the following manner: first, mutual
937exclusion on the event buffer is taken. Remember, we do not need to do
938that for each CPU buffer, as we only read from the tail iterator (whilst
939interrupts might be arriving at the same buffer, but they will write to
940the position of the head iterator, leaving previously written entries
941intact). Then, we process each CPU buffer in turn. A CPU switch
942notification is added to the buffer first (for
943<code class="option">--separate=cpu</code> support). Then the processing of the
944actual data starts.
945</p>
946 <p>
947As mentioned, the CPU buffer consists of task switch entries and the
948actual samples. When the routine <code class="function">sync_buffer()</code> sees
949a task switch, the process ID and process group ID are recorded into the
950event buffer, along with a dcookie (see below) identifying the
951application binary (e.g. <code class="filename">/bin/bash</code>). The
952<code class="varname">mmap_sem</code> for the task is then taken, to allow safe
953iteration across the tasks' list of mapped areas. Each sample is then
954processed as described in the next section.
955</p>
956 <p>
957After a buffer has been read, the tail iterator is updated to reflect
958how much of the buffer was processed. Note that when we determined how
959much data there was to read in the CPU buffer, we also called
960<code class="function">cpu_buffer_reset()</code> to reset
961<code class="varname">last_task</code> and <code class="varname">last_is_kernel</code>, as
962we've already mentioned. During the processing, more samples may have
963been arriving in the CPU buffer; this is OK because we are careful to
964only update the tail iterator to how much we actually read - on the next
965buffer synchronisation, we will start again from that point.
966</p>
967 </div>
968 <div class="sect1" lang="en" xml:lang="en">
969 <div class="titlepage">
970 <div>
971 <div>
972 <h2 class="title" style="clear: both"><a id="dentry-cookies"></a>6. Identifying binary images</h2>
973 </div>
974 </div>
975 </div>
976 <p>
977In order to produce useful profiles, we need to be able to associate a
978particular PC value sample with an actual ELF binary on the disk. This
979leaves us with the problem of how to export this information to
980user-space. We create unique IDs that identify a particular directory
981entry (dentry), and write those IDs into the event buffer. Later on,
982the user-space daemon can call the <code class="function">lookup_dcookie</code>
983system call, which looks up the ID and fills in the full path of
984the binary image in the buffer user-space passes in. These IDs are
985maintained by the code in <code class="filename">fs/dcookies.c</code>; the
986cache lasts for as long as the daemon has the event buffer open.
987</p>
988 </div>
989 <div class="sect1" lang="en" xml:lang="en">
990 <div class="titlepage">
991 <div>
992 <div>
993 <h2 class="title" style="clear: both"><a id="finding-dentry"></a>7. Finding a sample's binary image and offset</h2>
994 </div>
995 </div>
996 </div>
997 <p>
998We haven't yet described how we process the absolute PC value into
999something usable by the user-space daemon. When we find a sample entered
1000into the CPU buffer, we traverse the list of mappings for the task
1001(remember, we will have seen a task switch earlier, so we know which
1002task's lists to look at). When a mapping is found that contains the PC
1003value, we look up the mapped file's dentry in the dcookie cache. This
1004gives the dcookie ID that will uniquely identify the mapped file. Then
1005we alter the absolute value such that it is an offset from the start of
1006the file being mapped (the mapping need not start at the start of the
1007actual file, so we have to consider the offset value of the mapping). We
1008store this dcookie ID into the event buffer; this identifies which
1009binary the samples following it are against.
1010In this manner, we have converted a PC value, which has transitory
1011meaning only, into a static offset value for later processing by the
1012daemon.
1013</p>
1014 <p>
1015We also attempt to avoid the relatively expensive lookup of the dentry
1016cookie value by storing the cookie value directly into the dentry
1017itself; then we can simply derive the cookie value immediately when we
1018find the correct mapping.
1019</p>
1020 </div>
1021 </div>
1022 <div class="chapter" lang="en" xml:lang="en">
1023 <div class="titlepage">
1024 <div>
1025 <div>
1026 <h2 class="title"><a id="sample-files"></a>Chapter 4. Generating sample files</h2>
1027 </div>
1028 </div>
1029 </div>
1030 <div class="toc">
1031 <p>
1032 <b>Table of Contents</b>
1033 </p>
1034 <dl>
1035 <dt>
1036 <span class="sect1">
1037 <a href="#processing-buffer">1. Processing the buffer</a>
1038 </span>
1039 </dt>
1040 <dd>
1041 <dl>
1042 <dt>
1043 <span class="sect2">
1044 <a href="#handling-kernel-samples">1.1. Handling kernel samples</a>
1045 </span>
1046 </dt>
1047 </dl>
1048 </dd>
1049 <dt>
1050 <span class="sect1">
1051 <a href="#sample-file-generation">2. Locating and creating sample files</a>
1052 </span>
1053 </dt>
1054 <dt>
1055 <span class="sect1">
1056 <a href="#sample-file-writing">3. Writing data to a sample file</a>
1057 </span>
1058 </dt>
1059 </dl>
1060 </div>
1061 <div class="sect1" lang="en" xml:lang="en">
1062 <div class="titlepage">
1063 <div>
1064 <div>
1065 <h2 class="title" style="clear: both"><a id="processing-buffer"></a>1. Processing the buffer</h2>
1066 </div>
1067 </div>
1068 </div>
1069 <p>
1070Now we can move onto user-space in our description of how raw interrupt
1071samples are processed into useful information. As we described in
1072previous sections, the kernel OProfile driver creates a large buffer of
1073sample data consisting of offset values, interspersed with
1074notification of changes in context. These context changes indicate how
1075following samples should be attributed, and include task switches, CPU
1076changes, and which dcookie the sample value is against. By processing
1077this buffer entry-by-entry, we can determine where the samples should
1078be accredited to. This is particularly important when using the
1079<code class="option">--separate</code>.
1080</p>
1081 <p>
1082The file <code class="filename">daemon/opd_trans.c</code> contains the basic routine
1083for the buffer processing. The <code class="varname">struct transient</code>
1084structure is used to hold changes in context. Its members are modified
1085as we process each entry; it is passed into the routines in
1086<code class="filename">daemon/opd_sfile.c</code> for actually logging the sample
1087to a particular sample file (which will be held in
1088<code class="filename">/var/lib/oprofile/samples/current</code>).
1089</p>
1090 <p>
1091The buffer format is designed for conciseness, as high sampling rates
1092can easily generate a lot of data. Thus, context changes are prefixed
1093by an escape code, identified by <code class="function">is_escape_code()</code>.
1094If an escape code is found, the next entry in the buffer identifies
1095what type of context change is being read. These are handed off to
1096various handlers (see the <code class="varname">handlers</code> array), which
1097modify the transient structure as appropriate. If it's not an escape
1098code, then it must be a PC offset value, and the very next entry will
1099be the numeric hardware counter. These values are read and recorded
1100in the transient structure; we then do a lookup to find the correct
1101sample file, and log the sample, as described in the next section.
1102</p>
1103 <div class="sect2" lang="en" xml:lang="en">
1104 <div class="titlepage">
1105 <div>
1106 <div>
1107 <h3 class="title"><a id="handling-kernel-samples"></a>1.1. Handling kernel samples</h3>
1108 </div>
1109 </div>
1110 </div>
1111 <p>
1112Samples from kernel code require a little special handling. Because
1113the binary text which the sample is against does not correspond to
1114any file that the kernel directly knows about, the OProfile driver
1115stores the absolute PC value in the buffer, instead of the file offset.
1116Of course, we need an offset against some particular binary. To handle
1117this, we keep a list of loaded modules by parsing
1118<code class="filename">/proc/modules</code> as needed. When a module is loaded,
1119a notification is placed in the OProfile buffer, and this triggers a
1120re-read. We store the module name, and the loading address and size.
1121This is also done for the main kernel image, as specified by the user.
1122The absolute PC value is matched against each address range, and
1123modified into an offset when the matching module is found. See
1124<code class="filename">daemon/opd_kernel.c</code> for the details.
1125</p>
1126 </div>
1127 </div>
1128 <div class="sect1" lang="en" xml:lang="en">
1129 <div class="titlepage">
1130 <div>
1131 <div>
1132 <h2 class="title" style="clear: both"><a id="sample-file-generation"></a>2. Locating and creating sample files</h2>
1133 </div>
1134 </div>
1135 </div>
1136 <p>
1137We have a sample value and its satellite data stored in a
1138<code class="varname">struct transient</code>, and we must locate an
1139actual sample file to store the sample in, using the context
1140information in the transient structure as a key. The transient data to
1141sample file lookup is handled in
1142<code class="filename">daemon/opd_sfile.c</code>. A hash is taken of the
1143transient values that are relevant (depending upon the setting of
1144<code class="option">--separate</code>, some values might be irrelevant), and the
1145hash value is used to lookup the list of currently open sample files.
1146Of course, the sample file might not be found, in which case we need
1147to create and open it.
1148</p>
1149 <p>
1150OProfile uses a rather complex scheme for naming sample files, in order
1151to make selecting relevant sample files easier for the post-profiling
1152utilities. The exact details of the scheme are given in
1153<code class="filename">oprofile-tests/pp_interface</code>, but for now it will
1154suffice to remember that the filename will include only relevant
1155information for the current settings, taken from the transient data. A
1156fully-specified filename looks something like :
1157</p>
1158 <code class="computeroutput">
1159/var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0
1160</code>
1161 <p>
1162It should be clear that this identifies such information as the
1163application binary, the dependent (library) binary, the hardware event,
1164and the process and thread ID. Typically, not all this information is
1165needed, in which cases some values may be replaced with the token
1166<code class="filename">all</code>.
1167</p>
1168 <p>
1169The code that generates this filename and opens the file is found in
1170<code class="filename">daemon/opd_mangling.c</code>. You may have realised that
1171at this point, we do not have the binary image file names, only the
1172dcookie values. In order to determine a file name, a dcookie value is
1173looked up in the dcookie cache. This is to be found in
1174<code class="filename">daemon/opd_cookie.c</code>. Since dcookies are both
1175persistent and unique during a sampling session, we can cache the
1176values. If the value is not found in the cache, then we ask the kernel
1177to do the lookup from value to file name for us by calling
1178<code class="function">lookup_dcookie()</code>. This looks up the value in a
1179kernel-side cache (see <code class="filename">fs/dcookies.c</code>) and returns
1180the fully-qualified file name to userspace.
1181</p>
1182 </div>
1183 <div class="sect1" lang="en" xml:lang="en">
1184 <div class="titlepage">
1185 <div>
1186 <div>
1187 <h2 class="title" style="clear: both"><a id="sample-file-writing"></a>3. Writing data to a sample file</h2>
1188 </div>
1189 </div>
1190 </div>
1191 <p>
1192Each specific sample file is a hashed collection, where the key is
1193the PC offset from the transient data, and the value is the number of
1194samples recorded against that offset. The files are
1195<code class="function">mmap()</code>ed into the daemon's memory space. The code
1196to actually log the write against the sample file can be found in
1197<code class="filename">libdb/</code>.
1198</p>
1199 <p>
1200For recording stack traces, we have a more complicated sample filename
1201mangling scheme that allows us to identify cross-binary calls. We use
1202the same sample file format, where the key is a 64-bit value composed
1203from the from,to pair of offsets.
1204</p>
1205 </div>
1206 </div>
1207 <div class="chapter" lang="en" xml:lang="en">
1208 <div class="titlepage">
1209 <div>
1210 <div>
1211 <h2 class="title"><a id="output"></a>Chapter 5. Generating useful output</h2>
1212 </div>
1213 </div>
1214 </div>
1215 <div class="toc">
1216 <p>
1217 <b>Table of Contents</b>
1218 </p>
1219 <dl>
1220 <dt>
1221 <span class="sect1">
1222 <a href="#profile-specification">1. Handling the profile specification</a>
1223 </span>
1224 </dt>
1225 <dt>
1226 <span class="sect1">
1227 <a href="#sample-file-collating">2. Collating the candidate sample files</a>
1228 </span>
1229 </dt>
1230 <dd>
1231 <dl>
1232 <dt>
1233 <span class="sect2">
1234 <a href="#sample-file-classifying">2.1. Classifying sample files</a>
1235 </span>
1236 </dt>
1237 <dt>
1238 <span class="sect2">
1239 <a href="#sample-file-inverting">2.2. Creating inverted profile lists</a>
1240 </span>
1241 </dt>
1242 </dl>
1243 </dd>
1244 <dt>
1245 <span class="sect1">
1246 <a href="#generating-profile-data">3. Generating profile data</a>
1247 </span>
1248 </dt>
1249 <dd>
1250 <dl>
1251 <dt>
1252 <span class="sect2">
1253 <a href="#bfd">3.1. Processing the binary image</a>
1254 </span>
1255 </dt>
1256 <dt>
1257 <span class="sect2">
1258 <a href="#processing-sample-files">3.2. Processing the sample files</a>
1259 </span>
1260 </dt>
1261 </dl>
1262 </dd>
1263 <dt>
1264 <span class="sect1">
1265 <a href="#generating-output">4. Generating output</a>
1266 </span>
1267 </dt>
1268 </dl>
1269 </div>
1270 <p>
1271All of the tools used to generate human-readable output have to take
1272roughly the same steps to collect the data for processing. First, the
1273profile specification given by the user has to be parsed. Next, a list
1274of sample files matching the specification has to obtained. Using this
1275list, we need to locate the binary file for each sample file, and then
1276use them to extract meaningful data, before a final collation and
1277presentation to the user.
1278</p>
1279 <div class="sect1" lang="en" xml:lang="en">
1280 <div class="titlepage">
1281 <div>
1282 <div>
1283 <h2 class="title" style="clear: both"><a id="profile-specification"></a>1. Handling the profile specification</h2>
1284 </div>
1285 </div>
1286 </div>
1287 <p>
1288The profile specification presented by the user is parsed in
1289the function <code class="function">profile_spec::create()</code>. This
1290creates an object representing the specification. Then we
1291use <code class="function">profile_spec::generate_file_list()</code>
1292to search for all sample files and match them against the
1293<code class="varname">profile_spec</code>.
1294</p>
1295 <p>
1296To enable this matching process to work, the attributes of
1297each sample file is encoded in its filename. This is a low-tech
1298approach to matching specifications against candidate sample
1299files, but it works reasonably well. A typical sample file
1300might look like these:
1301</p>
1302 <table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
1303 <tr>
1304 <td>
1305 <pre class="screen">
1306/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
1307/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
1308/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
1309/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
1310</pre>
1311 </td>
1312 </tr>
1313 </table>
1314 <p>
1315This looks unnecessarily complex, but it's actually fairly simple. First
1316we have the session of the sample, here
1317<code class="filename">/var/lib/oprofile/samples/current</code>. This could
1318equally well be inside an archive from <span><strong class="command">oparchive</strong></span>.
1319Next we have one of the tokens <code class="filename">{root}</code> or
1320<code class="filename">{kern}</code>. <code class="filename">{root}</code> indicates
1321that the binary is found on a file system, and we will encode its path
1322in the next section (e.g. <code class="filename">/bin/ls</code>).
1323<code class="filename">{kern}</code> indicates a kernel module - on 2.6 kernels
1324the path information is not available from the kernel, so we have to
1325special-case kernel modules like this; we encode merely the name of the
1326module as loaded.
1327</p>
1328 <p>
1329Next there is a <code class="filename">{dep}</code> token, indicating another
1330token/path which identifies the dependent binary image. This is used even for
1331the "primary" binary (i.e. the one that was
1332<code class="function">execve()</code>d), as it simplifies processing. Finally,
1333if this sample file is a normal flat profile, the actual file is next in
1334the path. If it's a call-graph sample file, we need one further
1335specification, to allow us to identify cross-binary arcs in the call
1336graph.
1337</p>
1338 <p>
1339The actual sample file name is dot-separated, where the fields are, in
1340order: event name, event count, unit mask, task group ID, task ID, and
1341CPU number.
1342</p>
1343 <p>
1344This sample file can be reliably parsed (with
1345<code class="function">parse_filename()</code>) into a
1346<code class="varname">filename_spec</code>. Finally, we can check whether to
1347include the sample file in the final results by comparing this
1348<code class="varname">filename_spec</code> against the
1349<code class="varname">profile_spec</code> the user specified (for the interested,
1350see <code class="function">valid_candidate()</code> and
1351<code class="function">profile_spec::match</code>). Then comes the really
1352complicated bit...
1353</p>
1354 </div>
1355 <div class="sect1" lang="en" xml:lang="en">
1356 <div class="titlepage">
1357 <div>
1358 <div>
1359 <h2 class="title" style="clear: both"><a id="sample-file-collating"></a>2. Collating the candidate sample files</h2>
1360 </div>
1361 </div>
1362 </div>
1363 <p>
1364At this point we have a duplicate-free list of sample files we need
1365to process. But first we need to do some further arrangement: we
1366need to classify each sample file, and we may also need to "invert"
1367the profiles.
1368</p>
1369 <div class="sect2" lang="en" xml:lang="en">
1370 <div class="titlepage">
1371 <div>
1372 <div>
1373 <h3 class="title"><a id="sample-file-classifying"></a>2.1. Classifying sample files</h3>
1374 </div>
1375 </div>
1376 </div>
1377 <p>
1378It's possible for utilities like <span><strong class="command">opreport</strong></span> to show
1379data in columnar format: for example, we might want to show the results
1380of two threads within a process side-by-side. To do this, we need
1381to classify each sample file into classes - the classes correspond
1382with each <span><strong class="command">opreport</strong></span> column. The function that handles
1383this is <code class="function">arrange_profiles()</code>. Each sample file
1384is added to a particular class. If the sample file is the first in
1385its class, a template is generated from the sample file. Each template
1386describes a particular class (thus, in our example above, each template
1387will have a different thread ID, and this uniquely identifies each
1388class).
1389</p>
1390 <p>
1391Each class has a list of "profile sets" matching that class's template.
1392A profile set is either a profile of the primary binary image, or any of
1393its dependent images. After all sample files have been listed in one of
1394the profile sets belonging to the classes, we have to name each class and
1395perform error-checking. This is done by
1396<code class="function">identify_classes()</code>; each class is checked to ensure
1397that its "axis" is the same as all the others. This is needed because
1398<span><strong class="command">opreport</strong></span> can't produce results in 3D format: we can
1399only differ in one aspect, such as thread ID or event name.
1400</p>
1401 </div>
1402 <div class="sect2" lang="en" xml:lang="en">
1403 <div class="titlepage">
1404 <div>
1405 <div>
1406 <h3 class="title"><a id="sample-file-inverting"></a>2.2. Creating inverted profile lists</h3>
1407 </div>
1408 </div>
1409 </div>
1410 <p>
1411Remember that if we're using certain profile separation options, such as
1412"--separate=lib", a single binary could be a dependent image to many
1413different binaries. For example, the C library image would be a
1414dependent image for most programs that have been profiled. As it
1415happens, this can cause severe performance problems: without some
1416re-arrangement, these dependent binary images would be opened each
1417time we need to process sample files for each program.
1418</p>
1419 <p>
1420The solution is to "invert" the profiles via
1421<code class="function">invert_profiles()</code>. We create a new data structure
1422where the dependent binary is first, and the primary binary images using
1423that dependent binary are listed as sub-images. This helps our
1424performance problem, as now we only need to open each dependent image
1425once, when we process the list of inverted profiles.
1426</p>
1427 </div>
1428 </div>
1429 <div class="sect1" lang="en" xml:lang="en">
1430 <div class="titlepage">
1431 <div>
1432 <div>
1433 <h2 class="title" style="clear: both"><a id="generating-profile-data"></a>3. Generating profile data</h2>
1434 </div>
1435 </div>
1436 </div>
1437 <p>
1438Things don't get any simpler at this point, unfortunately. At this point
1439we've collected and classified the sample files into the set of inverted
1440profiles, as described in the previous section. Now we need to process
1441each inverted profile and make something of the data. The entry point
1442for this is <code class="function">populate_for_image()</code>.
1443</p>
1444 <div class="sect2" lang="en" xml:lang="en">
1445 <div class="titlepage">
1446 <div>
1447 <div>
1448 <h3 class="title"><a id="bfd"></a>3.1. Processing the binary image</h3>
1449 </div>
1450 </div>
1451 </div>
1452 <p>
1453The first thing we do with an inverted profile is attempt to open the
1454binary image (remember each inverted profile set is only for one binary
1455image, but may have many sample files to process). The
1456<code class="varname">op_bfd</code> class provides an abstracted interface to
1457this; internally it uses <code class="filename">libbfd</code>. The main purpose
1458of this class is to process the symbols for the binary image; this is
1459also where symbol filtering happens. This is actually quite tricky, but
1460should be clear from the source.
1461</p>
1462 </div>
1463 <div class="sect2" lang="en" xml:lang="en">
1464 <div class="titlepage">
1465 <div>
1466 <div>
1467 <h3 class="title"><a id="processing-sample-files"></a>3.2. Processing the sample files</h3>
1468 </div>
1469 </div>
1470 </div>
1471 <p>
1472The class <code class="varname">profile_container</code> is a hold-all that
1473contains all the processed results. It is a container of
1474<code class="varname">profile_t</code> objects. The
1475<code class="function">add_sample_files()</code> method uses
1476<code class="filename">libdb</code> to open the given sample file and add the
1477key/value types to the <code class="varname">profile_t</code>. Once this has been
1478done, <code class="function">profile_container::add()</code> is passed the
1479<code class="varname">profile_t</code> plus the <code class="varname">op_bfd</code> for
1480processing.
1481</p>
1482 <p>
1483<code class="function">profile_container::add()</code> walks through the symbols
1484collected in the <code class="varname">op_bfd</code>.
1485<code class="function">op_bfd::get_symbol_range()</code> gives us the start and
1486end of the symbol as an offset from the start of the binary image,
1487then we interrogate the <code class="varname">profile_t</code> for the relevant samples
1488for that offset range. We create a <code class="varname">symbol_entry</code>
1489object for this symbol and fill it in. If needed, here we also collect
1490debug information from the <code class="varname">op_bfd</code>, and possibly
1491record the detailed sample information (as used by <span><strong class="command">opreport
1492-d</strong></span> and <span><strong class="command">opannotate</strong></span>).
1493Finally the <code class="varname">symbol_entry</code> is added to
1494a private container of <code class="varname">profile_container</code> - this
1495<code class="varname">symbol_container</code> holds all such processed symbols.
1496</p>
1497 </div>
1498 </div>
1499 <div class="sect1" lang="en" xml:lang="en">
1500 <div class="titlepage">
1501 <div>
1502 <div>
1503 <h2 class="title" style="clear: both"><a id="generating-output"></a>4. Generating output</h2>
1504 </div>
1505 </div>
1506 </div>
1507 <p>
1508After the processing described in the previous section, we've now got
1509full details of what we need to output stored in the
1510<code class="varname">profile_container</code> on a symbol-by-symbol basis. To
1511produce output, we need to replay that data and format it suitably.
1512</p>
1513 <p>
1514<span><strong class="command">opreport</strong></span> first asks the
1515<code class="varname">profile_container</code> for a
1516<code class="varname">symbol_collection</code> (this is also where thresholding
1517happens).
1518This is sorted, then a
1519<code class="varname">opreport_formatter</code> is initialised.
1520This object initialises a set of field formatters as requested. Then
1521<code class="function">opreport_formatter::output()</code> is called. This
1522iterates through the (sorted) <code class="varname">symbol_collection</code>;
1523for each entry, the selected fields (as set by the
1524<code class="varname">format_flags</code> options) are output by calling the
1525field formatters, with the <code class="varname">symbol_entry</code> passed in.
1526</p>
1527 </div>
1528 </div>
1529 <div class="glossary">
1530 <div class="titlepage">
1531 <div>
1532 <div>
1533 <h2 class="title"><a id="glossary"></a>Glossary of OProfile source concepts and types</h2>
1534 </div>
1535 </div>
1536 </div>
1537 <dl>
1538 <dt>application image</dt>
1539 <dd>
1540 <p>
1541The primary binary image used by an application. This is derived
1542from the kernel and corresponds to the binary started upon running
1543an application: for example, <code class="filename">/bin/bash</code>.
1544</p>
1545 </dd>
1546 <dt>binary image</dt>
1547 <dd>
1548 <p>
1549An ELF file containing executable code: this includes kernel modules,
1550the kernel itself (a.k.a. <code class="filename">vmlinux</code>), shared libraries,
1551and application binaries.
1552</p>
1553 </dd>
1554 <dt>dcookie</dt>
1555 <dd>
1556 <p>
1557Short for "dentry cookie". A unique ID that can be looked up to provide
1558the full path name of a binary image.
1559</p>
1560 </dd>
1561 <dt>dependent image</dt>
1562 <dd>
1563 <p>
1564A binary image that is dependent upon an application, used with
1565per-application separation. Most commonly, shared libraries. For example,
1566if <code class="filename">/bin/bash</code> is running and we take
1567some samples inside the C library itself due to <span><strong class="command">bash</strong></span>
1568calling library code, then the image <code class="filename">/lib/libc.so</code>
1569would be dependent upon <code class="filename">/bin/bash</code>.
1570</p>
1571 </dd>
1572 <dt>merging</dt>
1573 <dd>
1574 <p>
1575This refers to the ability to merge several distinct sample files
1576into one set of data at runtime, in the post-profiling tools. For example,
1577per-thread sample files can be merged into one set of data, because
1578they are compatible (i.e. the aggregation of the data is meaningful),
1579but it's not possible to merge sample files for two different events,
1580because there would be no useful meaning to the results.
1581</p>
1582 </dd>
1583 <dt>profile class</dt>
1584 <dd>
1585 <p>
1586A collection of profile data that has been collected under the same
1587class template. For example, if we're using <span><strong class="command">opreport</strong></span>
1588to show results after profiling with two performance counters enabled
1589profiling <code class="constant">DATA_MEM_REFS</code> and <code class="constant">CPU_CLK_UNHALTED</code>,
1590there would be two profile classes, one for each event. Or if we're on
1591an SMP system and doing per-cpu profiling, and we request
1592<span><strong class="command">opreport</strong></span> to show results for each CPU side-by-side,
1593there would be a profile class for each CPU.
1594</p>
1595 </dd>
1596 <dt>profile specification</dt>
1597 <dd>
1598 <p>
1599The parameters the user passes to the post-profiling tools that limit
1600what sample files are used. This specification is matched against
1601the available sample files to generate a selection of profile data.
1602</p>
1603 </dd>
1604 <dt>profile template</dt>
1605 <dd>
1606 <p>
1607The parameters that define what goes in a particular profile class.
1608This includes a symbolic name (e.g. "cpu:1") and the code-usable
1609equivalent.
1610</p>
1611 </dd>
1612 </dl>
1613 </div>
1614 </div>
1615 </body>
1616</html>