<?xml version="1.0" encoding='ISO-8859-1'?>
| 2 | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"> |
| 3 | |
| 4 | <book id="oprofile-internals"> |
| 5 | <bookinfo> |
| 6 | <title>OProfile Internals</title> |
| 7 | |
| 8 | <authorgroup> |
| 9 | <author> |
| 10 | <firstname>John</firstname> |
| 11 | <surname>Levon</surname> |
| 12 | <affiliation> |
| 13 | <address><email>levon@movementarian.org</email></address> |
| 14 | </affiliation> |
| 15 | </author> |
| 16 | </authorgroup> |
| 17 | |
| 18 | <copyright> |
| 19 | <year>2003</year> |
| 20 | <holder>John Levon</holder> |
| 21 | </copyright> |
| 22 | </bookinfo> |
| 23 | |
| 24 | <toc></toc> |
| 25 | |
| 26 | <chapter id="introduction"> |
| 27 | <title>Introduction</title> |
| 28 | |
| 29 | <para> |
This document is current for OProfile version <oprofileversion />.
It provides some details on the internal workings of OProfile for the
interested hacker, and assumes strong C, working C++, plus some knowledge of
kernel internals and CPU hardware.
| 34 | </para> |
| 35 | <note> |
| 36 | <para> |
Only the "new" implementation associated with kernel 2.6 and above is covered here.
Kernel 2.4 uses a very different kernel module implementation and daemon to produce the sample files.
| 39 | </para> |
| 40 | </note> |
| 41 | |
| 42 | <sect1 id="overview"> |
| 43 | <title>Overview</title> |
| 44 | <para> |
| 45 | OProfile is a statistical continuous profiler. In other words, profiles are generated by |
| 46 | regularly sampling the current registers on each CPU (from an interrupt handler, the |
| 47 | saved PC value at the time of interrupt is stored), and converting that runtime PC |
| 48 | value into something meaningful to the programmer. |
| 49 | </para> |
| 50 | <para> |
| 51 | OProfile achieves this by taking the stream of sampled PC values, along with the detail |
of which task was running at the time of the interrupt, and converting it into a file offset
| 53 | against a particular binary file. Because applications <function>mmap()</function> |
| 54 | the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename> |
| 55 | or whatever), it's possible to find the relevant binary file and offset by walking |
| 56 | the task's list of mapped memory areas. Each PC value is thus converted into a tuple |
| 57 | of binary-image,offset. This is something that the userspace tools can use directly |
| 58 | to reconstruct where the code came from, including the particular assembly instructions, |
| 59 | symbol, and source line (via the binary's debug information if present). |
| 60 | </para> |
| 61 | <para> |
Regularly sampling the PC value like this approximates what was actually executed and
how often - more often than not, this statistical approximation is good enough to
reflect reality. In common operation, the time between each sample interrupt is regulated
by a fixed number of clock cycles. This implies that the results will reflect where
the CPU is spending the most time; this is obviously a very useful source of information
| 67 | for performance analysis. |
| 68 | </para> |
| 69 | <para> |
Sometimes though, an application programmer needs different kinds of information: for example,
"which of the source routines cause the most cache misses?". The rise in importance of
such metrics in recent years has led many CPU manufacturers to provide hardware performance
counters capable of measuring these events at the hardware level. Typically, these counters
increment once per event, and generate an interrupt on reaching some pre-defined
number of events. OProfile can use these interrupts to generate samples: then, the
profile results are a statistical approximation of which code caused how many occurrences
of the given event.
| 78 | </para> |
| 79 | <para> |
| 80 | Consider a simplified system that only executes two functions A and B. A |
| 81 | takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at |
| 82 | 100 cycles a second, and we've set the performance counter to create an |
| 83 | interrupt after a set number of "events" (in this case an event is one |
| 84 | clock cycle). It should be clear that the chances of the interrupt |
occurring in function A are 1/100, and 99/100 for function B. Thus, we
| 86 | statistically approximate the actual relative performance features of |
| 87 | the two functions over time. This same analysis works for other types of |
events, provided that the interrupt is tied to the number of events
| 89 | occurring (that is, after N events, an interrupt is generated). |
| 90 | </para> |
| 91 | <para> |
There is typically more than one of these counters, so it's possible to set up profiling
| 93 | for several different event types. Using these counters gives us a powerful, low-overhead |
| 94 | way of gaining performance metrics. If OProfile, or the CPU, does not support performance |
| 95 | counters, then a simpler method is used: the kernel timer interrupt feeds samples |
| 96 | into OProfile itself. |
| 97 | </para> |
| 98 | <para> |
| 99 | The rest of this document concerns itself with how we get from receiving samples at |
| 100 | interrupt time to producing user-readable profile information. |
| 101 | </para> |
| 102 | </sect1> |
| 103 | |
| 104 | <sect1 id="components"> |
| 105 | <title>Components of the OProfile system</title> |
| 106 | |
| 107 | <sect2 id="arch-specific-components"> |
| 108 | <title>Architecture-specific components</title> |
| 109 | <para> |
| 110 | If OProfile supports the hardware performance counters found on |
a particular architecture, the code for setting up and managing
these counters can be found in the kernel source
tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename>
directory. The architecture-specific implementation works by
filling in the <varname>oprofile_operations</varname> structure at init time. This
| 116 | provides a set of operations such as <function>setup()</function>, |
| 117 | <function>start()</function>, <function>stop()</function>, etc. |
| 118 | that manage the hardware-specific details of fiddling with the |
| 119 | performance counter registers. |
| 120 | </para> |
| 121 | <para> |
| 122 | The other important facility available to the architecture code is |
| 123 | <function>oprofile_add_sample()</function>. This is where a particular sample |
| 124 | taken at interrupt time is fed into the generic OProfile driver code. |
| 125 | </para> |
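<para>
For illustration, here is a rough sketch of how an architecture-specific driver
plugs into the generic code. The structure layout and helper names are
simplified from memory and vary between kernel versions (see
<filename>include/linux/oprofile.h</filename> for the real declarations), and
the <function>my_arch_*</function> functions are hypothetical.
</para>
<screen><![CDATA[
/* Simplified sketch only -- not the exact kernel declarations. */
struct oprofile_operations {
        int  (*setup)(void);      /* program the counter registers           */
        int  (*start)(void);      /* enable the configured counters          */
        void (*stop)(void);       /* disable the counters                    */
        void (*shutdown)(void);   /* undo setup, restore the previous state  */
        char *cpu_type;           /* e.g. "i386/ppro", reported to userspace */
};

/* Called once at init time so the architecture can fill in its methods. */
int oprofile_arch_init(struct oprofile_operations *ops)
{
        ops->setup    = my_arch_setup;
        ops->start    = my_arch_start;
        ops->stop     = my_arch_stop;
        ops->shutdown = my_arch_shutdown;
        ops->cpu_type = "i386/ppro";
        return 0;
}

/* From the counter-overflow interrupt handler: hand one sample to the
 * generic driver, which stores it in the current CPU's buffer. */
static void my_arch_overflow_handler(struct pt_regs *regs, int counter)
{
        oprofile_add_sample(regs, counter);
}
]]></screen>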
| 126 | </sect2> |
| 127 | |
| 128 | <sect2 id="filesystem"> |
| 129 | <title>oprofilefs</title> |
| 130 | <para> |
| 131 | OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from |
| 132 | userspace at <filename>/dev/oprofile</filename>. This consists of small |
| 133 | files for reporting and receiving configuration from userspace, as well |
| 134 | as the actual character device that the OProfile userspace receives samples |
from. At <function>setup()</function> time, the architecture-specific code may
| 136 | add further configuration files related to the details of the performance |
| 137 | counters. For example, on x86, one numbered directory for each hardware |
| 138 | performance counter is added, with files in each for the event type, |
| 139 | reset value, etc. |
| 140 | </para> |
| 141 | <para> |
| 142 | The filesystem also contains a <filename>stats</filename> directory with |
| 143 | a number of useful counters for various OProfile events. |
| 144 | </para> |
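<para>
As a concrete illustration, the mounted filesystem looks roughly like the
listing below on an x86 system. The exact set of files depends on the
architecture and kernel version, so treat this as an example rather than a
reference.
</para>
<screen>
/dev/oprofile/
|-- 0/              per-counter configuration directory (x86)
|   |-- enabled
|   |-- event
|   |-- count
|   |-- unit_mask
|   |-- kernel
|   `-- user
|-- 1/              ...one directory per hardware counter...
|-- buffer          character device the daemon reads samples from
|-- buffer_size
|-- cpu_type
|-- enable          write an ASCII '1' here to start profiling
`-- stats/          counters of lost samples, interrupts received, etc.
</screen>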
| 145 | </sect2> |
| 146 | |
| 147 | <sect2 id="driver"> |
| 148 | <title>Generic kernel driver</title> |
| 149 | <para> |
| 150 | This lives in <filename>drivers/oprofile/</filename>, and forms the core of |
| 151 | how OProfile works in the kernel. Its job is to take samples delivered |
| 152 | from the architecture-specific code (via <function>oprofile_add_sample()</function>), |
| 153 | and buffer this data, in a transformed form as described later, until releasing |
| 154 | the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename> |
| 155 | character device. |
| 156 | </para> |
| 157 | </sect2> |
| 158 | |
| 159 | <sect2 id="daemon"> |
| 160 | <title>The OProfile daemon</title> |
| 161 | <para> |
| 162 | The OProfile userspace daemon's job is to take the raw data provided by the |
kernel and write it to disk. It takes the single data stream from the
kernel and logs sample data against a number of sample files (found in
<filename>$SESSION_DIR/samples/current/</filename>, by default located at
<filename>/var/lib/oprofile/samples/current/</filename>). For the benefit
| 167 | of the "separate" functionality, the names/paths of these sample files |
| 168 | are mangled to reflect where the samples were from: this can include |
| 169 | thread IDs, the binary file path, the event type used, and more. |
| 170 | </para> |
| 171 | <para> |
| 172 | After this final step from interrupt to disk file, the data is now |
| 173 | persistent (that is, changes in the running of the system do not invalidate |
| 174 | stored data). So the post-profiling tools can run on this data at any |
| 175 | time (assuming the original binary files are still available and unchanged, |
| 176 | naturally). |
| 177 | </para> |
| 178 | </sect2> |
| 179 | |
| 180 | <sect2 id="post-profiling"> |
| 181 | <title>Post-profiling tools</title> |
<para>
So far, we've collected data, but we've yet to present it in a useful form
to the user. This is the job of the post-profiling tools. In general form,
they collate a subset of the available sample files, load and process each one
correlated against the relevant binary file, and finally produce user-readable
information.
</para>
| 187 | </sect2> |
| 188 | |
| 189 | </sect1> |
| 190 | |
| 191 | </chapter> |
| 192 | |
| 193 | <chapter id="performance-counters"> |
| 194 | <title>Performance counter management</title> |
| 195 | |
| 196 | <sect1 id ="performance-counters-ui"> |
| 197 | <title>Providing a user interface</title> |
| 198 | |
| 199 | <para> |
| 200 | The performance counter registers need programming in order to set the |
| 201 | type of event to count, etc. OProfile uses a standard model across all |
CPUs for defining these events as follows:
| 203 | </para> |
| 204 | <informaltable frame="all"> |
| 205 | <tgroup cols='2'> |
| 206 | <tbody> |
| 207 | <row><entry><option>event</option></entry><entry>The event type e.g. DATA_MEM_REFS</entry></row> |
| 208 | <row><entry><option>unit mask</option></entry><entry>The sub-events to count (more detailed specification)</entry></row> |
| 209 | <row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row> |
| 210 | <row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row> |
| 211 | <row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row> |
| 212 | <row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row> |
| 213 | </tbody> |
| 214 | </tgroup> |
| 215 | </informaltable> |
| 216 | <para> |
| 217 | The term "unit mask" is borrowed from the Intel architectures, and can |
| 218 | further specify exactly when a counter is incremented (for example, |
| 219 | cache-related events can be restricted to particular state transitions |
| 220 | of the cache lines). |
| 221 | </para> |
| 222 | <para> |
| 223 | All of the available hardware events and their details are specified in |
| 224 | the textual files in the <filename>events</filename> directory. The |
| 225 | syntax of these files should be fairly obvious. The user specifies the |
| 226 | names and configuration details of the chosen counters via |
| 227 | <command>opcontrol</command>. These are then written to the kernel |
| 228 | module (in numerical form) via <filename>/dev/oprofile/N/</filename> |
| 229 | where N is the physical hardware counter (some events can only be used |
| 230 | on specific counters; OProfile hides these details from the user when |
| 231 | possible). On IA64, the perfmon-based interface behaves somewhat |
| 232 | differently, as described later. |
| 233 | </para> |
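<para>
To illustrate the round trip, here is roughly what one entry in an events file
and the resulting per-counter oprofilefs values look like. The exact syntax and
numbers below are recalled from memory and should be treated purely as an
example.
</para>
<screen>
# one entry from events/i386/ppro/events
event:0x43 counters:0,1 um:zero minimum:500 name:DATA_MEM_REFS : all memory references

# what ends up under /dev/oprofile/0/ when counter 0 is set up to sample
# DATA_MEM_REFS every 100000 events, in both kernel and user space
/dev/oprofile/0/enabled    1
/dev/oprofile/0/event      0x43
/dev/oprofile/0/count      100000
/dev/oprofile/0/unit_mask  0
/dev/oprofile/0/kernel     1
/dev/oprofile/0/user       1
</screen>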
| 234 | |
| 235 | </sect1> |
| 236 | |
| 237 | <sect1 id="performance-counters-programming"> |
| 238 | <title>Programming the performance counter registers</title> |
| 239 | |
| 240 | <para> |
| 241 | We have described how the user interface fills in the desired |
| 242 | configuration of the counters and transmits the information to the |
| 243 | kernel. It is the job of the <function>->setup()</function> method |
| 244 | to actually program the performance counter registers. Clearly, the |
details of how this is done are architecture-specific; they are also
model-specific on many architectures. For example, i386 provides methods
for each model type that program the counter registers correctly
| 248 | (see the <filename>op_model_*</filename> files in |
| 249 | <filename>arch/i386/oprofile</filename> for the details). The method |
| 250 | reads the values stored in the virtual oprofilefs files and programs |
| 251 | the registers appropriately, ready for starting the actual profiling |
| 252 | session. |
| 253 | </para> |
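<para>
A minimal sketch of what such a model-specific <function>setup()</function>
method does is shown below. All the names here
(<varname>counter_config</varname>, <function>write_event_select()</function>
and so on) are hypothetical stand-ins for the real, considerably more careful
code in the <filename>op_model_*</filename> files.
</para>
<screen><![CDATA[
/* Hypothetical sketch of a model-specific setup() method. */
static int my_model_setup(void)
{
        int i;

        for (i = 0; i < num_counters; ++i) {
                if (!counter_config[i].enabled)
                        continue;

                /* Load the negated reset count, so the counter overflows
                 * (and interrupts) after "count" further events. */
                write_counter_register(i, -(long long)counter_config[i].count);

                /* Select the event, unit mask, kernel/user bits and the
                 * interrupt-on-overflow bit in the event-select register. */
                write_event_select(i,
                                   counter_config[i].event,
                                   counter_config[i].unit_mask,
                                   counter_config[i].kernel,
                                   counter_config[i].user);
        }
        return 0;
}
]]></screen>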
| 254 | <para> |
| 255 | The architecture-specific drivers make sure to save the old register |
| 256 | settings before doing OProfile setup. They are restored when OProfile |
| 257 | shuts down. This is useful, for example, on i386, where the NMI watchdog |
| 258 | uses the same performance counter registers as OProfile; they cannot |
| 259 | run concurrently, but OProfile makes sure to restore the setup it found |
| 260 | before it was running. |
| 261 | </para> |
| 262 | <para> |
| 263 | In addition to programming the counter registers themselves, other setup |
| 264 | is often necessary. For example, on i386, the local APIC needs |
| 265 | programming in order to make the counter's overflow interrupt appear as |
| 266 | an NMI (non-maskable interrupt). This allows sampling (and therefore |
| 267 | profiling) of regions where "normal" interrupts are masked, enabling |
| 268 | more reliable profiles. |
| 269 | </para> |
| 270 | |
| 271 | <sect2 id="performance-counters-start"> |
| 272 | <title>Starting and stopping the counters</title> |
| 273 | <para> |
Initiating a profiling session is done by writing an ASCII '1'
to the file <filename>/dev/oprofile/enable</filename>. This sets up the
core, and calls into the architecture-specific driver to actually
enable each configured counter. Again, the details of how this is
done are model-specific (for example, the Athlon models can disable
| 279 | or enable on a per-counter basis, unlike the PPro models). |
| 280 | </para> |
| 281 | </sect2> |
| 282 | |
| 283 | <sect2> |
| 284 | <title>IA64 and perfmon</title> |
| 285 | <para> |
| 286 | The IA64 architecture provides a different interface from the other |
| 287 | architectures, using the existing perfmon driver. Register programming |
| 288 | is handled entirely in user-space (see |
| 289 | <filename>daemon/opd_perfmon.c</filename> for the details). A process |
| 290 | is forked for each CPU, which creates a perfmon context and sets the |
| 291 | counter registers appropriately via the |
| 292 | <function>sys_perfmonctl</function> interface. In addition, the actual |
| 293 | initiation and termination of the profiling session is handled via the |
| 294 | same interface using <constant>PFM_START</constant> and |
| 295 | <constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs |
| 296 | files for the performance counters, as the kernel driver does not |
| 297 | program the registers itself. |
| 298 | </para> |
| 299 | <para> |
| 300 | Instead, the perfmon driver for OProfile simply registers with the |
| 301 | OProfile core with an OProfile-specific UUID. During a profiling |
| 302 | session, the perfmon core calls into the OProfile perfmon driver and |
| 303 | samples are registered with the OProfile core itself as usual (with |
| 304 | <function>oprofile_add_sample()</function>). |
| 305 | </para> |
| 306 | </sect2> |
| 307 | |
| 308 | </sect1> |
| 309 | |
| 310 | </chapter> |
| 311 | |
| 312 | <chapter id="collecting-samples"> |
| 313 | <title>Collecting and processing samples</title> |
| 314 | |
| 315 | <sect1 id="receiving-interrupts"> |
| 316 | <title>Receiving interrupts</title> |
| 317 | <para> |
| 318 | Naturally, how the overflow interrupts are received is specific |
| 319 | to the hardware architecture, unless we are in "timer" mode, where the |
| 320 | logging routine is called directly from the standard kernel timer |
| 321 | interrupt handler. |
| 322 | </para> |
| 323 | <para> |
| 324 | On the i386 architecture, the local APIC is programmed such that when a |
| 325 | counter overflows (that is, it receives an event that causes an integer |
| 326 | overflow of the register value to zero), an NMI is generated. This calls |
| 327 | into the general handler <function>do_nmi()</function>; because OProfile |
| 328 | has registered itself as capable of handling NMI interrupts, this will |
| 329 | call into the OProfile driver code in |
<filename>arch/i386/oprofile</filename>. Here, the saved PC value (the
CPU saves the register set at the time of the interrupt on the stack,
where it is available for inspection) is extracted, and the counters are examined to
find out which one generated the interrupt. Also determined is whether
the system was inside kernel or user space at the time of the interrupt.
These three pieces of information are then forwarded on to the OProfile
| 336 | core via <function>oprofile_add_sample()</function>. Finally, the |
| 337 | counter values are reset to the chosen count value, to ensure another |
| 338 | interrupt happens after another N events have occurred. Other |
| 339 | architectures behave in a similar manner. |
| 340 | </para> |
| 341 | </sect1> |
| 342 | |
| 343 | <sect1 id="core-structure"> |
| 344 | <title>Core data structures</title> |
| 345 | <para> |
| 346 | Before considering what happens when we log a sample, we shall digress |
| 347 | for a moment and look at the general structure of the data collection |
| 348 | system. |
| 349 | </para> |
| 350 | <para> |
| 351 | OProfile maintains a small buffer for storing the logged samples for |
| 352 | each CPU on the system. Only this buffer is altered when we actually log |
| 353 | a sample (remember, we may still be in an NMI context, so no locking is |
| 354 | possible). The buffer is managed by a two-handed system; the "head" |
| 355 | iterator dictates where the next sample data should be placed in the |
| 356 | buffer. Of course, overflow of the buffer is possible, in which case |
| 357 | the sample is discarded. |
| 358 | </para> |
| 359 | <para> |
| 360 | It is critical to remember that at this point, the PC value is an |
| 361 | absolute value, and is therefore only meaningful in the context of which |
| 362 | task it was logged against. Thus, these per-CPU buffers also maintain |
| 363 | details of which task each logged sample is for, as described in the |
| 364 | next section. In addition, we store whether the sample was in kernel |
| 365 | space or user space (on some architectures and configurations, the address |
| 366 | space is not sub-divided neatly at a specific PC value, so we must store |
| 367 | this information). |
| 368 | </para> |
| 369 | <para> |
| 370 | As well as these small per-CPU buffers, we have a considerably larger |
| 371 | single buffer. This holds the data that is eventually copied out into |
| 372 | the OProfile daemon. On certain system events, the per-CPU buffers are |
| 373 | processed and entered (in mutated form) into the main buffer, known in |
| 374 | the source as the "event buffer". The "tail" iterator indicates the |
point from which the CPU buffer may be read, up to the position of the "head"
| 376 | iterator. This provides an entirely lock-free method for extracting data |
| 377 | from the CPU buffers. This process is described in detail later in this chapter. |
| 378 | </para> |
| 379 | <figure><title>The OProfile buffers</title> |
| 380 | <graphic fileref="buffers.png" /> |
| 381 | </figure> |
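<para>
For orientation, the rough shape of the per-CPU buffer is sketched below. The
field names are recalled from <filename>drivers/oprofile/cpu_buffer.h</filename>
in 2.6-era kernels and may differ in detail in any given version.
</para>
<screen><![CDATA[
/* Sketch of the per-CPU buffer; treat the field names as approximate. */
struct op_sample {
        unsigned long eip;    /* sampled PC value (or an escape code)        */
        unsigned long event;  /* counter number (or the escape-code payload) */
};

struct oprofile_cpu_buffer {
        struct task_struct *last_task;  /* task the previous sample was for   */
        int last_is_kernel;             /* was the previous sample in-kernel? */
        unsigned long head_pos;         /* written only by the interrupt path */
        unsigned long tail_pos;         /* advanced only by the sync code     */
        unsigned long buffer_size;
        struct op_sample *buffer;
        unsigned long sample_lost_overflow;  /* samples dropped when full     */
};
]]></screen>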
| 382 | </sect1> |
| 383 | |
| 384 | <sect1 id="logging-sample"> |
| 385 | <title>Logging a sample</title> |
| 386 | <para> |
| 387 | As mentioned, the sample is logged into the buffer specific to the |
| 388 | current CPU. The CPU buffer is a simple array of pairs of unsigned long |
| 389 | values; for a sample, they hold the PC value and the counter for the |
| 390 | sample. (The counter value is later used to translate back into the relevant |
event type the counter was programmed to.)
| 392 | </para> |
| 393 | <para> |
| 394 | In addition to logging the sample itself, we also log task switches. |
| 395 | This is simply done by storing the address of the last task to log a |
| 396 | sample on that CPU in a data structure, and writing a task switch entry |
| 397 | into the buffer if the new value of <function>current()</function> has |
| 398 | changed. Note that later we will directly de-reference this pointer; |
| 399 | this imposes certain restrictions on when and how the CPU buffers need |
| 400 | to be processed. |
| 401 | </para> |
| 402 | <para> |
| 403 | Finally, as mentioned, we log whether we have changed between kernel and |
| 404 | userspace using a similar method. Both of these variables |
| 405 | (<varname>last_task</varname> and <varname>last_is_kernel</varname>) are |
| 406 | reset when the CPU buffer is read. |
| 407 | </para> |
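<para>
Putting the last three paragraphs together, the logging path looks roughly like
the sketch below. The helper names (<function>add_escape()</function>,
<function>add_pair()</function>) are invented for brevity; the real code is in
<filename>drivers/oprofile/cpu_buffer.c</filename>.
</para>
<screen><![CDATA[
/* Illustrative sketch of logging one sample. This runs in interrupt/NMI
 * context, so it touches only this CPU's buffer and takes no locks. */
static void log_sample(struct oprofile_cpu_buffer *cpu_buf,
                       unsigned long pc, int is_kernel, unsigned long event)
{
        is_kernel = !!is_kernel;

        /* Did we cross the kernel/user boundary since the last sample? */
        if (cpu_buf->last_is_kernel != is_kernel) {
                cpu_buf->last_is_kernel = is_kernel;
                add_escape(cpu_buf, is_kernel);              /* context change */
        }

        /* Has the running task changed since the last sample on this CPU? */
        if (cpu_buf->last_task != current) {
                cpu_buf->last_task = current;
                add_escape(cpu_buf, (unsigned long)current); /* task pointer   */
        }

        add_pair(cpu_buf, pc, event);                        /* the sample     */
}
]]></screen>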
| 408 | </sect1> |
| 409 | |
| 410 | <sect1 id="logging-stack"> |
| 411 | <title>Logging stack traces</title> |
| 412 | <para> |
| 413 | OProfile can also provide statistical samples of call chains (on x86). To |
| 414 | do this, at sample time, the frame pointer chain is traversed, recording |
| 415 | the return address for each stack frame. This will only work if the code |
| 416 | was compiled with frame pointers, but we're careful to abort the |
| 417 | traversal if the frame pointer appears bad. We store the set of return |
| 418 | addresses straight into the CPU buffer. Note that, since this traversal |
| 419 | is keyed off the standard sample interrupt, the number of times a |
| 420 | function appears in a stack trace is not an indicator of how many times |
| 421 | the call site was executed: rather, it's related to the number of |
| 422 | samples we took where that call site was involved. Thus, the results for |
| 423 | stack traces are not necessarily proportional to the call counts: |
| 424 | typical programs will have many <function>main()</function> samples. |
| 425 | </para> |
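<para>
A sketch of that frame-pointer walk is shown below.
<function>frame_pointer_ok()</function> stands in for the validation the real
code must perform (checking that the pointer is a sane, accessible stack
address before dereferencing it); see
<filename>arch/i386/oprofile/backtrace.c</filename> for the genuine article.
</para>
<screen><![CDATA[
/* Sketch of an x86 frame-pointer traversal at sample time. */
struct frame_head {
        struct frame_head *ebp;  /* saved frame pointer of the caller */
        unsigned long ret;       /* return address into the caller    */
};

static void sketch_backtrace(struct frame_head *head, unsigned int depth)
{
        while (depth-- && frame_pointer_ok(head)) {
                oprofile_add_trace(head->ret);  /* record the return address     */
                if (head->ebp <= head)          /* frames must move up the stack */
                        break;                  /* otherwise the chain is bad    */
                head = head->ebp;
        }
}
]]></screen>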
| 426 | </sect1> |
| 427 | |
| 428 | <sect1 id="synchronising-buffers"> |
| 429 | <title>Synchronising the CPU buffers to the event buffer</title> |
| 430 | <!-- FIXME: update when percpu patch goes in --> |
| 431 | <para> |
| 432 | At some point, we have to process the data in each CPU buffer and enter |
| 433 | it into the main (event) buffer. The file |
| 434 | <filename>buffer_sync.c</filename> contains the relevant code. We |
| 435 | periodically (currently every <constant>HZ</constant>/4 jiffies) start |
| 436 | the synchronisation process. In addition, we process the buffers on |
| 437 | certain events, such as an application calling |
| 438 | <function>munmap()</function>. This is particularly important for |
| 439 | <function>exit()</function> - because the CPU buffers contain pointers |
| 440 | to the task structure, if we don't process all the buffers before the |
| 441 | task is actually destroyed and the task structure freed, then we could |
| 442 | end up trying to dereference a bogus pointer in one of the CPU buffers. |
| 443 | </para> |
| 444 | <para> |
| 445 | We also add a notification when a kernel module is loaded; this is so |
| 446 | that user-space can re-read <filename>/proc/modules</filename> to |
| 447 | determine the load addresses of kernel module text sections. Without |
| 448 | this notification, samples for a newly-loaded module could get lost or |
| 449 | be attributed to the wrong module. |
| 450 | </para> |
| 451 | <para> |
| 452 | The synchronisation itself works in the following manner: first, mutual |
| 453 | exclusion on the event buffer is taken. Remember, we do not need to do |
that for each CPU buffer, as we only read from the tail iterator (interrupts
might be arriving at the same buffer, but they will write to
| 456 | the position of the head iterator, leaving previously written entries |
| 457 | intact). Then, we process each CPU buffer in turn. A CPU switch |
| 458 | notification is added to the buffer first (for |
| 459 | <option>--separate=cpu</option> support). Then the processing of the |
| 460 | actual data starts. |
| 461 | </para> |
| 462 | <para> |
| 463 | As mentioned, the CPU buffer consists of task switch entries and the |
| 464 | actual samples. When the routine <function>sync_buffer()</function> sees |
| 465 | a task switch, the process ID and process group ID are recorded into the |
| 466 | event buffer, along with a dcookie (see below) identifying the |
| 467 | application binary (e.g. <filename>/bin/bash</filename>). The |
| 468 | <varname>mmap_sem</varname> for the task is then taken, to allow safe |
iteration across the task's list of mapped areas. Each sample is then
| 470 | processed as described in the next section. |
| 471 | </para> |
| 472 | <para> |
| 473 | After a buffer has been read, the tail iterator is updated to reflect |
| 474 | how much of the buffer was processed. Note that when we determined how |
| 475 | much data there was to read in the CPU buffer, we also called |
| 476 | <function>cpu_buffer_reset()</function> to reset |
| 477 | <varname>last_task</varname> and <varname>last_is_kernel</varname>, as |
| 478 | we've already mentioned. During the processing, more samples may have |
| 479 | been arriving in the CPU buffer; this is OK because we are careful to |
| 480 | only update the tail iterator to how much we actually read - on the next |
| 481 | buffer synchronisation, we will start again from that point. |
| 482 | </para> |
| 483 | </sect1> |
| 484 | |
| 485 | <sect1 id="dentry-cookies"> |
| 486 | <title>Identifying binary images</title> |
| 487 | <para> |
| 488 | In order to produce useful profiles, we need to be able to associate a |
| 489 | particular PC value sample with an actual ELF binary on the disk. This |
| 490 | leaves us with the problem of how to export this information to |
| 491 | user-space. We create unique IDs that identify a particular directory |
| 492 | entry (dentry), and write those IDs into the event buffer. Later on, |
| 493 | the user-space daemon can call the <function>lookup_dcookie</function> |
| 494 | system call, which looks up the ID and fills in the full path of |
| 495 | the binary image in the buffer user-space passes in. These IDs are |
| 496 | maintained by the code in <filename>fs/dcookies.c</filename>; the |
| 497 | cache lasts for as long as the daemon has the event buffer open. |
| 498 | </para> |
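<para>
From the daemon's side the lookup is essentially the sketch below (a 64-bit ABI
is assumed for simplicity; on 32-bit platforms the 64-bit cookie has to be
split across argument registers, which is why the daemon carries
per-architecture wrappers in <filename>daemon/opd_cookie.c</filename>). Error
handling is omitted.
</para>
<screen><![CDATA[
/* Minimal sketch: resolve a dcookie value to a path name from userspace.
 * lookup_dcookie() has no glibc wrapper, so it is invoked via syscall(). */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <limits.h>

static char name_buf[PATH_MAX];

static char const *find_cookie(uint64_t cookie)
{
        long ret = syscall(__NR_lookup_dcookie, cookie,
                           name_buf, sizeof(name_buf) - 1);
        if (ret < 0)
                return NULL;        /* unknown cookie, or insufficient rights */
        name_buf[ret] = '\0';       /* terminate using the returned length    */
        return name_buf;
}
]]></screen>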
| 499 | </sect1> |
| 500 | |
| 501 | <sect1 id="finding-dentry"> |
| 502 | <title>Finding a sample's binary image and offset</title> |
| 503 | <para> |
| 504 | We haven't yet described how we process the absolute PC value into |
| 505 | something usable by the user-space daemon. When we find a sample entered |
| 506 | into the CPU buffer, we traverse the list of mappings for the task |
| 507 | (remember, we will have seen a task switch earlier, so we know which |
| 508 | task's lists to look at). When a mapping is found that contains the PC |
| 509 | value, we look up the mapped file's dentry in the dcookie cache. This |
| 510 | gives the dcookie ID that will uniquely identify the mapped file. Then |
| 511 | we alter the absolute value such that it is an offset from the start of |
| 512 | the file being mapped (the mapping need not start at the start of the |
| 513 | actual file, so we have to consider the offset value of the mapping). We |
| 514 | store this dcookie ID into the event buffer; this identifies which |
| 515 | binary the samples following it are against. |
| 516 | In this manner, we have converted a PC value, which has transitory |
| 517 | meaning only, into a static offset value for later processing by the |
| 518 | daemon. |
| 519 | </para> |
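<para>
In outline, and with simplified names (the real logic lives in
<filename>buffer_sync.c</filename>; <function>dcookie_for()</function> here
stands in for the dcookie-cache lookup described in the previous section), the
conversion looks like this:
</para>
<screen><![CDATA[
/* Sketch: turn an absolute user-space PC into a (dcookie, file offset) pair. */
static unsigned long pc_to_offset(struct mm_struct *mm, unsigned long pc,
                                  unsigned long *cookie)
{
        struct vm_area_struct *vma = find_vma(mm, pc);

        *cookie = 0;
        if (!vma || pc < vma->vm_start || !vma->vm_file)
                return 0;                      /* unmapped or anonymous: drop */

        *cookie = dcookie_for(vma->vm_file);
        /* Offset into the file, not into the mapping: add the mapping's
         * own starting file offset (vm_pgoff is in pages). */
        return pc - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT);
}
]]></screen>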
| 520 | <para> |
| 521 | We also attempt to avoid the relatively expensive lookup of the dentry |
| 522 | cookie value by storing the cookie value directly into the dentry |
| 523 | itself; then we can simply derive the cookie value immediately when we |
| 524 | find the correct mapping. |
| 525 | </para> |
| 526 | </sect1> |
| 527 | |
| 528 | </chapter> |
| 529 | |
| 530 | <chapter id="sample-files"> |
| 531 | <title>Generating sample files</title> |
| 532 | |
| 533 | <sect1 id="processing-buffer"> |
| 534 | <title>Processing the buffer</title> |
| 535 | |
| 536 | <para> |
| 537 | Now we can move onto user-space in our description of how raw interrupt |
| 538 | samples are processed into useful information. As we described in |
| 539 | previous sections, the kernel OProfile driver creates a large buffer of |
sample data consisting of offset values, interspersed with
notifications of changes in context. These context changes indicate how
following samples should be attributed, and include task switches, CPU
changes, and which dcookie the sample value is against. By processing
this buffer entry-by-entry, we can determine what the samples should
be attributed to. This is particularly important when using the
<option>--separate</option> option.
| 547 | </para> |
| 548 | <para> |
| 549 | The file <filename>daemon/opd_trans.c</filename> contains the basic routine |
| 550 | for the buffer processing. The <varname>struct transient</varname> |
| 551 | structure is used to hold changes in context. Its members are modified |
| 552 | as we process each entry; it is passed into the routines in |
| 553 | <filename>daemon/opd_sfile.c</filename> for actually logging the sample |
| 554 | to a particular sample file (which will be held in |
| 555 | <filename>$SESSION_DIR/samples/current</filename>). |
| 556 | </para> |
| 557 | <para> |
| 558 | The buffer format is designed for conciseness, as high sampling rates |
| 559 | can easily generate a lot of data. Thus, context changes are prefixed |
| 560 | by an escape code, identified by <function>is_escape_code()</function>. |
| 561 | If an escape code is found, the next entry in the buffer identifies |
| 562 | what type of context change is being read. These are handed off to |
| 563 | various handlers (see the <varname>handlers</varname> array), which |
| 564 | modify the transient structure as appropriate. If it's not an escape |
| 565 | code, then it must be a PC offset value, and the very next entry will |
| 566 | be the numeric hardware counter. These values are read and recorded |
| 567 | in the transient structure; we then do a lookup to find the correct |
| 568 | sample file, and log the sample, as described in the next section. |
| 569 | </para> |
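<para>
Schematically, and with <function>pop_entry()</function> as an invented helper
and the <varname>transient</varname> fields simplified, that loop looks like
the sketch below.
</para>
<screen><![CDATA[
/* Sketch of the daemon's buffer-processing loop (cf. daemon/opd_trans.c). */
static void process_buffer(struct transient *trans)
{
        while (entries_remain(trans)) {
                unsigned long value = pop_entry(trans);

                if (is_escape_code(value)) {
                        /* The next entry says which context change this is;
                         * the handler updates the transient state. */
                        unsigned long code = pop_entry(trans);
                        handlers[code](trans);    /* e.g. task or CPU switch  */
                } else {
                        /* A sample: the PC offset, then the counter number. */
                        trans->pc    = value;
                        trans->event = pop_entry(trans);
                        sfile_log_sample(trans);  /* see daemon/opd_sfile.c   */
                }
        }
}
]]></screen>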
| 570 | |
| 571 | <sect2 id="handling-kernel-samples"> |
| 572 | <title>Handling kernel samples</title> |
| 573 | |
| 574 | <para> |
| 575 | Samples from kernel code require a little special handling. Because |
| 576 | the binary text which the sample is against does not correspond to |
| 577 | any file that the kernel directly knows about, the OProfile driver |
| 578 | stores the absolute PC value in the buffer, instead of the file offset. |
| 579 | Of course, we need an offset against some particular binary. To handle |
| 580 | this, we keep a list of loaded modules by parsing |
| 581 | <filename>/proc/modules</filename> as needed. When a module is loaded, |
| 582 | a notification is placed in the OProfile buffer, and this triggers a |
| 583 | re-read. We store the module name, and the loading address and size. |
| 584 | This is also done for the main kernel image, as specified by the user. |
| 585 | The absolute PC value is matched against each address range, and |
| 586 | modified into an offset when the matching module is found. See |
| 587 | <filename>daemon/opd_kernel.c</filename> for the details. |
| 588 | </para> |
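<para>
The matching itself is no more than a range check over the images we know
about, something like the sketch below (the names are invented; the real
structures are in <filename>daemon/opd_kernel.c</filename>).
</para>
<screen><![CDATA[
/* Sketch: attribute an absolute kernel PC to vmlinux or a module. */
struct kernel_image {
        char name[64];               /* "vmlinux" or the module name */
        unsigned long start, end;    /* load address range           */
};

static struct kernel_image images[512];
static size_t nr_images;             /* filled in from /proc/modules */

static struct kernel_image *find_kernel_image(unsigned long pc)
{
        size_t i;

        for (i = 0; i < nr_images; ++i)
                if (pc >= images[i].start && pc < images[i].end)
                        return &images[i];
        return NULL;                 /* cannot attribute this sample */
}
]]></screen>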
| 589 | |
| 590 | </sect2> |
| 591 | |
| 592 | |
| 593 | </sect1> |
| 594 | |
| 595 | <sect1 id="sample-file-generation"> |
| 596 | <title>Locating and creating sample files</title> |
| 597 | |
| 598 | <para> |
| 599 | We have a sample value and its satellite data stored in a |
| 600 | <varname>struct transient</varname>, and we must locate an |
| 601 | actual sample file to store the sample in, using the context |
| 602 | information in the transient structure as a key. The transient data to |
| 603 | sample file lookup is handled in |
| 604 | <filename>daemon/opd_sfile.c</filename>. A hash is taken of the |
| 605 | transient values that are relevant (depending upon the setting of |
| 606 | <option>--separate</option>, some values might be irrelevant), and the |
| 607 | hash value is used to lookup the list of currently open sample files. |
| 608 | Of course, the sample file might not be found, in which case we need |
| 609 | to create and open it. |
| 610 | </para> |
| 611 | <para> |
| 612 | OProfile uses a rather complex scheme for naming sample files, in order |
| 613 | to make selecting relevant sample files easier for the post-profiling |
| 614 | utilities. The exact details of the scheme are given in |
| 615 | <filename>oprofile-tests/pp_interface</filename>, but for now it will |
| 616 | suffice to remember that the filename will include only relevant |
| 617 | information for the current settings, taken from the transient data. A |
fully-specified filename looks something like this:
| 619 | </para> |
| 620 | <computeroutput> |
| 621 | /var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0 |
| 622 | </computeroutput> |
| 623 | <para> |
| 624 | It should be clear that this identifies such information as the |
| 625 | application binary, the dependent (library) binary, the hardware event, |
| 626 | and the process and thread ID. Typically, not all this information is |
needed, in which case some values may be replaced with the token
| 628 | <filename>all</filename>. |
| 629 | </para> |
| 630 | <para> |
| 631 | The code that generates this filename and opens the file is found in |
| 632 | <filename>daemon/opd_mangling.c</filename>. You may have realised that |
| 633 | at this point, we do not have the binary image file names, only the |
| 634 | dcookie values. In order to determine a file name, a dcookie value is |
| 635 | looked up in the dcookie cache. This is to be found in |
| 636 | <filename>daemon/opd_cookie.c</filename>. Since dcookies are both |
| 637 | persistent and unique during a sampling session, we can cache the |
| 638 | values. If the value is not found in the cache, then we ask the kernel |
| 639 | to do the lookup from value to file name for us by calling |
| 640 | <function>lookup_dcookie()</function>. This looks up the value in a |
| 641 | kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns |
| 642 | the fully-qualified file name to userspace. |
| 643 | </para> |
| 644 | |
| 645 | </sect1> |
| 646 | |
| 647 | <sect1 id="sample-file-writing"> |
| 648 | <title>Writing data to a sample file</title> |
| 649 | |
| 650 | <para> |
| 651 | Each specific sample file is a hashed collection, where the key is |
| 652 | the PC offset from the transient data, and the value is the number of |
| 653 | samples recorded against that offset. The files are |
| 654 | <function>mmap()</function>ed into the daemon's memory space. The code |
| 655 | to actually log the write against the sample file can be found in |
| 656 | <filename>libdb/</filename>. |
| 657 | </para> |
| 658 | <para> |
| 659 | For recording stack traces, we have a more complicated sample filename |
| 660 | mangling scheme that allows us to identify cross-binary calls. We use |
| 661 | the same sample file format, where the key is a 64-bit value composed |
| 662 | from the from,to pair of offsets. |
| 663 | </para> |
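<para>
In other words, logging a hit is conceptually a one-liner against the
<filename>libdb</filename> API. The sketch below assumes an
increment-or-insert entry point named <function>odb_update_node()</function>,
which is how the interface is remembered here; check
<filename>libdb/odb.h</filename> for the authoritative names and signatures.
</para>
<screen><![CDATA[
#include "odb.h"   /* libdb's hashed, mmap()ed key/value store */

/* Record one sample against an already-open sample file. */
static void log_hit(odb_t *file, odb_key_t key)
{
        /* key is the PC offset (or, for call-graph files, a packed
         * from/to offset pair); the stored value is the sample count. */
        odb_update_node(file, key);   /* "value at key" += 1 */
}
]]></screen>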
| 664 | |
| 665 | </sect1> |
| 666 | |
| 667 | </chapter> |
| 668 | |
| 669 | <chapter id="output"> |
| 670 | <title>Generating useful output</title> |
| 671 | |
| 672 | <para> |
| 673 | All of the tools used to generate human-readable output have to take |
| 674 | roughly the same steps to collect the data for processing. First, the |
| 675 | profile specification given by the user has to be parsed. Next, a list |
of sample files matching the specification has to be obtained. Using this
| 677 | list, we need to locate the binary file for each sample file, and then |
| 678 | use them to extract meaningful data, before a final collation and |
| 679 | presentation to the user. |
| 680 | </para> |
| 681 | |
| 682 | <sect1 id="profile-specification"> |
| 683 | <title>Handling the profile specification</title> |
| 684 | |
| 685 | <para> |
| 686 | The profile specification presented by the user is parsed in |
| 687 | the function <function>profile_spec::create()</function>. This |
| 688 | creates an object representing the specification. Then we |
| 689 | use <function>profile_spec::generate_file_list()</function> |
| 690 | to search for all sample files and match them against the |
| 691 | <varname>profile_spec</varname>. |
| 692 | </para> |
| 693 | |
| 694 | <para> |
| 695 | To enable this matching process to work, the attributes of |
each sample file are encoded in its filename. This is a low-tech
approach to matching specifications against candidate sample
files, but it works reasonably well. Typical sample file names
look like these:
| 700 | </para> |
| 701 | <screen> |
| 702 | /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all |
| 703 | /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all |
| 704 | /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0 |
| 705 | /var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all |
| 706 | </screen> |
| 707 | <para> |
| 708 | This looks unnecessarily complex, but it's actually fairly simple. First |
we have the session of the sample, by default located at
<filename>/var/lib/oprofile/samples/current</filename>. This location
can be changed by specifying the <option>--session-dir</option> option on the command line.
| 712 | This session could equally well be inside an archive from <command>oparchive</command>. |
| 713 | Next we have one of the tokens <filename>{root}</filename> or |
| 714 | <filename>{kern}</filename>. <filename>{root}</filename> indicates |
| 715 | that the binary is found on a file system, and we will encode its path |
| 716 | in the next section (e.g. <filename>/bin/ls</filename>). |
| 717 | <filename>{kern}</filename> indicates a kernel module - on 2.6 kernels |
| 718 | the path information is not available from the kernel, so we have to |
| 719 | special-case kernel modules like this; we encode merely the name of the |
| 720 | module as loaded. |
| 721 | </para> |
| 722 | <para> |
| 723 | Next there is a <filename>{dep}</filename> token, indicating another |
| 724 | token/path which identifies the dependent binary image. This is used even for |
| 725 | the "primary" binary (i.e. the one that was |
| 726 | <function>execve()</function>d), as it simplifies processing. Finally, |
| 727 | if this sample file is a normal flat profile, the actual file is next in |
| 728 | the path. If it's a call-graph sample file, we need one further |
| 729 | specification, to allow us to identify cross-binary arcs in the call |
| 730 | graph. |
| 731 | </para> |
| 732 | <para> |
| 733 | The actual sample file name is dot-separated, where the fields are, in |
| 734 | order: event name, event count, unit mask, task group ID, task ID, and |
| 735 | CPU number. |
| 736 | </para> |
| 737 | <para> |
| 738 | This sample file can be reliably parsed (with |
| 739 | <function>parse_filename()</function>) into a |
| 740 | <varname>filename_spec</varname>. Finally, we can check whether to |
| 741 | include the sample file in the final results by comparing this |
| 742 | <varname>filename_spec</varname> against the |
| 743 | <varname>profile_spec</varname> the user specified (for the interested, |
| 744 | see <function>valid_candidate()</function> and |
| 745 | <function>profile_spec::match</function>). Then comes the really |
| 746 | complicated bit... |
| 747 | </para> |
| 748 | |
| 749 | </sect1> |
| 750 | |
| 751 | <sect1 id="sample-file-collating"> |
| 752 | <title>Collating the candidate sample files</title> |
| 753 | |
| 754 | <para> |
| 755 | At this point we have a duplicate-free list of sample files we need |
| 756 | to process. But first we need to do some further arrangement: we |
| 757 | need to classify each sample file, and we may also need to "invert" |
| 758 | the profiles. |
| 759 | </para> |
| 760 | |
| 761 | <sect2 id="sample-file-classifying"> |
| 762 | <title>Classifying sample files</title> |
| 763 | |
| 764 | <para> |
| 765 | It's possible for utilities like <command>opreport</command> to show |
| 766 | data in columnar format: for example, we might want to show the results |
| 767 | of two threads within a process side-by-side. To do this, we need |
| 768 | to classify each sample file into classes - the classes correspond |
to the <command>opreport</command> columns. The function that handles
| 770 | this is <function>arrange_profiles()</function>. Each sample file |
| 771 | is added to a particular class. If the sample file is the first in |
| 772 | its class, a template is generated from the sample file. Each template |
| 773 | describes a particular class (thus, in our example above, each template |
| 774 | will have a different thread ID, and this uniquely identifies each |
| 775 | class). |
| 776 | </para> |
| 777 | |
| 778 | <para> |
| 779 | Each class has a list of "profile sets" matching that class's template. |
| 780 | A profile set is either a profile of the primary binary image, or any of |
| 781 | its dependent images. After all sample files have been listed in one of |
| 782 | the profile sets belonging to the classes, we have to name each class and |
| 783 | perform error-checking. This is done by |
| 784 | <function>identify_classes()</function>; each class is checked to ensure |
| 785 | that its "axis" is the same as all the others. This is needed because |
| 786 | <command>opreport</command> can't produce results in 3D format: we can |
| 787 | only differ in one aspect, such as thread ID or event name. |
| 788 | </para> |
| 789 | |
| 790 | </sect2> |
| 791 | |
| 792 | <sect2 id="sample-file-inverting"> |
| 793 | <title>Creating inverted profile lists</title> |
| 794 | |
| 795 | <para> |
| 796 | Remember that if we're using certain profile separation options, such as |
| 797 | "--separate=lib", a single binary could be a dependent image to many |
| 798 | different binaries. For example, the C library image would be a |
| 799 | dependent image for most programs that have been profiled. As it |
| 800 | happens, this can cause severe performance problems: without some |
| 801 | re-arrangement, these dependent binary images would be opened each |
| 802 | time we need to process sample files for each program. |
| 803 | </para> |
| 804 | |
| 805 | <para> |
| 806 | The solution is to "invert" the profiles via |
| 807 | <function>invert_profiles()</function>. We create a new data structure |
| 808 | where the dependent binary is first, and the primary binary images using |
| 809 | that dependent binary are listed as sub-images. This helps our |
| 810 | performance problem, as now we only need to open each dependent image |
| 811 | once, when we process the list of inverted profiles. |
| 812 | </para> |
| 813 | |
| 814 | </sect2> |
| 815 | |
| 816 | </sect1> |
| 817 | |
| 818 | <sect1 id="generating-profile-data"> |
| 819 | <title>Generating profile data</title> |
| 820 | |
| 821 | <para> |
Things don't get any simpler at this point, unfortunately. By now
| 823 | we've collected and classified the sample files into the set of inverted |
| 824 | profiles, as described in the previous section. Now we need to process |
| 825 | each inverted profile and make something of the data. The entry point |
| 826 | for this is <function>populate_for_image()</function>. |
| 827 | </para> |
| 828 | |
| 829 | <sect2 id="bfd"> |
| 830 | <title>Processing the binary image</title> |
| 831 | <para> |
| 832 | The first thing we do with an inverted profile is attempt to open the |
| 833 | binary image (remember each inverted profile set is only for one binary |
| 834 | image, but may have many sample files to process). The |
| 835 | <varname>op_bfd</varname> class provides an abstracted interface to |
| 836 | this; internally it uses <filename>libbfd</filename>. The main purpose |
| 837 | of this class is to process the symbols for the binary image; this is |
| 838 | also where symbol filtering happens. This is actually quite tricky, but |
| 839 | should be clear from the source. |
| 840 | </para> |
| 841 | </sect2> |
| 842 | |
| 843 | <sect2 id="processing-sample-files"> |
| 844 | <title>Processing the sample files</title> |
| 845 | <para> |
| 846 | The class <varname>profile_container</varname> is a hold-all that |
| 847 | contains all the processed results. It is a container of |
| 848 | <varname>profile_t</varname> objects. The |
| 849 | <function>add_sample_files()</function> method uses |
| 850 | <filename>libdb</filename> to open the given sample file and add the |
| 851 | key/value types to the <varname>profile_t</varname>. Once this has been |
| 852 | done, <function>profile_container::add()</function> is passed the |
| 853 | <varname>profile_t</varname> plus the <varname>op_bfd</varname> for |
| 854 | processing. |
| 855 | </para> |
| 856 | <para> |
| 857 | <function>profile_container::add()</function> walks through the symbols |
| 858 | collected in the <varname>op_bfd</varname>. |
| 859 | <function>op_bfd::get_symbol_range()</function> gives us the start and |
| 860 | end of the symbol as an offset from the start of the binary image, |
| 861 | then we interrogate the <varname>profile_t</varname> for the relevant samples |
| 862 | for that offset range. We create a <varname>symbol_entry</varname> |
| 863 | object for this symbol and fill it in. If needed, here we also collect |
| 864 | debug information from the <varname>op_bfd</varname>, and possibly |
| 865 | record the detailed sample information (as used by <command>opreport |
| 866 | -d</command> and <command>opannotate</command>). |
| 867 | Finally the <varname>symbol_entry</varname> is added to |
| 868 | a private container of <varname>profile_container</varname> - this |
| 869 | <varname>symbol_container</varname> holds all such processed symbols. |
| 870 | </para> |
| 871 | </sect2> |
| 872 | |
| 873 | </sect1> |
| 874 | |
| 875 | <sect1 id="generating-output"> |
| 876 | <title>Generating output</title> |
| 877 | |
| 878 | <para> |
| 879 | After the processing described in the previous section, we've now got |
| 880 | full details of what we need to output stored in the |
| 881 | <varname>profile_container</varname> on a symbol-by-symbol basis. To |
| 882 | produce output, we need to replay that data and format it suitably. |
| 883 | </para> |
| 884 | <para> |
| 885 | <command>opreport</command> first asks the |
| 886 | <varname>profile_container</varname> for a |
| 887 | <varname>symbol_collection</varname> (this is also where thresholding |
| 888 | happens). |
This is sorted, then an
| 890 | <varname>opreport_formatter</varname> is initialised. |
| 891 | This object initialises a set of field formatters as requested. Then |
| 892 | <function>opreport_formatter::output()</function> is called. This |
| 893 | iterates through the (sorted) <varname>symbol_collection</varname>; |
| 894 | for each entry, the selected fields (as set by the |
| 895 | <varname>format_flags</varname> options) are output by calling the |
| 896 | field formatters, with the <varname>symbol_entry</varname> passed in. |
| 897 | </para> |
| 898 | |
| 899 | </sect1> |
| 900 | |
| 901 | </chapter> |
| 902 | |
| 903 | <chapter id="ext"> |
| 904 | <title>Extended Feature Interface</title> |
| 905 | |
| 906 | <sect1 id="ext-intro"> |
| 907 | <title>Introduction</title> |
| 908 | |
| 909 | <para> |
| 910 | The Extended Feature Interface is a standard callback interface |
| 911 | designed to allow extension to the OProfile daemon's sample processing. |
| 912 | Each feature defines a set of callback handlers which can be enabled or |
| 913 | disabled through the OProfile daemon's command-line option. |
| 914 | This interface can be used to implement support for architecture-specific |
| 915 | features or features not commonly used by general OProfile users. |
| 916 | </para> |
| 917 | |
| 918 | </sect1> |
| 919 | |
| 920 | <sect1 id="ext-name-and-handlers"> |
| 921 | <title>Feature Name and Handlers</title> |
| 922 | |
| 923 | <para> |
| 924 | Each extended feature has an entry in the <varname>ext_feature_table</varname> |
| 925 | in <filename>opd_extended.cpp</filename>. Each entry contains a feature name, |
and a corresponding set of handlers. The feature name is a unique string, which is
| 927 | used to identify a feature in the table. Each feature provides a set |
| 928 | of handlers, which will be executed by the OProfile daemon from pre-determined |
| 929 | locations to perform certain tasks. At runtime, the OProfile daemon calls a feature |
| 930 | handler wrapper from one of the predetermined locations to check whether |
| 931 | an extended feature is enabled, and whether a particular handler exists. |
| 932 | Only the handlers of the enabled feature will be executed. |
| 933 | </para> |
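<para>
The table and its handlers have roughly the shape sketched below; the member
names are recalled from memory and should be checked against
<filename>opd_extended.cpp</filename> before being relied upon.
</para>
<screen><![CDATA[
/* Rough shape of the extended-feature table; names are illustrative. */
struct opd_ext_sfile_handlers;

struct opd_ext_handlers {
        int (*ext_init)(char const *args);         /* --ext-feature=name:args  */
        int (*ext_print_stats)(void);              /* extra statistics section */
        struct opd_ext_sfile_handlers *ext_sfile;  /* sample-file operations   */
};

struct opd_ext_feature {
        char const *feature;                       /* e.g. "ibs" */
        struct opd_ext_handlers *handlers;
};

extern struct opd_ext_handlers ibs_handlers;       /* the IBS implementation */

static struct opd_ext_feature ext_feature_table[] = {
        { "ibs", &ibs_handlers },
        { NULL, NULL }
};
]]></screen>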
| 934 | |
| 935 | </sect1> |
| 936 | |
| 937 | <sect1 id="ext-enable"> |
| 938 | <title>Enabling Features</title> |
| 939 | |
| 940 | <para> |
| 941 | Each feature is enabled using the OProfile daemon (oprofiled) command-line |
option "--ext-feature=&lt;extended-feature-name&gt;:[args]". The
| 943 | "extended-feature-name" is used to determine the feature to be enabled. |
| 944 | The optional "args" is passed into the feature-specific initialization handler |
| 945 | (<function>ext_init</function>). Currently, only one extended feature can be |
| 946 | enabled at a time. |
| 947 | </para> |
| 948 | |
| 949 | </sect1> |
| 950 | |
| 951 | <sect1 id="ext-types-of-handlers"> |
| 952 | <title>Type of Handlers</title> |
| 953 | |
| 954 | <para> |
| 955 | Each feature is responsible for providing its own set of handlers. |
| 956 | Types of handler are: |
| 957 | </para> |
| 958 | |
| 959 | <sect2 id="ext_init"> |
| 960 | <title>ext_init Handler</title> |
| 961 | |
| 962 | <para> |
| 963 | "ext_init" handles initialization of an extended feature. It takes |
| 964 | "args" parameter which is passed in through the "oprofiled --ext-feature=< |
| 965 | extended-feature-name>:[args]". This handler is executed in the function |
| 966 | <function>opd_options()</function> in the file <filename>daemon/oprofiled.c |
| 967 | </filename>. |
| 968 | </para> |
| 969 | |
| 970 | <note> |
| 971 | <para> |
| 972 | The ext_init handler is required for all features. |
| 973 | </para> |
| 974 | </note> |
| 975 | |
| 976 | </sect2> |
| 977 | |
| 978 | <sect2 id="ext_print_stats"> |
| 979 | <title>ext_print_stats Handler</title> |
| 980 | |
| 981 | <para> |
| 982 | "ext_print_stats" handles the extended feature statistics report. It adds |
| 983 | a new section in the OProfile daemon statistics report, which is normally |
output to the file
| 985 | <filename>/var/lib/oprofile/samples/oprofiled.log</filename>. |
| 986 | This handler is executed in the function <function>opd_print_stats()</function> |
| 987 | in the file <filename>daemon/opd_stats.c</filename>. |
| 988 | </para> |
| 989 | |
| 990 | </sect2> |
| 991 | |
| 992 | <sect2 id="ext_sfile_handlers"> |
| 993 | <title>ext_sfile Handler</title> |
| 994 | |
| 995 | <para> |
| 996 | "ext_sfile" contains a set of handlers related to operations on the extended |
sample files (sample files for events related to an extended feature).
| 998 | These operations include <function>create_sfile()</function>, |
| 999 | <function>sfile_dup()</function>, <function>close_sfile()</function>, |
| 1000 | <function>sync_sfile()</function>, and <function>get_file()</function> |
| 1001 | as defined in <filename>daemon/opd_sfile.c</filename>. |
| 1002 | An additional field, <varname>odb_t * ext_file</varname>, is added to the |
| 1003 | <varname>struct sfile</varname> for storing extended sample files |
| 1004 | information. |
| 1005 | |
| 1006 | </para> |
| 1007 | |
| 1008 | </sect2> |
| 1009 | |
| 1010 | </sect1> |
| 1011 | |
| 1012 | <sect1 id="ext-implementation"> |
| 1013 | <title>Extended Feature Reference Implementation</title> |
| 1014 | |
| 1015 | <sect2 id="ext-ibs"> |
| 1016 | <title>Instruction-Based Sampling (IBS)</title> |
| 1017 | |
| 1018 | <para> |
| 1019 | An example of extended feature implementation can be seen by |
| 1020 | examining the AMD Instruction-Based Sampling support. |
| 1021 | </para> |
| 1022 | |
| 1023 | <sect3 id="ibs-init"> |
| 1024 | <title>IBS Initialization</title> |
| 1025 | |
| 1026 | <para> |
| 1027 | Instruction-Based Sampling (IBS) is a new performance measurement technique |
| 1028 | available on AMD Family 10h processors. Enabling IBS profiling is done simply |
| 1029 | by specifying IBS performance events through the "--event=" options. |
| 1030 | </para> |
| 1031 | |
| 1032 | <screen> |
opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;

Note: * Count and unitmask for all IBS fetch events must be the same,
        as must those for IBS op.
| 1038 | </screen> |
| 1039 | |
| 1040 | <para> |
IBS performance events are listed by <command>opcontrol --list-events</command>.
When users specify these events, opcontrol verifies them using ophelp, which
checks for the <varname>ext:ibs_fetch</varname> or <varname>ext:ibs_op</varname>
tag in the <filename>events/x86-64/family10/events</filename> file.
| 1045 | Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and |
| 1046 | /dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows. |
| 1047 | </para> |
| 1048 | |
| 1049 | <screen> |
| 1050 | oprofiled \ |
| 1051 | --ext-feature=ibs:\ |
fetch:&lt;IBS_FETCH_EVENT1&gt;,&lt;IBS_FETCH_EVENT2&gt;,...,:&lt;IBS fetch count&gt;:&lt;IBS Fetch um&gt;|\
op:&lt;IBS_OP_EVENT1&gt;,&lt;IBS_OP_EVENT2&gt;,...,:&lt;IBS op count&gt;:&lt;IBS op um&gt;
| 1054 | </screen> |
| 1055 | |
| 1056 | <para> |
| 1057 | Here, the OProfile daemon parses the <varname>--ext-feature</varname> |
option and checks the feature name ("ibs") before calling
the initialization function to handle the string
| 1060 | containing IBS events, counts, and unitmasks. |
| 1061 | Then, it stores each event in the IBS virtual-counter table |
| 1062 | (<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and |
| 1063 | stores the event index in the IBS Virtual Counter Index (VCI) map |
| 1064 | (<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with IBS event value |
| 1065 | as the map key. |
| 1066 | </para> |
| 1067 | </sect3> |
| 1068 | |
| 1069 | <sect3 id="ibs-data-processing"> |
| 1070 | <title>IBS Data Processing</title> |
| 1071 | |
| 1072 | <para> |
| 1073 | During a profile session, the OProfile daemon identifies IBS samples in the |
| 1074 | event buffer using the <varname>"IBS_FETCH_CODE"</varname> or |
| 1075 | <varname>"IBS_OP_CODE"</varname>. These codes trigger the handlers |
| 1076 | <function>code_ibs_fetch_sample()</function> or |
| 1077 | <function>code_ibs_op_sample()</function> listed in the |
| 1078 | <varname>handler_t handlers[]</varname> vector in |
<filename>daemon/opd_trans.c</filename>. These handlers are responsible for
processing IBS samples and translating them into IBS performance events.
| 1081 | </para> |
| 1082 | |
| 1083 | <para> |
Unlike a traditional performance event, each IBS sample can yield
multiple IBS performance events. For each event that the user specifies,
a combination of bits from Model-Specific Registers (MSRs) is checked
| 1087 | against the bitmask defining the event. If the condition is met, the event |
| 1088 | will then be recorded. The derivation logic is in the files |
| 1089 | <filename>daemon/opd_ibs_macro.h</filename> and |
| 1090 | <filename>daemon/opd_ibs_trans.[h,c]</filename>. |
| 1091 | </para> |
| 1092 | |
| 1093 | </sect3> |
| 1094 | |
| 1095 | <sect3 id="ibs-sample-file"> |
| 1096 | <title>IBS Sample File</title> |
| 1097 | |
| 1098 | <para> |
Traditionally, sample file information (<varname>odb_t</varname>) is stored
in <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>.
Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha, and 20 on
alpha-based systems. The event index (the counter number on which the event
is configured) is used to access the corresponding entry in the array.
Unlike traditional performance events, IBS does not use the actual
counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>).
Also, the number of performance events generated by IBS could be larger than
<varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op
events). Therefore IBS requires a special data structure and sfile
handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing
IBS sample files. IBS sample file information is stored in memory
allocated by the handler <function>ibs_sfile_create()</function>, which can
be accessed through <varname>struct sfile::odb_t * ext_files</varname>.
| 1113 | </para> |
| 1114 | |
| 1115 | </sect3> |
| 1116 | |
| 1117 | </sect2> |
| 1118 | |
| 1119 | </sect1> |
| 1120 | |
| 1121 | </chapter> |
| 1122 | |
| 1123 | <glossary id="glossary"> |
| 1124 | <title>Glossary of OProfile source concepts and types</title> |
| 1125 | |
| 1126 | <glossentry><glossterm>application image</glossterm> |
| 1127 | <glossdef><para> |
| 1128 | The primary binary image used by an application. This is derived |
| 1129 | from the kernel and corresponds to the binary started upon running |
| 1130 | an application: for example, <filename>/bin/bash</filename>. |
| 1131 | </para></glossdef></glossentry> |
| 1132 | |
| 1133 | <glossentry><glossterm>binary image</glossterm> |
| 1134 | <glossdef><para> |
| 1135 | An ELF file containing executable code: this includes kernel modules, |
| 1136 | the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries, |
| 1137 | and application binaries. |
| 1138 | </para></glossdef></glossentry> |
| 1139 | |
| 1140 | <glossentry><glossterm>dcookie</glossterm> |
| 1141 | <glossdef><para> |
| 1142 | Short for "dentry cookie". A unique ID that can be looked up to provide |
| 1143 | the full path name of a binary image. |
| 1144 | </para></glossdef></glossentry> |
| 1145 | |
| 1146 | <glossentry><glossterm>dependent image</glossterm> |
| 1147 | <glossdef><para> |
| 1148 | A binary image that is dependent upon an application, used with |
| 1149 | per-application separation. Most commonly, shared libraries. For example, |
| 1150 | if <filename>/bin/bash</filename> is running and we take |
| 1151 | some samples inside the C library itself due to <command>bash</command> |
| 1152 | calling library code, then the image <filename>/lib/libc.so</filename> |
| 1153 | would be dependent upon <filename>/bin/bash</filename>. |
| 1154 | </para></glossdef></glossentry> |
| 1155 | |
| 1156 | <glossentry><glossterm>merging</glossterm> |
| 1157 | <glossdef><para> |
| 1158 | This refers to the ability to merge several distinct sample files |
| 1159 | into one set of data at runtime, in the post-profiling tools. For example, |
| 1160 | per-thread sample files can be merged into one set of data, because |
| 1161 | they are compatible (i.e. the aggregation of the data is meaningful), |
| 1162 | but it's not possible to merge sample files for two different events, |
| 1163 | because there would be no useful meaning to the results. |
| 1164 | </para></glossdef></glossentry> |
| 1165 | |
| 1166 | <glossentry><glossterm>profile class</glossterm> |
| 1167 | <glossdef><para> |
| 1168 | A collection of profile data that has been collected under the same |
| 1169 | class template. For example, if we're using <command>opreport</command> |
to show results after profiling with two performance counters enabled,
counting <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>,
| 1172 | there would be two profile classes, one for each event. Or if we're on |
| 1173 | an SMP system and doing per-cpu profiling, and we request |
| 1174 | <command>opreport</command> to show results for each CPU side-by-side, |
| 1175 | there would be a profile class for each CPU. |
| 1176 | </para></glossdef></glossentry> |
| 1177 | |
| 1178 | <glossentry><glossterm>profile specification</glossterm> |
| 1179 | <glossdef><para> |
| 1180 | The parameters the user passes to the post-profiling tools that limit |
| 1181 | what sample files are used. This specification is matched against |
| 1182 | the available sample files to generate a selection of profile data. |
| 1183 | </para></glossdef></glossentry> |
| 1184 | |
| 1185 | <glossentry><glossterm>profile template</glossterm> |
| 1186 | <glossdef><para> |
| 1187 | The parameters that define what goes in a particular profile class. |
| 1188 | This includes a symbolic name (e.g. "cpu:1") and the code-usable |
| 1189 | equivalent. |
| 1190 | </para></glossdef></glossentry> |
| 1191 | |
| 1192 | </glossary> |
| 1193 | |
| 1194 | </book> |