| <html> |
| <head> |
| <title>Cachegrind: a cache-miss profiler</title> |
| </head> |
| |
| <body> |
| <a name="cg-top"></a> |
| <h2>4 <b>Cachegrind</b>: a cache-miss profiler</h2> |
| |
| To use this tool, you must specify <code>--tool=cachegrind</code> |
| on the Valgrind command line. |
| |
| <p> |
| Detailed technical documentation on how Cachegrind works is available |
| <A HREF="cg_techdocs.html">here</A>. If you want to know how |
| to <b>use</b> it, you only need to read this page. |
| |
| |
| <a name="cache"></a> |
| <h3>4.1 Cache profiling</h3> |
| Cachegrind is a tool for doing cache simulations and annotating your source |
| line-by-line with the number of cache misses. In particular, it records: |
| <ul> |
| <li>L1 instruction cache reads and misses; |
| <li>L1 data cache reads and read misses, writes and write misses; |
| <li>L2 unified cache reads and read misses, writes and write misses. |
| </ul> |
| On a modern x86 machine, an L1 miss will typically cost around 10 cycles, |
| and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be |
| very useful for improving the performance of your program.<p> |
| |
| Also, since one instruction cache read is performed per instruction executed, |
| you can find out how many instructions are executed per line, which can be |
| useful for traditional profiling and test coverage.<p> |
| |
| Any feedback, bug-fixes, suggestions, etc, welcome. |
| |
| |
| <h3>4.2 Overview</h3> |
| First off, as for normal Valgrind use, you probably want to compile with |
| debugging info (the <code>-g</code> flag). But by contrast with normal |
| Valgrind use, you probably <b>do</b> want to turn optimisation on, since you |
| should profile your program as it will be normally run. |
| |
| The two steps are: |
| <ol> |
| <li>Run your program with <code>valgrind --tool=cachegrind</code> in front of |
| the normal command line invocation. When the program finishes, |
| Cachegrind will print summary cache statistics. It also collects |
| line-by-line information in a file |
| <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code> |
| is the program's process id. |
| <p> |
| This step should be done every time you want to collect |
| information about a new program, a changed program, or about the |
| same program with different input. |
| </li><p> |
| <li>Generate a function-by-function summary, and possibly annotate |
| source files, using the supplied |
| <code>cg_annotate</code> program. Source files to annotate can be |
| specified manually on the command line, or |
| "interesting" source files can be annotated automatically with |
| the <code>--auto=yes</code> option. You can annotate C/C++ |
| files or assembly language files equally easily. |
| <p> |
| This step can be performed as many times as you like for each |
| Step 1. You may want to do multiple annotations showing |
| different information each time. |
| </li><p> |
| </ol> |
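| <p> |
| The two steps above can be sketched as a pair of command lines. The Python |
| helpers below only build the argument lists and do not run anything; the |
| function names are hypothetical, and the options are those described on this |
| page. |

```python
def cachegrind_command(program_args):
    """Step 1: put valgrind --tool=cachegrind in front of the normal invocation."""
    return ["valgrind", "--tool=cachegrind"] + list(program_args)


def annotate_command(pid, source_files=(), auto=False):
    """Step 2: cg_annotate picks cachegrind.out.PID via the --PID argument."""
    cmd = ["cg_annotate", "--%d" % pid]
    if auto:
        cmd.append("--auto=yes")   # annotate "interesting" files automatically
    cmd.extend(source_files)       # files chosen manually for annotation
    return cmd


if __name__ == "__main__":
    print(" ".join(cachegrind_command(["ls", "-l"])))
    print(" ".join(annotate_command(31751, ["concord.c"])))
```

| Step 2 can be repeated with different arguments against the same output file. |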
| |
| The steps are described in detail in the following sections. |
| |
| |
| <h3>4.3 Cache simulation specifics</h3> |
| |
| Cachegrind uses a simulation for a machine with a split L1 cache and a unified |
| L2 cache. This configuration is used for all (modern) x86-based machines we |
| are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are |
| ancient history now.<p> |
| |
| The more specific characteristics of the simulation are as follows. |
| |
| <ul> |
| <li>Write-allocate: when a write miss occurs, the block written to |
| is brought into the D1 cache. Most modern caches have this |
| property.<p> |
| </li> |
| <p> |
| <li>Bit-selection hash function: the line(s) in the cache to which a |
| memory block maps is chosen by the middle bits M--(M+N-1) of the |
| byte address, where: |
| <ul> |
| <li> line size = 2^M bytes </li> |
| <li>(cache size / line size) = 2^N</li> |
| </ul> |
| </li> |
| <p> |
| <li>Inclusive L2 cache: the L2 cache replicates all the entries of |
| the L1 cache. This is standard on Pentium chips, but AMD |
| Athlons use an exclusive L2 cache that only holds blocks evicted |
| from L1. Ditto AMD Durons and most modern VIAs.</li> |
| </ul> |
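| <p> |
| As an illustration of the bit-selection rule, here is a minimal Python sketch |
| that extracts bits M--(M+N-1) of a byte address exactly as the formula above |
| states it. The function name is hypothetical. For a set-associative cache one |
| would presumably divide by the associativity as well; that is an assumption, |
| not something stated on this page. |

```python
def set_index(address, line_size, cache_size):
    """Return bits M..(M+N-1) of address, following the formula in the text."""
    m = line_size.bit_length() - 1                  # line size = 2^M
    n = (cache_size // line_size).bit_length() - 1  # cache size / line size = 2^N
    return (address >> m) & ((1 << n) - 1)


if __name__ == "__main__":
    # 64-byte lines in a 65536-byte cache: M = 6, N = 10
    for addr in (0, 63, 64, 65536):
        print(hex(addr), "->", set_index(addr, 64, 65536))
```

| Addresses 0 and 63 share a line, 64 starts the next one, and 65536 wraps |
| back to index 0. |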
| |
| The cache configuration simulated (cache size, associativity and line size) is |
| determined automagically using the CPUID instruction. If you have an old |
| machine that (a) doesn't support the CPUID instruction, or (b) supports it in |
| an early incarnation that doesn't give any cache information, then Cachegrind |
| will fall back to using a default configuration (that of a model 3/4 Athlon). |
| Cachegrind will tell you if this happens. You can manually specify one, two or |
| all three levels (I1/D1/L2) of the cache from the command line using the |
| <code>--I1</code>, <code>--D1</code> and <code>--L2</code> options. |
| |
| <p> |
| Other noteworthy behaviour: |
| |
| <ul> |
| <li>References that straddle two cache lines are treated as follows: |
| <ul> |
| <li>If both blocks hit --> counted as one hit</li> |
| <li>If one block hits, the other misses --> counted as one miss</li> |
| <li>If both blocks miss --> counted as one miss (not two)</li> |
| </ul> |
| </li> |
| |
| <li>Instructions that modify a memory location (eg. <code>inc</code> and |
| <code>dec</code>) are counted as doing just a read, ie. a single data |
| reference. This may seem strange, but since the write can never cause a |
| miss (the read guarantees the block is in the cache) it's not very |
| interesting. |
| <p> |
| Thus it measures not the number of times the data cache is accessed, but |
| the number of times a data cache miss could occur.<p> |
| </li> |
| </ul> |
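| <p> |
| The straddling rule can be illustrated with a toy simulator. This is a |
| sketch only: it uses a direct-mapped cache for brevity, whereas Cachegrind's |
| own simulation is set-associative, and the class and method names are |
| hypothetical. |

```python
class DirectMappedCache:
    """Toy direct-mapped cache used only to show the counting rule."""

    def __init__(self, cache_size, line_size):
        self.line_size = line_size
        self.num_lines = cache_size // line_size
        self.tags = [None] * self.num_lines
        self.refs = 0
        self.misses = 0

    def _touch(self, block):
        """Look up one block, loading it on a miss. Return True on a hit."""
        slot = block % self.num_lines
        hit = self.tags[slot] == block
        self.tags[slot] = block  # the block is brought in on any miss
        return hit

    def access(self, address, size):
        """Count one reference and at most one miss, even across two lines."""
        first = address // self.line_size
        last = (address + size - 1) // self.line_size
        hits = [self._touch(block) for block in range(first, last + 1)]
        self.refs += 1
        if not all(hits):
            self.misses += 1
        return all(hits)


if __name__ == "__main__":
    cache = DirectMappedCache(1024, 64)
    cache.access(60, 8)    # straddles blocks 0 and 1, both miss: one miss
    cache.access(60, 8)    # both blocks now hit: one hit
    cache.access(124, 8)   # block 1 hits, block 2 misses: one miss
    print(cache.refs, cache.misses)
```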
| |
| If you are interested in simulating a cache with different properties, it is |
| not particularly hard to write your own cache simulator, or to modify the |
| existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>, |
| <code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be |
| interested to hear from anyone who does. |
| |
| |
| <a name="profile"></a> |
| <h3>4.4 Profiling programs</h3> |
| |
| To gather cache profiling information about the program <code>ls -l</code>, |
| invoke Cachegrind like this: |
| |
| <blockquote><code>valgrind --tool=cachegrind ls -l</code></blockquote> |
| |
| The program will execute (slowly). Upon completion, summary statistics |
| that look like this will be printed: |
| |
| <pre> |
| ==31751== I   refs:      27,742,716 |
| ==31751== I1  misses:           276 |
| ==31751== L2  misses:           275 |
| ==31751== I1  miss rate:        0.0% |
| ==31751== L2i miss rate:        0.0% |
| ==31751== |
| ==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr) |
| ==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr) |
| ==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr) |
| ==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%  ) |
| ==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%  ) |
| ==31751== |
| ==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr) |
| ==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%  ) |
| </pre> |
| |
| Cache accesses for instruction fetches are summarised first, giving the |
| number of fetches made (this is the number of instructions executed, which |
| can be useful to know in its own right), the number of I1 misses, and the |
| number of L2 instruction (<code>L2i</code>) misses. |
| <p> |
| Cache accesses for data follow. The information is similar to that of the |
| instruction fetches, except that the values are also shown split between reads |
| and writes (note each row's <code>rd</code> and <code>wr</code> values add up |
| to the row's total). |
| <p> |
| Combined instruction and data figures for the L2 cache follow that. |
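| <p> |
| The relationship between the columns can be checked mechanically. The Python |
| sketch below parses three of the data rows from the example above and |
| confirms that <code>rd</code> + <code>wr</code> equals each row's total. The |
| regular expression and function names are hypothetical, and the miss rate is |
| assumed to be misses divided by references; the D1 figure works out to |
| roughly 0.27%, which the summary shows as 0.2%. |

```python
import re

SUMMARY = """\
==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
"""

# label, total, read part, write part
ROW = re.compile(
    r"==\d+==\s+(.+?):\s+([\d,]+)\s+\(\s*([\d,]+) rd \+\s*([\d,]+) wr\)"
)


def to_int(text):
    """Convert '15,430,290' to 15430290."""
    return int(text.replace(",", ""))


def parse_rows(summary):
    """Return (label, total, rd, wr) for every row that has a rd/wr split."""
    rows = []
    for line in summary.splitlines():
        match = ROW.match(line)
        if match:
            label, total, rd, wr = match.groups()
            rows.append((label, to_int(total), to_int(rd), to_int(wr)))
    return rows


def percent(misses, refs):
    """Miss rate as a percentage (assumed definition: misses / refs)."""
    return 100.0 * misses / refs


if __name__ == "__main__":
    for label, total, rd, wr in parse_rows(SUMMARY):
        print(label, total, rd + wr == total)
    print("D1 miss rate: %.2f%%" % percent(41185, 15430290))
```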
| |
| |
| <h3>4.5 Output file</h3> |
| |
| As well as printing summary information, Cachegrind also writes |
| line-by-line cache profiling information to a file named |
| <code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is |
| best interpreted by the accompanying program <code>cg_annotate</code>, |
| described in the next section. |
| <p> |
| Things to note about the <code>cachegrind.out.<i>pid</i></code> file: |
| <ul> |
| <li>It is written every time Cachegrind |
| is run, and will overwrite any existing |
| <code>cachegrind.out.<i>pid</i></code> in the current directory (but |
| that won't happen very often because it takes some time for process ids |
| to be recycled).</li><p> |
| <li>It can be huge: <code>ls -l</code> generates a file of about |
| 350KB. Browsing a few files and web pages with a Konqueror |
| built with full debugging information generates a file |
| of around 15 MB.</li> |
| </ul> |
| |
| Note that older versions of Cachegrind used a log file named |
| <code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix). |
| The suffix serves two purposes. Firstly, it means you don't have to |
| rename old log files that you don't want to overwrite. Secondly, and |
| more importantly, it allows correct profiling with the |
| <code>--trace-children=yes</code> option of programs that spawn child |
| processes. |
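| <p> |
| With <code>--trace-children=yes</code> a single run can leave several output |
| files behind. A small Python sketch for listing them by process id follows; |
| the function name is hypothetical and only the file naming scheme described |
| above is assumed. |

```python
import glob
import os


def find_cachegrind_outputs(directory="."):
    """Return {pid: path} for every cachegrind.out.PID file in directory."""
    outputs = {}
    for path in glob.glob(os.path.join(directory, "cachegrind.out.*")):
        suffix = os.path.basename(path).rsplit(".", 1)[1]
        if suffix.isdigit():  # skip files whose suffix is not a process id
            outputs[int(suffix)] = path
    return outputs


if __name__ == "__main__":
    for pid, path in sorted(find_cachegrind_outputs().items()):
        print(pid, path)
```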
| |
| |
| <a name="profileflags"></a> |
| <h3>4.6 Cachegrind options</h3> |
| |
| Cache-simulation specific options are: |
| |
| <ul> |
| <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br> |
| <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br> |
| <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p> |
| [default: uses CPUID for automagic cache configuration]<p> |
| |
| Manually specifies the I1/D1/L2 cache configuration, where |
| <code>size</code> and <code>line_size</code> are measured in bytes. The |
| three items must be comma-separated, but with no spaces, eg: |
| |
| <blockquote> |
| <code>valgrind --tool=cachegrind --I1=65536,2,64</code> |
| </blockquote> |
| |
| You can specify one, two or three of the I1/D1/L2 caches. Any level not |
| manually specified will be simulated using the configuration found in the |
| normal way (via the CPUID instruction, or failing that, via defaults). |
| </ul> |
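| <p> |
| To make the three comma-separated items concrete, here is a Python sketch |
| that splits such a value and derives the number of sets using the usual |
| cache arithmetic (size divided by associativity times line size). The |
| function names are hypothetical and this is not how Cachegrind itself parses |
| the option. |

```python
def parse_cache_option(value):
    """Split 'size,associativity,line_size' into three integers."""
    if " " in value:
        raise ValueError("the three items must be comma-separated, no spaces")
    size, associativity, line_size = (int(item) for item in value.split(","))
    return size, associativity, line_size


def number_of_sets(size, associativity, line_size):
    """Each set holds `associativity` lines of `line_size` bytes."""
    return size // (associativity * line_size)


if __name__ == "__main__":
    config = parse_cache_option("65536,2,64")
    print(config, number_of_sets(*config))
```

| A 65536-byte, 2-way cache with 64-byte lines therefore has 512 sets. |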
| |
| |
| <a name="annotate"></a> |
| <h3>4.7 Annotating C/C++ programs</h3> |
| |
| Before using <code>cg_annotate</code>, it is worth widening your |
| window to be at least 120 characters wide if possible, as the output |
| lines can be quite long. |
| <p> |
| To get a function-by-function summary, run <code>cg_annotate |
| --<i>pid</i></code> in a directory containing a |
| <code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code> |
| is required so that <code>cg_annotate</code> knows which log file to use when |
| several are present. |
| <p> |
| The output looks like this: |
| |
| <pre> |
| -------------------------------------------------------------------------------- |
| I1 cache: 65536 B, 64 B, 2-way associative |
| D1 cache: 65536 B, 64 B, 2-way associative |
| L2 cache: 262144 B, 64 B, 8-way associative |
| Command: concord vg_to_ucode.c |
| Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| Threshold: 99% |
| Chosen for annotation: |
| Auto-annotation: on |
| |
| -------------------------------------------------------------------------------- |
| Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| -------------------------------------------------------------------------------- |
| 27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS |
| |
| -------------------------------------------------------------------------------- |
| Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function |
| -------------------------------------------------------------------------------- |
| 8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc |
| 5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word |
| 2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp |
| 2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash |
| 2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower |
| 1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert |
| 897,991 51 51 897,831 95 30 62 1 1 ???:??? |
| 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile |
| 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile |
| 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc |
| 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing |
| 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER |
| 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table |
| 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create |
| 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 |
| 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 |
| 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node |
| 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue |
| </pre> |
| |
| First up is a summary of the annotation options: |
| |
| <ul> |
| <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the |
| configuration with which these results were obtained.</li><p> |
| |
| <li>Command: the command line invocation of the program under |
| examination.</li><p> |
| |
| <li>Events recorded: event abbreviations are:<p> |
| <ul> |
| <li><code>Ir </code>: I cache reads (ie. instructions executed)</li> |
| <li><code>I1mr</code>: I1 cache read misses</li> |
| <li><code>I2mr</code>: L2 cache instruction read misses</li> |
| <li><code>Dr </code>: D cache reads (ie. memory reads)</li> |
| <li><code>D1mr</code>: D1 cache read misses</li> |
| <li><code>D2mr</code>: L2 cache data read misses</li> |
| <li><code>Dw </code>: D cache writes (ie. memory writes)</li> |
| <li><code>D1mw</code>: D1 cache write misses</li> |
| <li><code>D2mw</code>: L2 cache data write misses</li> |
| </ul><p> |
| Note that D1 total misses is given by <code>D1mr</code> + |
| <code>D1mw</code>, and that L2 total misses is given by |
| <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p> |
| |
| <li>Events shown: the events shown (a subset of events gathered). This can |
| be adjusted with the <code>--show</code> option.</li><p> |
| |
| <li>Event sort order: the sort order in which functions are shown. For |
| example, in this case the functions are sorted from highest |
| <code>Ir</code> counts to lowest. If two functions have identical |
| <code>Ir</code> counts, they will then be sorted by <code>I1mr</code> |
| counts, and so on. This order can be adjusted with the |
| <code>--sort</code> option.<p> |
| |
| Note that this dictates the order the functions appear. It is <b>not</b> |
| the order in which the columns appear; that is dictated by the "events |
| shown" line (and can be changed with the <code>--show</code> option). |
| </li><p> |
| |
| <li>Threshold: <code>cg_annotate</code> by default omits functions |
| that cause very low numbers of misses to avoid drowning you in |
| information. In this case, cg_annotate shows summaries of the |
| functions that account for 99% of the <code>Ir</code> counts; |
| <code>Ir</code> is chosen as the threshold event since it is the |
| primary sort event. The threshold can be adjusted with the |
| <code>--threshold</code> option.</li><p> |
| |
| <li>Chosen for annotation: names of files specified manually for annotation; |
| in this case none.</li><p> |
| |
| <li>Auto-annotation: whether auto-annotation was requested via the |
| <code>--auto=yes</code> option. In this case yes (the header shows <code>on</code>).</li><p> |
| </ul> |
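| <p> |
| The sort order and threshold described above can be sketched in Python as |
| follows. The data is made up, the function names are hypothetical, and the |
| cut-off logic is a guess at the behaviour described here rather than |
| cg_annotate's actual implementation. |

```python
def sort_functions(rows, sort_events):
    """Order rows from highest to lowest count, comparing event by event."""
    return sorted(rows, key=lambda row: tuple(-row[1][e] for e in sort_events))


def apply_threshold(sorted_rows, event, percent):
    """Keep rows until they cover `percent` of the total count for `event`."""
    total = sum(counts[event] for _, counts in sorted_rows)
    shown, covered = [], 0
    for name, counts in sorted_rows:
        if covered * 100 >= percent * total:
            break
        shown.append(name)
        covered += counts[event]
    return shown


if __name__ == "__main__":
    # Hypothetical counts; two functions tie on Ir and differ on I1mr.
    rows = [
        ("b.c:g", {"Ir": 45, "I1mr": 0}),
        ("a.c:f", {"Ir": 900, "I1mr": 2}),
        ("d.c:i", {"Ir": 10, "I1mr": 0}),
        ("c.c:h", {"Ir": 45, "I1mr": 3}),
    ]
    ordered = sort_functions(rows, ["Ir", "I1mr"])
    print(apply_threshold(ordered, "Ir", 99))
```

| The tie on <code>Ir</code> is broken by <code>I1mr</code>, and the last |
| function is omitted because the first three already cover 99% of the |
| <code>Ir</code> count. |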
| |
| Then follow summary statistics for the whole program. These are similar |
| to the summary provided when running <code>valgrind --tool=cachegrind</code>.<p> |
| |
| Then follow function-by-function statistics. Each function is |
| identified by a <code>file_name:function_name</code> pair. If a column |
| contains only a dot it means the function never performs |
| that event (eg. the third row shows that <code>strcmp()</code> |
| contains no instructions that write to memory). The name |
| <code>???</code> is used if the file name and/or function name |
| could not be determined from debugging information. If most of the |
| entries have the form <code>???:???</code> the program probably wasn't |
| compiled with <code>-g</code>. If any code was invalidated (either due to |
| self-modifying code or unloading of shared objects) its counts are aggregated |
| into a single cost centre written as <code>(discarded):(discarded)</code>.<p> |
| |
| It is worth noting that functions will come from three types of source files: |
| <ol> |
| <li> From the profiled program (<code>concord.c</code> in this example).</li> |
| <li>From libraries (eg. <code>getc.c</code>)</li> |
| <li>From Valgrind's implementation of some libc functions (eg. |
| <code>vg_clientmalloc.c:malloc</code>). These are recognisable because |
| the filename begins with <code>vg_</code>, and is probably one of |
| <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or |
| <code>vg_mylibc.c</code>. |
| </li> |
| </ol> |
| |
| There are two ways to annotate source files -- by choosing them |
| manually, or with the <code>--auto=yes</code> option. To do it |
| manually, just specify the filenames as arguments to |
| <code>cg_annotate</code>. For example, running |
| <code>cg_annotate concord.c</code> for our example produces the same |
| output as above followed by an annotated version of |
| <code>concord.c</code>, a section of which looks like: |
| |
| <pre> |
| -------------------------------------------------------------------------------- |
| -- User-annotated source: concord.c |
| -------------------------------------------------------------------------------- |
| Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| |
| [snip] |
| |
| . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[]) |
| 3 1 1 . . . 1 0 0 { |
| . . . . . . . . . FILE *file_ptr; |
| . . . . . . . . . Word_Info *data; |
| 1 0 0 . . . 1 1 1 int line = 1, i; |
| . . . . . . . . . |
| 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info)); |
| . . . . . . . . . |
| 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++) |
| 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL; |
| . . . . . . . . . |
| . . . . . . . . . /* Open file, check it. */ |
| 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r"); |
| 2 0 0 1 0 0 . . . if (!(file_ptr)) { |
| . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name); |
| 1 1 1 . . . . . . exit(EXIT_FAILURE); |
| . . . . . . . . . } |
| . . . . . . . . . |
| 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF) |
| 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table); |
| . . . . . . . . . |
| 4 0 0 1 0 0 2 0 0 free(data); |
| 4 0 0 1 0 0 2 0 0 fclose(file_ptr); |
| 3 0 0 2 0 0 . . . } |
| </pre> |
| |
| (Although column widths are automatically minimised, a wide terminal is clearly |
| useful.)<p> |
| |
| Each source file is clearly marked (<code>User-annotated source</code>) as |
| having been chosen manually for annotation. If the file was found in one of |
| the directories specified with the <code>-I</code>/<code>--include</code> |
| option, the directory and file are both given.<p> |
| |
| Each line is annotated with its event counts. Events not applicable for a line |
| are represented by a `.'; this is useful for distinguishing between an event |
| which cannot happen, and one which can but did not.<p> |
| |
| Sometimes only a small section of a source file is executed. To minimise |
| uninteresting output, Valgrind only shows annotated lines and lines within a |
| small distance of annotated lines. Gaps are marked with the line numbers so |
| you know which part of a file the shown code comes from, eg: |
| |
| <pre> |
| (figures and code for line 704) |
| -- line 704 ---------------------------------------- |
| -- line 878 ---------------------------------------- |
| (figures and code for line 878) |
| </pre> |
| |
| The amount of context to show around annotated lines is controlled by the |
| <code>--context</code> option.<p> |
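| <p> |
| One way to picture the effect of the context setting is as a merge of |
| windows around each annotated line. The Python sketch below is an |
| illustration under that assumption; the function name is hypothetical and |
| cg_annotate may choose its gaps differently. |

```python
def visible_ranges(annotated_lines, context, last_line):
    """Merge [line - context, line + context] windows into printed ranges."""
    ranges = []
    for line in sorted(annotated_lines):
        start = max(1, line - context)
        end = min(last_line, line + context)
        if ranges and start <= ranges[-1][1] + 1:
            # Window overlaps or touches the previous one: extend it.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    # Hypothetical file of 300 lines with three annotated lines.
    print(visible_ranges([10, 14, 200], 8, 300))
```

| Lines 10 and 14 end up in one printed range, while line 200 gets its own, |
| with a marked gap between the two. |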
| |
| To get automatic annotation, run <code>cg_annotate --auto=yes</code>. |
| cg_annotate will automatically annotate every source file it can find that is |
| mentioned in the function-by-function summary. Therefore, the files chosen for |
| auto-annotation are affected by the <code>--sort</code> and |
| <code>--threshold</code> options. Each source file is clearly marked |
| (<code>Auto-annotated source</code>) as being chosen automatically. Any files |
| that could not be found are mentioned at the end of the output, eg: |
| |
| <pre> |
| -------------------------------------------------------------------------------- |
| The following files chosen for auto-annotation could not be found: |
| -------------------------------------------------------------------------------- |
| getc.c |
| ctype.c |
| ../sysdeps/generic/lockfile.c |
| </pre> |
| |
| This is quite common for library files, since libraries are usually compiled |
| with debugging information, but the source files are often not present on a |
| system. If a file is chosen for annotation <b>both</b> manually and |
| automatically, it is marked as <code>User-annotated source</code>. |
| |
| Use the <code>-I/--include</code> option to tell Valgrind where to look for |
| source files if the filenames found from the debugging information aren't |
| specific enough. |
| |
| Beware that cg_annotate can take some time to digest large |
| <code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also |
| beware that auto-annotation can produce a lot of output if your program is |
| large! |
| |
| |
| <h3>4.8 Annotating assembler programs</h3> |
| |
| Valgrind can annotate assembler programs too, or annotate the |
| assembler generated for your C program. Sometimes this is useful for |
| understanding what is really happening when an interesting line of C |
| code is translated into multiple instructions.<p> |
| |
| To do this, you just need to assemble your <code>.s</code> files with |
| assembler-level debug information. gcc doesn't do this, but you can |
| use the GNU assembler with the <code>--gstabs</code> option to |
| generate object files with this information, eg: |
| |
| <blockquote><code>as --gstabs foo.s</code></blockquote> |
| |
| You can then profile and annotate source files in the same way as for C/C++ |
| programs. |
| |
| |
| <h3>4.9 <code>cg_annotate</code> options</h3> |
| <ul> |
| <li><code>--<i>pid</i></code><p> |
| |
| Indicates which <code>cachegrind.out.<i>pid</i></code> file to read. |
| Not actually an option -- it is required.</li><p> |
| |
| <li><code>-h, --help</code></li><p> |
| <li><code>-v, --version</code><p> |
| |
| Help and version, as usual.</li> |
| |
| <li><code>--sort=A,B,C</code> [default: order in |
| <code>cachegrind.out.<i>pid</i></code>]<p> |
| Specifies the events upon which the sorting of the function-by-function |
| entries will be based. Useful if you want to concentrate on eg. I cache |
| misses (<code>--sort=I1mr,I2mr</code>), or D cache misses |
| (<code>--sort=D1mr,D2mr</code>), or L2 misses |
| (<code>--sort=D2mr,I2mr</code>).</li><p> |
| |
| <li><code>--show=A,B,C</code> [default: all, using order in |
| <code>cachegrind.out.<i>pid</i></code>]<p> |
| Specifies which events to show (and the column order). Default is to use |
| all present in the <code>cachegrind.out.<i>pid</i></code> file (and use |
| the order in the file).</li><p> |
| |
| <li><code>--threshold=X</code> [default: 99%] <p> |
| Sets the threshold for the function-by-function summary. Functions are |
| shown that account for more than X% of the primary sort event. If |
| auto-annotating, also affects which files are annotated. |
| |
| Note: thresholds can be set for more than one of the events by following |
| any event in the <code>--sort</code> option with a colon and a number |
| (no spaces, though). E.g. if you want to see the functions that cover |
| 99% of L2 read misses and 99% of L2 write misses, use this option: |
| |
| <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote> |
| </li><p> |
| |
| <li><code>--auto=no</code> [default]<br> |
| <code>--auto=yes</code> <p> |
| When enabled, automatically annotates every file that is mentioned in the |
| function-by-function summary that can be found. Also gives a list of |
| those that couldn't be found. |
| |
| <li><code>--context=N</code> [default: 8]<p> |
| Print N lines of context before and after each annotated line. Avoids |
| printing large sections of source files that were not executed. Use a |
| large number (eg. 10,000) to show all source lines. |
| </li><p> |
| |
| <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code> |
| [default: empty string]<p> |
| Adds a directory to the list in which to search for files. Multiple |
| -I/--include options can be given to add multiple directories. |
| </ul> |
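| <p> |
| The <code>event:number</code> syntax accepted by <code>--sort</code> can be |
| illustrated with a short Python sketch. The function name is hypothetical, |
| and treating the threshold as a whole number is an assumption. |

```python
def parse_sort_option(value, default_threshold=None):
    """Parse e.g. 'D2mr:99,D2mw:99' into [(event, threshold), ...]."""
    result = []
    for item in value.split(","):
        event, _, threshold = item.partition(":")
        # Events without ':number' fall back to the default threshold.
        result.append((event, int(threshold) if threshold else default_threshold))
    return result


if __name__ == "__main__":
    print(parse_sort_option("D2mr:99,D2mw:99"))
    print(parse_sort_option("I1mr,I2mr"))
```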
| |
| |
| <h3>4.10 Warnings</h3> |
| There are a couple of situations in which cg_annotate issues warnings. |
| |
| <ul> |
| <li>If a source file is more recent than the |
| <code>cachegrind.out.<i>pid</i></code> file. This is because the |
| information in <code>cachegrind.out.<i>pid</i></code> is only recorded |
| with line numbers, so if the line numbers change at all in the source |
| (eg. lines added, deleted, swapped), any annotations will be |
| incorrect.<p> |
| |
| <li>If information is recorded about line numbers past the end of a file. |
| This can be caused by the above problem, ie. shortening the source file |
| while using an old <code>cachegrind.out.<i>pid</i></code> file. If this |
| happens, the figures for the bogus lines are printed anyway (clearly |
| marked as bogus) in case they are important.</li><p> |
| </ul> |
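| <p> |
| The first warning amounts to a comparison of modification times. A minimal |
| Python sketch of such a check follows; the function name is hypothetical and |
| this is not cg_annotate's own code. |

```python
import os


def source_is_newer(source_path, output_path):
    """True if the source file was modified after the profile was written."""
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3 and source_is_newer(sys.argv[1], sys.argv[2]):
        print("warning: %s is more recent than %s" % (sys.argv[1], sys.argv[2]))
```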
| |
| |
| <h3>4.11 Things to watch out for</h3> |
| Some odd things that can occur during annotation: |
| |
| <ul> |
| <li>If annotating at the assembler level, you might see something like this: |
| |
| <pre> |
| 1 0 0 . . . . . . leal -12(%ebp),%eax |
| 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) |
| 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) |
| . . . . . . . . . .align 4,0x90 |
| 1 0 0 . . . . . . movl $.LnrB,%eax |
| 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) |
| </pre> |
| |
| How can the third instruction be executed twice when the others are |
| executed only once? As it turns out, it isn't. Here's a dump of the |
| executable, using <code>objdump -d</code>: |
| |
| <pre> |
| 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax |
| 8048f28: 89 43 54 mov %eax,0x54(%ebx) |
| 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) |
| 8048f32: 89 f6 mov %esi,%esi |
| 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax |
| 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) |
| </pre> |
| |
| Notice the extra <code>mov %esi,%esi</code> instruction. Where did this |
| come from? The GNU assembler inserted it to serve as the two bytes of |
| padding needed to align the <code>movl $.LnrB,%eax</code> instruction on |
| a four-byte boundary, but pretended it didn't exist when adding debug |
| information. Thus when Valgrind reads the debug info it thinks that the |
| <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address |
| range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the |
| <code>mov %esi,%esi</code> to it.<p> |
| </li> |
| |
| <li>Inlined functions can cause strange results in the function-by-function |
| summary. If a function <code>inline_me()</code> is defined in |
| <code>foo.h</code> and inlined in the functions <code>f1()</code>, |
| <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will |
| not be a <code>foo.h:inline_me()</code> function entry. Instead, there |
| will be separate function entries for each inlining site, ie. |
| <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and |
| <code>foo.h:f3()</code>. To find the total counts for |
| <code>foo.h:inline_me()</code>, add up the counts from each entry.<p> |
| |
| The reason for this is that although the debug info output by gcc |
| indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it |
| doesn't indicate the name of the function in <code>foo.h</code>, so |
| Valgrind keeps using the old one.<p> |
| |
| <li>Sometimes, the same filename might be represented with a relative name |
| and with an absolute name in different parts of the debug info, eg: |
| <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this |
| case, if you use auto-annotation, the file will be annotated twice with |
| the counts split between the two.<p> |
| </li> |
| |
| <li>Files with more than 65,535 lines cause difficulties for the stabs debug |
| info reader. This is because the line number in the <code>struct |
| nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit |
| value. Valgrind can handle some files with more than 65,535 lines |
| correctly by making some guesses to identify line number overflows. But |
| some cases are beyond it, in which case you'll get a warning message |
| explaining that annotations for the file might be incorrect.<p> |
| </li> |
| |
| <li>If you compile some files with <code>-g</code> and some without, some |
| events that take place in a file without debug info could be attributed |
| to the last line of a file with debug info (whichever one gets placed |
| before the non-debug-info file in the executable).<p> |
| </li> |
| </ul> |
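| <p> |
| The 65,535-line limit mentioned above is easy to picture. The Python sketch |
| below shows what happens to a line number when only 16 bits are kept, |
| assuming the field is treated as unsigned; the names are hypothetical. |

```python
MAX_STABS_LINE = 0xFFFF  # largest value a 16-bit field can hold: 65,535


def recorded_line(actual_line):
    """Line number as it would appear after truncation to 16 bits."""
    return actual_line & MAX_STABS_LINE


if __name__ == "__main__":
    for line in (65535, 65536, 70000):
        print(line, "->", recorded_line(line))
```

| Line 70,000 is recorded as 4,464, the same value as the real line 4,464, |
| which is why the reader has to guess where overflows occurred. |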
| |
| This list looks long, but these cases should be fairly rare.<p> |
| |
| Note: stabs is not an easy format to read. If you come across bizarre |
| annotations that look like they might be caused by a bug in the stabs reader, |
| please let us know.<p> |
| |
| |
| <h3>4.12 Accuracy</h3> |
| Valgrind's cache profiling has a number of shortcomings: |
| |
| <ul> |
| <li>It doesn't account for kernel activity -- the effect of system calls on |
| the cache contents is ignored.</li><p> |
| |
| <li>It doesn't account for other process activity (although this is probably |
| desirable when considering a single program).</li><p> |
| |
| <li>It doesn't account for virtual-to-physical address mappings; hence the |
| entire simulation is not a true representation of what's happening in the |
| cache.</li><p> |
| |
| <li>It doesn't account for cache misses not visible at the instruction level, |
| eg. those arising from TLB misses, or speculative execution.</li><p> |
| |
| <li>Valgrind's custom threads implementation will schedule threads |
| differently to the standard one. This could warp the results for |
| threaded programs. |
| </li><p> |
| |
| <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code> |
| will incorrectly be counted as doing a data read if both the arguments |
| are registers, eg: |
| |
| <blockquote><code>btsl %eax, %edx</code></blockquote> |
| |
| This should only happen rarely. |
| </li><p> |
| |
| <li>FPU instructions with data sizes of 28 and 108 bytes (e.g. |
| <code>fsave</code>) are treated as though they only access 16 bytes. |
| These instructions seem to be rare so hopefully this won't affect |
| accuracy much. |
| </li><p> |
| </ul> |
| |
| Another thing worth noting is that results are very sensitive. Changing the |
| size of the <code>valgrind.so</code> file, the size of the program being |
| profiled, or even the length of its name can perturb the results. Variations |
| will be small, but don't expect perfectly repeatable results if your program |
| changes at all.<p> |
| |
| While these factors mean you shouldn't trust the results to be super-accurate, |
| hopefully they should be close enough to be useful.<p> |
| |
| |
| <h3>4.13 Todo</h3> |
| <ul> |
| <li>Program start-up/shut-down calls a lot of functions that aren't |
| interesting and just complicate the output. Would be nice to exclude |
| these somehow.</li> |
| <p> |
| </ul> |
| </body> |
| </html> |
| |