<html>
 <head>
  <title>Cachegrind: a cache-miss profiler</title>
 </head>

<body>
<a name="cg-top"></a>
<h2>4&nbsp; <b>Cachegrind</b>: a cache-miss profiler</h2>

To use this skin, you must specify <code>--skin=cachegrind</code>
on the Valgrind command line.

<p>
Detailed technical documentation on how Cachegrind works is available
<A HREF="cg_techdocs.html">here</A>.  If you want to know how
to <b>use</b> it, you only need to read this page.


<a name="cache"></a>
<h3>4.1&nbsp; Cache profiling</h3>
Cachegrind is a tool for doing cache simulations and annotating your source
line-by-line with the number of cache misses.  In particular, it records:
<ul>
  <li>L1 instruction cache reads and misses;
  <li>L1 data cache reads and read misses, writes and write misses;
  <li>L2 unified cache reads and read misses, writes and write misses.
</ul>
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
and an L2 miss can cost as much as 200 cycles.  Detailed cache profiling can be
very useful for improving the performance of your program.<p>

Also, since one instruction cache read is performed per instruction executed,
you can find out how many instructions are executed per line, which can be
useful for traditional profiling and test coverage.<p>

Any feedback, bug-fixes, suggestions, etc., are welcome.


<h3>4.2&nbsp; Overview</h3>
First off, as for normal Valgrind use, you probably want to compile with
debugging info (the <code>-g</code> flag).  But by contrast with normal
Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
should profile your program as it will normally be run.

The two steps are:
<ol>
  <li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
      the normal command line invocation.  When the program finishes,
      Cachegrind will print summary cache statistics.  It also collects
      line-by-line information in a file
      <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
      is the program's process id.
      <p>
      This step should be done every time you want to collect
      information about a new program, a changed program, or about the
      same program with different input.
  </li>
  <p>
  <li>Generate a function-by-function summary, and possibly annotate
      source files, using the supplied
      <code>cg_annotate</code> program.  Source files to annotate can be
      specified manually on the command line, or
      "interesting" source files can be annotated automatically with
      the <code>--auto=yes</code> option.  You can annotate C/C++
      files or assembly language files equally easily.
      <p>
      This step can be performed as many times as you like for each
      Step 1.  You may want to do multiple annotations showing
      different information each time.<p>
  </li>
</ol>

The steps are described in detail in the following sections.<p>
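<p>
As a rough illustration of the two steps, here is a minimal Python sketch that
only builds the two command lines (it does not run them).  The helper names
<code>profile_command</code> and <code>annotate_command</code> are made up for
this example, and the position of the file names after the options is an
assumption; the flags themselves are the ones described on this page.

```python
def profile_command(program_argv):
    """Step 1: put the Cachegrind invocation in front of the normal command line."""
    return ["valgrind", "--skin=cachegrind"] + list(program_argv)


def annotate_command(pid, source_files=(), auto=False):
    """Step 2: build a cg_annotate command line for cachegrind.out.<pid>."""
    argv = ["cg_annotate", f"--{pid}"]
    if auto:
        argv.append("--auto=yes")    # annotate "interesting" files automatically
    argv.extend(source_files)        # or name the files to annotate by hand
    return argv


step1 = profile_command(["ls", "-l"])
step2 = annotate_command(31751, ["concord.c"])
```

Step 2 can be repeated with different arguments against the same output file
from Step 1.<p>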


<h3>4.3&nbsp; Cache simulation specifics</h3>

Cachegrind uses a simulation for a machine with a split L1 cache and a unified
L2 cache.  This configuration is used for all (modern) x86-based machines we
are aware of.  Old Cyrix CPUs had a unified I and D L1 cache, but they are
ancient history now.<p>

The more specific characteristics of the simulation are as follows.

<ul>
  <li>Write-allocate: when a write miss occurs, the block written to
      is brought into the D1 cache.  Most modern caches have this
      property.</li><p>

  <li>Bit-selection hash function: the set of line(s) in the cache to which a
      memory block maps is chosen by the middle bits M--(M+N-1) of the
      byte address, where:
      <ul>
        <li>&nbsp;line size = 2^M bytes&nbsp;</li>
        <li>(cache size / line size / associativity) = 2^N sets</li>
      </ul> </li><p>

  <li>Inclusive L2 cache: the L2 cache replicates all the entries of
      the L1 cache.  This is standard on Pentium chips, but AMD
      Athlons use an exclusive L2 cache that only holds blocks evicted
      from L1.  Ditto AMD Durons and most modern VIAs.</li><p>
</ul>
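<p>
The bit-selection rule can be sketched in a few lines of Python.  This is an
illustration only, not Cachegrind's code: the function name
<code>set_index</code> is made up, and it assumes the number of sets is
cache size / (line size * associativity), with both the line size and the
number of sets being powers of two.

```python
def set_index(address, cache_size, line_size, associativity):
    """Return the cache set selected by the middle bits of a byte address."""
    num_sets = cache_size // (line_size * associativity)
    # Plain bit selection only works when both quantities are powers of two.
    assert line_size & (line_size - 1) == 0
    assert num_sets & (num_sets - 1) == 0
    m = line_size.bit_length() - 1   # line size = 2^M bytes
    n = num_sets.bit_length() - 1    # number of sets = 2^N
    return (address >> m) & ((1 << n) - 1)   # bits M..(M+N-1)


# A 65536-byte, 2-way cache with 64-byte lines has 512 sets (M = 6, N = 9).
index_a = set_index(0x8048F2B, 65536, 64, 2)
# An address exactly 512 lines further on maps to the same set.
index_b = set_index(0x8048F2B + 512 * 64, 65536, 64, 2)
```

Two addresses that share a set index compete for the same lines, which is why
strides that are a multiple of (number of sets * line size) tend to miss.<p>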

The cache configuration simulated (cache size, associativity and line size) is
determined automagically using the CPUID instruction.  If you have an old
machine that (a) doesn't support the CPUID instruction, or (b) supports it in
an early incarnation that doesn't give any cache information, then Cachegrind
will fall back to using a default configuration (that of a model 3/4 Athlon).
Cachegrind will tell you if this happens.  You can manually specify one, two or
all three levels (I1/D1/L2) of the cache from the command line using the
<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.<p>

Other noteworthy behaviour:

<ul>
  <li>References that straddle two cache lines are treated as follows:
  <ul>
    <li>If both blocks hit --&gt; counted as one hit</li>
    <li>If one block hits, the other misses --&gt; counted as one miss</li>
    <li>If both blocks miss --&gt; counted as one miss (not two)</li>
  </ul><p></li>

  <li>Instructions that modify a memory location (eg. <code>inc</code> and
      <code>dec</code>) are counted as doing just a read, ie. a single data
      reference.  This may seem strange, but since the write can never cause a
      miss (the read guarantees the block is in the cache) it's not very
      interesting.<p>

      Thus it measures not the number of times the data cache is accessed, but
      the number of times a data cache miss could occur.<p>
      </li>
</ul>
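<p>
The rule for straddling references can be sketched as follows.  This is a
simplified Python illustration, not the simulator's code:
<code>count_reference</code> and the <code>is_cached</code> callback are
hypothetical names, and the sketch ignores how the cache contents are updated
after the access.

```python
def count_reference(address, size, line_size, is_cached):
    """Classify one memory reference as a single 'hit' or a single 'miss'.

    is_cached is a hypothetical callback: given a line number, it reports
    whether that line is currently in the simulated cache.
    """
    first_line = address // line_size
    last_line = (address + size - 1) // line_size
    # Look at every line the reference touches, then collapse to one event.
    results = [is_cached(line) for line in range(first_line, last_line + 1)]
    return "hit" if all(results) else "miss"


cached = {10}  # only line 10 is in the cache

# A 4-byte access starting 2 bytes before the end of line 10 touches lines 10 and 11.
straddle = count_reference(10 * 64 + 62, 4, 64, lambda line: line in cached)
# A 4-byte access wholly inside line 10.
inside = count_reference(10 * 64 + 8, 4, 64, lambda line: line in cached)
```

However many lines are touched, the reference contributes exactly one hit or
one miss to the counts.<p>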

If you are interested in simulating a cache with different properties, it is
not particularly hard to write your own cache simulator, or to modify the
existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>.  We'd be
interested to hear from anyone who does.


<a name="profile"></a>
<h3>4.4&nbsp; Profiling programs</h3>

Cache profiling is enabled by using the <code>--skin=cachegrind</code>
option to the <code>valgrind</code> shell script.  To gather cache profiling
information about the program <code>ls -l</code>, type:

<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>

The program will execute (slowly).  Upon completion, summary statistics
that look like this will be printed:

<pre>
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751==
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
</pre>

Cache accesses for instruction fetches are summarised first, giving the
number of fetches made (this is the number of instructions executed, which
can be useful to know in its own right), the number of I1 misses, and the
number of L2 instruction (<code>L2i</code>) misses.<p>

Cache accesses for data follow.  The information is similar to that of the
instruction fetches, except that the values are also shown split between reads
and writes (note each row's <code>rd</code> and <code>wr</code> values add up
to the row's total).<p>

Combined instruction and data figures for the L2 cache follow that.<p>
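<p>
To make the relationship between the rows concrete, here is a small Python
sketch that recomputes the D1 figures from the sample summary above.  The
<code>miss_rate</code> helper is made up for this example; only the numbers
come from the sample output.

```python
def miss_rate(misses, refs):
    """Return misses as a percentage of references (0.0 when there are none)."""
    return 100.0 * misses / refs if refs else 0.0


# Figures copied from the sample summary above.
d_refs = {"rd": 10_955_517, "wr": 4_474_773}
d1_misses = {"rd": 21_905, "wr": 19_280}

# Each row's rd and wr values add up to the row's total.
total_refs = d_refs["rd"] + d_refs["wr"]          # the "D refs" row
total_misses = d1_misses["rd"] + d1_misses["wr"]  # the "D1 misses" row

d1_rate = miss_rate(total_misses, total_refs)
d1_write_rate = miss_rate(d1_misses["wr"], d_refs["wr"])
```

The computed overall rate is a little under 0.27%, while the sample prints
0.2%, so the printed percentages appear to be cut off after one decimal place
rather than rounded.<p>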


<h3>4.5&nbsp; Output file</h3>

As well as printing summary information, Cachegrind also writes
line-by-line cache profiling information to a file named
<code>cachegrind.out.<i>pid</i></code>.  This file is human-readable, but is
best interpreted by the accompanying program <code>cg_annotate</code>,
described in the next section.
<p>
Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
<ul>
  <li>It is written every time <code>valgrind --skin=cachegrind</code>
      is run, and will overwrite any existing
      <code>cachegrind.out.<i>pid</i></code> in the current directory (but
      that won't happen very often because it takes some time for process ids
      to be recycled).</li>
  <p>
  <li>It can be huge: <code>ls -l</code> generates a file of about
      350 KB.  Browsing a few files and web pages with a Konqueror
      built with full debugging information generates a file
      of around 15 MB.</li>
</ul>

Note that older versions of Cachegrind used a log file named
<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
The suffix serves two purposes.  Firstly, it means you don't have to
rename old log files that you don't want to overwrite.  Secondly, and
more importantly, it allows correct profiling with the
<code>--trace-children=yes</code> option of programs that spawn child
processes.
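<p>
Since several output files can accumulate in one directory, it can be handy
to list them by process id.  The following Python sketch does that; the
function name <code>find_profiles</code> is made up for this example, and it
relies only on the <code>cachegrind.out.<i>pid</i></code> naming scheme
described above.

```python
import glob
import os


def find_profiles(directory="."):
    """Map process id -> path for every cachegrind.out.<pid> file in directory."""
    profiles = {}
    for path in glob.glob(os.path.join(directory, "cachegrind.out.*")):
        suffix = os.path.basename(path).rsplit(".", 1)[1]
        if suffix.isdigit():          # skip things like cachegrind.out.bak
            profiles[int(suffix)] = path
    return profiles
```

An old-style <code>cachegrind.out</code> file with no suffix is deliberately
not matched.<p>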


<a name="profileflags"></a>
<h3>4.6&nbsp; Cachegrind options</h3>

Cache-simulation specific options are:

<ul>
  <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
      <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
      <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
      [default: uses CPUID for automagic cache configuration]<p>

      Manually specifies the I1/D1/L2 cache configuration, where
      <code>size</code> and <code>line_size</code> are measured in bytes.  The
      three items must be comma-separated, but with no spaces, eg:

      <blockquote>
      <code>valgrind --skin=cachegrind --I1=65536,2,64</code>
      </blockquote>

      You can specify one, two or three of the I1/D1/L2 caches.  Any level not
      manually specified will be simulated using the configuration found in the
      normal way (via the CPUID instruction, or failing that, via defaults).</li>
</ul>
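<p>
The option format is easy to get wrong by hand, so here is a small Python
sketch for building and checking such strings.  Both helper names are made up
for this example, and the checks reflect only the format described above
(three comma-separated integers, no spaces), not whatever validation Valgrind
itself performs.

```python
def parse_cache_spec(spec):
    """Parse a '<size>,<associativity>,<line_size>' string into three ints."""
    if " " in spec:
        raise ValueError("no spaces are allowed in a cache specification")
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected exactly three comma-separated items")
    size, associativity, line_size = (int(p) for p in parts)
    return size, associativity, line_size


def cache_option(level, size, associativity, line_size):
    """Format an option such as --I1=65536,2,64 for level 'I1', 'D1' or 'L2'."""
    if level not in ("I1", "D1", "L2"):
        raise ValueError("level must be I1, D1 or L2")
    return f"--{level}={size},{associativity},{line_size}"


i1 = parse_cache_spec("65536,2,64")
l2_option = cache_option("L2", 262144, 8, 64)
```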


<a name="annotate"></a>
<h3>4.7&nbsp; Annotating C/C++ programs</h3>

Before using <code>cg_annotate</code>, it is worth widening your
window to be at least 120 characters wide if possible, as the output
lines can be quite long.
<p>
To get a function-by-function summary, run <code>cg_annotate
--<i>pid</i></code> in a directory containing a
<code>cachegrind.out.<i>pid</i></code> file.  The <code>--<i>pid</i></code>
is required so that <code>cg_annotate</code> knows which log file to use when
several are present.
<p>
The output looks like this:

<pre>
--------------------------------------------------------------------------------
I1 cache:              65536 B, 64 B, 2-way associative
D1 cache:              65536 B, 64 B, 2-way associative
L2 cache:              262144 B, 64 B, 8-way associative
Command:               concord vg_to_ucode.c
Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold:             99%
Chosen for annotation:
Auto-annotation:       on

--------------------------------------------------------------------------------
        Ir I1mr I2mr         Dr   D1mr  D2mr        Dw   D1mw   D2mw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Ir I1mr I2mr         Dr   D1mr  D2mr        Dw   D1mw   D2mw  file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0  getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1  concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      .  vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0  concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0  ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0  concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1  ???:???
   598,068    1    1    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0  vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957  concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53  concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1  concord.c:create
   149,518    0    0    149,516      0     0         1      0      0  ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0  ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150  concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
</pre>

First up is a summary of the annotation options:

<ul>
  <li>I1 cache, D1 cache, L2 cache: cache configuration.  So you know the
      configuration with which these results were obtained.</li><p>

  <li>Command: the command line invocation of the program under
      examination.</li><p>

  <li>Events recorded: event abbreviations are:<p>
  <ul>
    <li><code>Ir  </code>:  I cache reads (ie. instructions executed)</li>
    <li><code>I1mr</code>: I1 cache read misses</li>
    <li><code>I2mr</code>: L2 cache instruction read misses</li>
    <li><code>Dr  </code>:  D cache reads (ie. memory reads)</li>
    <li><code>D1mr</code>: D1 cache read misses</li>
    <li><code>D2mr</code>: L2 cache data read misses</li>
    <li><code>Dw  </code>:  D cache writes (ie. memory writes)</li>
    <li><code>D1mw</code>: D1 cache write misses</li>
    <li><code>D2mw</code>: L2 cache data write misses</li>
  </ul><p>
      Note that D1 total misses is given by <code>D1mr</code> +
      <code>D1mw</code>, and that L2 total misses is given by
      <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>

  <li>Events shown: the events shown (a subset of events gathered).  This can
      be adjusted with the <code>--show</code> option.</li><p>

  <li>Event sort order: the sort order in which functions are shown.  For
      example, in this case the functions are sorted from highest
      <code>Ir</code> counts to lowest.  If two functions have identical
      <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
      counts, and so on.  This order can be adjusted with the
      <code>--sort</code> option.<p>

      Note that this dictates the order the functions appear.  It is <b>not</b>
      the order in which the columns appear;  that is dictated by the "events
      shown" line (and can be changed with the <code>--show</code> option).
      </li><p>

  <li>Threshold: <code>cg_annotate</code> by default omits functions
      that cause very low numbers of misses to avoid drowning you in
      information.  In this case, cg_annotate shows summaries for the
      functions that account for 99% of the <code>Ir</code> counts;
      <code>Ir</code> is chosen as the threshold event since it is the
      primary sort event.  The threshold can be adjusted with the
      <code>--threshold</code> option.</li><p>

  <li>Chosen for annotation: names of files specified manually for annotation;
      in this case none.</li><p>

  <li>Auto-annotation: whether auto-annotation was requested via the
      <code>--auto=yes</code> option.  In this case yes.</li><p>
</ul>
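<p>
The event abbreviations can be combined directly.  As a check, this Python
sketch parses the <code>PROGRAM TOTALS</code> row from the sample output and
adds up the miss events; the helper name <code>parse_counts</code> is made up
for this example, and treating a <code>.</code> as zero is a simplification
for summing.

```python
EVENTS = ["Ir", "I1mr", "I2mr", "Dr", "D1mr", "D2mr", "Dw", "D1mw", "D2mw"]


def parse_counts(row):
    """Turn a row of comma-grouped counts into an {event: int} dict.

    A '.' (event not applicable) is treated as 0 here for summing purposes.
    """
    fields = row.split()[:len(EVENTS)]
    values = [0 if f == "." else int(f.replace(",", "")) for f in fields]
    return dict(zip(EVENTS, values))


totals = parse_counts(
    "27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS"
)
d1_misses = totals["D1mr"] + totals["D1mw"]
l2_misses = totals["I2mr"] + totals["D2mr"] + totals["D2mw"]
```

The two sums agree with the <code>D1 misses</code> and combined
<code>L2 misses</code> figures in the summary printed by
<code>valgrind --skin=cachegrind</code> for the same run.<p>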

Then follow summary statistics for the whole program.  These are similar
to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>

Then follow function-by-function statistics.  Each function is
identified by a <code>file_name:function_name</code> pair.  If a column
contains only a dot it means the function never performs
that event (eg. the third row shows that <code>strcmp()</code>
contains no instructions that write to memory).  The name
<code>???</code> is used if the file name and/or function name
could not be determined from debugging information.  If most of the
entries have the form <code>???:???</code> the program probably wasn't
compiled with <code>-g</code>.  If any code was invalidated (either due to
self-modifying code or unloading of shared objects) its counts are aggregated
into a single cost centre written as <code>(discarded):(discarded)</code>.<p>

It is worth noting that functions will come from three types of source files:
<ol>
  <li>From the profiled program (<code>concord.c</code> in this example).</li>
  <li>From libraries (eg. <code>getc.c</code>).</li>
  <li>From Valgrind's implementation of some libc functions (eg.
      <code>vg_clientmalloc.c:malloc</code>).  These are recognisable because
      the filename begins with <code>vg_</code>, and is probably one of
      <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
      <code>vg_mylibc.c</code>.
  </li>
</ol>

There are two ways to annotate source files -- by choosing them
manually, or with the <code>--auto=yes</code> option.  To do it
manually, just specify the filenames as arguments to
<code>cg_annotate</code>.  For example, the output from running
<code>cg_annotate concord.c</code> for our example produces the same
output as above followed by an annotated version of
<code>concord.c</code>, a section of which looks like:

<pre>
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
     Ir I1mr I2mr     Dr D1mr D2mr     Dw D1mw D2mw

[snip]

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i &lt; TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data->word, data->line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }
</pre>

(Although column widths are automatically minimised, a wide terminal is clearly
useful.)<p>

Each source file is clearly marked (<code>User-annotated source</code>) as
having been chosen manually for annotation.  If the file was found in one of
the directories specified with the <code>-I</code>/<code>--include</code>
option, the directory and file are both given.<p>

Each line is annotated with its event counts.  Events not applicable for a line
are represented by a `.';  this is useful for distinguishing between an event
which cannot happen, and one which can but did not.<p>

Sometimes only a small section of a source file is executed.  To minimise
uninteresting output, Valgrind only shows annotated lines and lines within a
small distance of annotated lines.  Gaps are marked with the line numbers so
you know which part of a file the shown code comes from, eg:

<pre>
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)
</pre>

The amount of context to show around annotated lines is controlled by the
<code>--context</code> option.<p>
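<p>
One way to picture the effect of <code>--context</code> is as a set of line
ranges around the annotated lines, with overlapping ranges merged.  The Python
sketch below illustrates that idea; it is not cg_annotate's actual algorithm,
and the function name <code>lines_to_show</code> is made up.

```python
def lines_to_show(annotated_lines, context, last_line):
    """Return sorted (first, last) line ranges to print, merging overlaps."""
    ranges = []
    for line in sorted(annotated_lines):
        first = max(1, line - context)
        last = min(last_line, line + context)
        if ranges and first <= ranges[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], last))
        else:
            ranges.append((first, last))
    return ranges


# Lines 700 and 704 are close together; line 878 is far away.
shown = lines_to_show([700, 704, 878], context=8, last_line=1000)
```

Everything between two consecutive ranges is a gap, which is where the
<code>-- line N --</code> markers would appear.<p>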

To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
cg_annotate will automatically annotate every source file it can find that is
mentioned in the function-by-function summary.  Therefore, the files chosen for
auto-annotation are affected by the <code>--sort</code> and
<code>--threshold</code> options.  Each source file is clearly marked
(<code>Auto-annotated source</code>) as being chosen automatically.  Any files
that could not be found are mentioned at the end of the output, eg:

<pre>
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
  getc.c
  ctype.c
  ../sysdeps/generic/lockfile.c
</pre>

This is quite common for library files, since libraries are usually compiled
with debugging information, but the source files are often not present on a
system.  If a file is chosen for annotation <b>both</b> manually and
automatically, it is marked as <code>User-annotated source</code>.<p>

Use the <code>-I/--include</code> option to tell Valgrind where to look for
source files if the filenames found from the debugging information aren't
specific enough.<p>

Beware that cg_annotate can take some time to digest large
<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more.  Also
beware that auto-annotation can produce a lot of output if your program is
large!


<h3>4.8&nbsp; Annotating assembler programs</h3>

Valgrind can annotate assembler programs too, or annotate the
assembler generated for your C program.  Sometimes this is useful for
understanding what is really happening when an interesting line of C
code is translated into multiple instructions.<p>

To do this, you just need to assemble your <code>.s</code> files with
assembler-level debug information.  gcc doesn't do this, but you can
use the GNU assembler with the <code>--gstabs</code> option to
generate object files with this information, eg:

<blockquote><code>as --gstabs foo.s</code></blockquote>

You can then profile and annotate source files in the same way as for C/C++
programs.


<h3>4.9&nbsp; <code>cg_annotate</code> options</h3>
<ul>
  <li><code>--<i>pid</i></code><p>

      Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
      Not actually an option -- it is required.</li><p>

  <li><code>-h, --help</code><br>
      <code>-v, --version</code><p>

      Help and version, as usual.</li><p>

  <li><code>--sort=A,B,C</code> [default: order in
      <code>cachegrind.out.<i>pid</i></code>]<p>
      Specifies the events upon which the sorting of the function-by-function
      entries will be based.  Useful if you want to concentrate on eg. I cache
      misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
      (<code>--sort=D1mr,D2mr</code>), or L2 misses
      (<code>--sort=D2mr,I2mr</code>).</li><p>

  <li><code>--show=A,B,C</code> [default: all, using order in
      <code>cachegrind.out.<i>pid</i></code>]<p>
      Specifies which events to show (and the column order).  Default is to use
      all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
      the order in the file).</li><p>

  <li><code>--threshold=X</code> [default: 99%] <p>
      Sets the threshold for the function-by-function summary.  Functions are
      shown that account for more than X% of the primary sort event.  If
      auto-annotating, also affects which files are annotated.<p>

      Note: thresholds can be set for more than one of the events by appending
      any events for the <code>--sort</code> option with a colon and a number
      (no spaces, though).  E.g. if you want to see the functions that cover
      99% of L2 read misses and 99% of L2 write misses, use this option:

      <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
      </li><p>

  <li><code>--auto=no</code> [default]<br>
      <code>--auto=yes</code> <p>
      When enabled, automatically annotates every file that is mentioned in the
      function-by-function summary that can be found.  Also gives a list of
      those that couldn't be found.</li><p>

  <li><code>--context=N</code> [default: 8]<p>
      Print N lines of context before and after each annotated line.  Avoids
      printing large sections of source files that were not executed.  Use a
      large number (eg. 10,000) to show all source lines.
      </li><p>

  <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
      [default: empty string]<p>
      Adds a directory to the list in which to search for files.  Multiple
      -I/--include options can be given to add multiple directories.</li>
</ul>
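<p>
The <code>--threshold</code> option can be read as a cumulative cut-off on the
primary sort event.  The Python sketch below illustrates that reading only;
the exact rule cg_annotate applies may differ, and the function name and the
sample counts are invented for this example.

```python
def functions_above_threshold(counts, threshold_percent):
    """Return function names, largest first, until threshold_percent is covered.

    counts maps 'file:function' to its count for the primary sort event.
    """
    total = sum(counts.values())
    shown, covered = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or 100.0 * covered / total >= threshold_percent:
            break
        shown.append(name)
        covered += count
    return shown


# Invented counts for illustration (1000 events in total).
counts = {"a.c:f": 900, "a.c:g": 90, "b.c:h": 9, "b.c:i": 1}
top = functions_above_threshold(counts, 99)
```

With these counts the first two functions already cover 99% of the events, so
the two smallest entries are left out of the summary.<p>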


<h3>4.10&nbsp; Warnings</h3>
There are a couple of situations in which cg_annotate issues warnings.

<ul>
  <li>If a source file is more recent than the
      <code>cachegrind.out.<i>pid</i></code> file.  This is because the
      information in <code>cachegrind.out.<i>pid</i></code> is only recorded
      with line numbers, so if the line numbers change at all in the source
      (eg. lines added, deleted, swapped), any annotations will be
      incorrect.</li><p>

  <li>If information is recorded about line numbers past the end of a file.
      This can be caused by the above problem, ie. shortening the source file
      while using an old <code>cachegrind.out.<i>pid</i></code> file.  If this
      happens, the figures for the bogus lines are printed anyway (clearly
      marked as bogus) in case they are important.</li><p>
</ul>
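<p>
The first warning amounts to a timestamp comparison.  A minimal Python sketch
of such a check follows; the function name <code>source_is_newer</code> is
made up, and using file modification times is an assumption about how
"more recent" is decided.

```python
import os


def source_is_newer(source_path, profile_path):
    """Return True if the source file was modified after the profile was written."""
    return os.path.getmtime(source_path) > os.path.getmtime(profile_path)
```

If this returns true for a file you are about to annotate, re-running the
profiling step is the safe option.<p>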


<h3>4.11&nbsp; Things to watch out for</h3>
Some odd things that can occur during annotation:

<ul>
  <li>If annotating at the assembler level, you might see something like this:

      <pre>
      1  0  0  .  .  .  .  .  .  leal -12(%ebp),%eax
      1  0  0  .  .  .  1  0  0  movl %eax,84(%ebx)
      2  0  0  0  0  0  1  0  0  movl $1,-20(%ebp)
      .  .  .  .  .  .  .  .  .  .align 4,0x90
      1  0  0  .  .  .  .  .  .  movl $.LnrB,%eax
      1  0  0  .  .  .  1  0  0  movl %eax,-16(%ebp)
      </pre>

      How can the third instruction be executed twice when the others are
      executed only once?  As it turns out, it isn't.  Here's a dump of the
      executable, using <code>objdump -d</code>:

      <pre>
      8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
      8048f28:       89 43 54                mov    %eax,0x54(%ebx)
      8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
      8048f32:       89 f6                   mov    %esi,%esi
      8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
      8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)
      </pre>

      Notice the extra <code>mov %esi,%esi</code> instruction.  Where did this
      come from?  The GNU assembler inserted it to serve as the two bytes of
      padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
      a four-byte boundary, but pretended it didn't exist when adding debug
      information.  Thus when Valgrind reads the debug info it thinks that the
      <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
      range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
      <code>mov %esi,%esi</code> to it.<p>
  </li>

  <li>Inlined functions can cause strange results in the function-by-function
      summary.  If a function <code>inline_me()</code> is defined in
      <code>foo.h</code> and inlined in the functions <code>f1()</code>,
      <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
      not be a <code>foo.h:inline_me()</code> function entry.  Instead, there
      will be separate function entries for each inlining site, ie.
      <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
      <code>foo.h:f3()</code>.  To find the total counts for
      <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>

      The reason for this is that although the debug info output by gcc
      indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
      doesn't indicate the name of the function in <code>foo.h</code>, so
      Valgrind keeps using the old one.<p>
  </li>

  <li>Sometimes, the same filename might be represented with a relative name
      and with an absolute name in different parts of the debug info, eg:
      <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>.  In this
      case, if you use auto-annotation, the file will be annotated twice with
      the counts split between the two.<p>
  </li>

  <li>Files with more than 65,535 lines cause difficulties for the stabs debug
      info reader.  This is because the line number in the <code>struct
      nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
      value.  Valgrind can handle some files with more than 65,535 lines
      correctly by making some guesses to identify line number overflows.  But
      some cases are beyond it, in which case you'll get a warning message
      explaining that annotations for the file might be incorrect.<p>
  </li>

  <li>If you compile some files with <code>-g</code> and some without, some
      events that take place in a file without debug info could be attributed
      to the last line of a file with debug info (whichever one gets placed
      before the non-debug-info file in the executable).<p>
  </li>
</ul>
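<p>
The padding-instruction case is easier to see with a toy model of how
addresses are matched to lines.  The Python sketch below is an illustration
only, not Valgrind's debug info reader: it assumes each line entry covers
every address up to the start of the next entry, and reuses the addresses
from the <code>objdump</code> listing in the first item.

```python
import bisect

# (start address, source line) pairs as a debug-info reader might see them.
# The padding instruction at 0x8048f32 has no entry of its own.
LINE_STARTS = [
    (0x8048F25, "leal -12(%ebp),%eax"),
    (0x8048F28, "movl %eax,84(%ebx)"),
    (0x8048F2B, "movl $1,-20(%ebp)"),
    (0x8048F34, "movl $.LnrB,%eax"),
    (0x8048F39, "movl %eax,-16(%ebp)"),
]


def line_for_address(address):
    """Attribute an instruction address to the closest preceding line entry."""
    starts = [start for start, _ in LINE_STARTS]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        raise LookupError("address precedes all known line entries")
    return LINE_STARTS[i][1]


# Each of the six real instructions is executed once.
executed = [0x8048F25, 0x8048F28, 0x8048F2B, 0x8048F32, 0x8048F34, 0x8048F39]
ir_counts = {}
for addr in executed:
    line = line_for_address(addr)
    ir_counts[line] = ir_counts.get(line, 0) + 1
```

The line for <code>movl $1,-20(%ebp)</code> ends up with a count of two
because the padding instruction at 0x8048f32 falls inside its range.<p>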

This list looks long, but these cases should be fairly rare.<p>

Note: stabs is not an easy format to read.  If you come across bizarre
annotations that look like they might be caused by a bug in the stabs reader,
please let us know.<p>


<h3>4.12&nbsp; Accuracy</h3>
Valgrind's cache profiling has a number of shortcomings:

<ul>
  <li>It doesn't account for kernel activity -- the effect of system calls on
      the cache contents is ignored.</li><p>

  <li>It doesn't account for other process activity (although this is probably
      desirable when considering a single program).</li><p>

  <li>It doesn't account for virtual-to-physical address mappings;  hence the
      entire simulation is not a true representation of what's happening in the
      cache.</li><p>

  <li>It doesn't account for cache misses not visible at the instruction level,
      eg. those arising from TLB misses, or speculative execution.</li><p>

  <li>Valgrind's custom <code>malloc()</code> will allocate memory in different
      ways to the standard <code>malloc()</code>, which could warp the results.
      </li><p>

  <li>Valgrind's custom threads implementation will schedule threads
      differently to the standard one.  This too could warp the results for
      threaded programs.
      </li><p>

  <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
      will incorrectly be counted as doing a data read if both the arguments
      are registers, eg:

      <blockquote><code>btsl %eax, %edx</code></blockquote>

      This should only happen rarely.
      </li><p>

  <li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
      <code>fsave</code>) are treated as though they only access 16 bytes.
      These instructions seem to be rare so hopefully this won't affect
      accuracy much.
      </li><p>
</ul>

Another thing worth noting is that results are very sensitive.  Changing the
size of the <code>valgrind.so</code> file, the size of the program being
profiled, or even the length of its name can perturb the results.  Variations
will be small, but don't expect perfectly repeatable results if your program
changes at all.<p>

While these factors mean you shouldn't trust the results to be super-accurate,
hopefully they should be close enough to be useful.<p>


<h3>4.13&nbsp; Todo</h3>
<ul>
  <li>Program start-up/shut-down calls a lot of functions that aren't
      interesting and just complicate the output.  Would be nice to exclude
      these somehow.</li>
  <p>
</ul>
</body>
</html>