<html>
 <head>
 <style type="text/css">
 body { background-color: #ffffff;
 color: #000000;
 font-family: Times, Helvetica, Arial;
 font-size: 14pt}
 h4 { margin-bottom: 0.3em}
 code { color: #000000;
 font-family: Courier;
 font-size: 13pt }
 pre { color: #000000;
 font-family: Courier;
 font-size: 13pt }
 a:link { color: #0000C0;
 text-decoration: none; }
 a:visited { color: #0000C0;
 text-decoration: none; }
 a:active { color: #0000C0;
 text-decoration: none; }
 </style>
 <title>Cachegrind</title>
 </head>

<body bgcolor="#ffffff">

<a name="title">&nbsp;</a>
<h1 align=center>Cachegrind, version 1.0.0</h1>
<center>This manual was last updated on 20020726</center>
<p>

<center>
<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
Copyright &copy; 2000-2002 Julian Seward
<p>
Cachegrind is licensed under the GNU General Public License,
version 2<br>
An open-source tool for profiling the cache behaviour of
Linux-x86 executables.
</center>

<p>

<hr width="100%">
<a name="contents"></a>
<h2>Contents of this manual</h2>

<h4>1&nbsp; <a href="#cache">How to use Cachegrind</a></h4>

<h4>2&nbsp; <a href="techdocs.html">How Cachegrind works</a></h4>

<hr width="100%">


<a name="cache"></a>
<h2>1&nbsp; Cache profiling</h2>
Cachegrind is a tool for doing cache simulations and annotating your source
line-by-line with the number of cache misses. In particular, it records:
<ul>
 <li>L1 instruction cache reads and misses;
 <li>L1 data cache reads and read misses, writes and write misses;
 <li>L2 unified cache reads and read misses, writes and write misses.
</ul>
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
very useful for improving the performance of your program.<p>

Also, since one instruction cache read is performed per instruction executed,
you can find out how many instructions are executed per line, which can be
useful for traditional profiling and test coverage.<p>

Any feedback, bug-fixes, suggestions, etc, welcome.


<h3>1.1&nbsp; Overview</h3>
First off, as for normal Valgrind use, you probably want to compile with
debugging info (the <code>-g</code> flag). But by contrast with normal
Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
should profile your program as it will normally be run.<p>

The two steps are:
<ol>
 <li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
 the normal command line invocation. When the program finishes,
 Valgrind will print summary cache statistics. It also collects
 line-by-line information in a file
 <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
 is the program's process id.
 <p>
 This step should be done every time you want to collect
 information about a new program, a changed program, or about the
 same program with different input.
 </li>
 <p>
 <li>Generate a function-by-function summary, and possibly annotate
 source files, with <code>cg_annotate</code>. Source files to annotate
 can be specified manually on the command line, or
 "interesting" source files can be annotated automatically with
 the <code>--auto=yes</code> option. You can annotate C/C++
 files or assembly language files equally easily.
 <p>
 This step can be performed as many times as you like for each
 Step 1. You may want to do multiple annotations showing
 different information each time.<p>
 </li>
</ol>

The steps are described in detail in the following sections.<p>


<h3>1.2&nbsp; Cache simulation specifics</h3>

Cachegrind uses a simulation for a machine with a split L1 cache and a unified
L2 cache. This configuration is used for all (modern) x86-based machines we
are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
ancient history now.<p>

The more specific characteristics of the simulation are as follows.

<ul>
 <li>Write-allocate: when a write miss occurs, the block written to
 is brought into the D1 cache. Most modern caches have this
 property.</li><p>

 <li>Bit-selection hash function: the line(s) in the cache to which a
 memory block maps is chosen by the middle bits M--(M+N-1) of the
 byte address, where:
 <ul>
 <li>&nbsp;line size = 2^M bytes&nbsp;</li>
 <li>(cache size / line size) = 2^N lines</li>
 </ul> </li><p>

 <li>Inclusive L2 cache: the L2 cache replicates all the entries of
 the L1 cache. This is standard on Pentium chips, but AMD
 Athlons use an exclusive L2 cache that only holds blocks evicted
 from L1. Ditto AMD Durons and most modern VIAs.</li><p>
</ul>
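
As a rough illustration of the bit-selection rule above, the following Python
sketch computes the index that a byte address maps to. The name
<code>line_index</code> and its parameters are made up for this example and
are not part of Cachegrind; the sketch follows the formula in the list
literally and ignores associativity.<p>

<pre>
```python
def line_index(addr, cache_size, line_size):
    """Return the index selected by bits M..(M+N-1) of a byte address.

    Hypothetical helper: line_size = 2**M bytes and
    cache_size / line_size = 2**N, as in the text above.
    """
    n_lines = cache_size // line_size
    # Bit selection only works when both quantities are powers of two.
    for value in (line_size, n_lines):
        if value <= 0 or value & (value - 1):
            raise ValueError("sizes must be powers of two")
    m = line_size.bit_length() - 1   # M: number of offset bits
    n = n_lines.bit_length() - 1     # N: number of index bits
    return (addr >> m) & ((1 << n) - 1)
```
</pre>

For a 65536-byte cache with 64-byte lines, M is 6 and N is 10, so byte
addresses 0 and 65536 select the same index and would compete for the same
place in the cache.<p>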

The cache configuration simulated (cache size, associativity and line size) is
determined automagically using the CPUID instruction. If you have an old
machine that (a) doesn't support the CPUID instruction, or (b) supports it in
an early incarnation that doesn't give any cache information, then Cachegrind
will fall back to using a default configuration (that of a model 3/4 Athlon).
Cachegrind will tell you if this happens. You can manually specify one, two or
all three levels (I1/D1/L2) of the cache from the command line using the
<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.<p>

Other noteworthy behaviour:

<ul>
 <li>References that straddle two cache lines are treated as follows:
 <ul>
 <li>If both blocks hit --&gt; counted as one hit</li>
 <li>If one block hits, the other misses --&gt; counted as one miss</li>
 <li>If both blocks miss --&gt; counted as one miss (not two)</li>
 </ul><p></li>

 <li>Instructions that modify a memory location (eg. <code>inc</code> and
 <code>dec</code>) are counted as doing just a read, ie. a single data
 reference. This may seem strange, but since the write can never cause a
 miss (the read guarantees the block is in the cache) it's not very
 interesting.<p>

 Thus it measures not the number of times the data cache is accessed, but
 the number of times a data cache miss could occur.<p>
 </li>
</ul>
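
The counting rule for references that straddle two cache lines can be sketched
as follows. This is only an illustration: <code>count_reference</code> and the
use of a plain Python set as a "cache" that never evicts anything are
simplifications invented for this example, not Cachegrind's implementation,
and the access is assumed to be no larger than one line.<p>

<pre>
```python
def blocks_touched(addr, size, line_size):
    """Return the set of line-sized blocks covered by an access.

    Assumes size is at most line_size, so at most two blocks are touched.
    """
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return {first, last}   # one block, or two if the access straddles


def count_reference(addr, size, line_size, cached_blocks):
    """Return the number of misses counted for one reference (0 or 1).

    cached_blocks is a set of block numbers currently "in the cache";
    it is updated in place and, unlike a real cache, never evicts.
    """
    blocks = blocks_touched(addr, size, line_size)
    missed = any(block not in cached_blocks for block in blocks)
    cached_blocks.update(blocks)   # missing blocks are brought in
    return 1 if missed else 0
```
</pre>

With 64-byte lines, a 4-byte access at address 62 touches blocks 0 and 1. On
an empty cache both blocks miss, yet the function returns 1 rather than 2,
matching the third rule in the list above.<p>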

If you are interested in simulating a cache with different properties, it is
not particularly hard to write your own cache simulator, or to modify the
existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
interested to hear from anyone who does.

<a name="profile"></a>
<h3>1.3&nbsp; Profiling programs</h3>

Cache profiling is enabled by using the <code>--skin=cachegrind</code>
option to the <code>valgrind</code> shell script. To gather cache profiling
information about the program <code>ls -l</code>, type:

<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>

The program will execute (slowly). Upon completion, summary statistics
that look like this will be printed:

<pre>
==31751== I refs: 27,742,716
==31751== I1 misses: 276
==31751== L2 misses: 275
==31751== I1 miss rate: 0.0%
==31751== L2i miss rate: 0.0%
==31751==
==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
==31751==
==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
</pre>

Cache accesses for instruction fetches are summarised first, giving the
number of fetches made (this is the number of instructions executed, which
can be useful to know in its own right), the number of I1 misses, and the
number of L2 instruction (<code>L2i</code>) misses.<p>

Cache accesses for data follow. The information is similar to that of the
instruction fetches, except that the values are also shown split between reads
and writes (note each row's <code>rd</code> and <code>wr</code> values add up
to the row's total).<p>

Combined instruction and data figures for the L2 cache follow that.<p>
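
To see how the miss rates relate to the raw counts, here is a small Python
sketch that recomputes two of them from the sample figures above. The helper
<code>miss_rate</code> is hypothetical, and taking the combined L2 rate over
all instruction and data references is an assumption that happens to be
consistent with the sample output.<p>

<pre>
```python
def miss_rate(misses, refs):
    """Return misses as a percentage of references (0.0 if no references)."""
    return 100.0 * misses / refs if refs else 0.0


# Figures copied from the sample summary above.
i_refs = 27_742_716
d_refs = 15_430_290
d1_misses = 41_185
l2_misses = 23_360

d1_rate = miss_rate(d1_misses, d_refs)
# Assumption: the combined L2 rate is taken over all I and D references.
l2_rate = miss_rate(l2_misses, i_refs + d_refs)
print(f"D1 miss rate: {d1_rate:.3f}%")
print(f"L2 miss rate: {l2_rate:.3f}%")
```
</pre>

The results, roughly 0.27% and 0.05%, agree with the 0.2% and 0.0% printed
above if the printed figures are truncated to one decimal place. That is an
inference from the numbers, not something this manual states.<p>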


<h3>1.4&nbsp; Output file</h3>

As well as printing summary information, Cachegrind also writes
line-by-line cache profiling information to a file named
<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
best interpreted by the accompanying program <code>cg_annotate</code>,
described in the next section.
<p>
Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
<ul>
 <li>It is written every time <code>valgrind --skin=cachegrind</code>
 is run, and will overwrite any existing
 <code>cachegrind.out.<i>pid</i></code> in the current directory (but
 that won't happen very often because it takes some time for process ids
 to be recycled).</li>
 <p>
 <li>It can be huge: <code>ls -l</code> generates a file of about
 350KB. Browsing a few files and web pages with a Konqueror
 built with full debugging information generates a file
 of around 15 MB.</li>
</ul>

Note that older versions of Cachegrind used a log file named
<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
The suffix serves two purposes. Firstly, it means you don't have to rename old
log files that you don't want to overwrite. Secondly, and more importantly,
it allows correct profiling with the <code>--trace-children=yes</code> option
of programs that spawn child processes.

<a name="profileflags"></a>
<h3>1.5&nbsp; Cachegrind options</h3>
Cachegrind accepts all the options that Valgrind does, although some of them
(ones related to memory checking) don't do anything when cache profiling.<p>

The interesting cache-simulation specific options are:

<ul>
 <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
 <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
 <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
 [default: uses CPUID for automagic cache configuration]<p>

 Manually specifies the I1/D1/L2 cache configuration, where
 <code>size</code> and <code>line_size</code> are measured in bytes. The
 three items must be comma-separated, but with no spaces, eg:

 <blockquote>
 <code>valgrind --skin=cachegrind --I1=65536,2,64</code>
 </blockquote>

 You can specify one, two or three of the I1/D1/L2 caches. Any level not
 manually specified will be simulated using the configuration found in the
 normal way (via the CPUID instruction, or failing that, via defaults).
</ul>
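
The format of these option values is simple enough to check by hand. The
Python sketch below shows one way such a value could be validated;
<code>parse_cache_option</code> is a hypothetical helper, not the parser
Valgrind uses, and the power-of-two check on the line size is an assumption
based on the bit-selection scheme described in section 1.2.<p>

<pre>
```python
def parse_cache_option(value):
    """Parse "size,associativity,line_size" into a tuple of three ints.

    Hypothetical parser for illustration only. Spaces are rejected
    because the three items must be comma-separated with no spaces.
    """
    parts = value.split(",")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError("expected size,associativity,line_size")
    size, assoc, line_size = (int(p) for p in parts)
    if size == 0 or assoc == 0 or line_size == 0:
        raise ValueError("all three values must be positive")
    # Assumption: bit selection needs a power-of-two line size.
    if line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    return size, assoc, line_size
```
</pre>

For instance, <code>parse_cache_option("65536,2,64")</code> yields the tuple
<code>(65536, 2, 64)</code>, while a value containing a space or only two
items raises <code>ValueError</code>.<p>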


<a name="annotate"></a>
<h3>1.6&nbsp; Annotating C/C++ programs</h3>

Before using <code>cg_annotate</code>, it is worth widening your
window to be at least 120 characters wide if possible, as the output
lines can be quite long.
<p>
To get a function-by-function summary, run <code>cg_annotate
--<i>pid</i></code> in a directory containing a
<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
is required so that <code>cg_annotate</code> knows which log file to use when
several are present.
<p>
The output looks like this:

<pre>
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
L2 cache: 262144 B, 64 B, 8-way associative
Command: concord vg_to_ucode.c
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold: 99%
Chosen for annotation:
Auto-annotation: on

--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
--------------------------------------------------------------------------------
27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
--------------------------------------------------------------------------------
8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
 897,991 51 51 897,831 95 30 62 1 1 ???:???
 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
</pre>

First up is a summary of the annotation options:

<ul>
 <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the
 configuration with which these results were obtained.</li><p>

 <li>Command: the command line invocation of the program under
 examination.</li><p>

 <li>Events recorded: event abbreviations are:<p>
 <ul>
 <li><code>Ir </code>: I cache reads (ie. instructions executed)</li>
 <li><code>I1mr</code>: I1 cache read misses</li>
 <li><code>I2mr</code>: L2 cache instruction read misses</li>
 <li><code>Dr </code>: D cache reads (ie. memory reads)</li>
 <li><code>D1mr</code>: D1 cache read misses</li>
 <li><code>D2mr</code>: L2 cache data read misses</li>
 <li><code>Dw </code>: D cache writes (ie. memory writes)</li>
 <li><code>D1mw</code>: D1 cache write misses</li>
 <li><code>D2mw</code>: L2 cache data write misses</li>
 </ul><p>
 Note that D1 total misses is given by <code>D1mr</code> +
 <code>D1mw</code>, and that L2 total misses is given by
 <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>

 <li>Events shown: the events shown (a subset of events gathered). This can
 be adjusted with the <code>--show</code> option.</li><p>

 <li>Event sort order: the sort order in which functions are shown. For
 example, in this case the functions are sorted from highest
 <code>Ir</code> counts to lowest. If two functions have identical
 <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
 counts, and so on. This order can be adjusted with the
 <code>--sort</code> option.<p>

 Note that this dictates the order the functions appear. It is <b>not</b>
 the order in which the columns appear; that is dictated by the "events
 shown" line (and can be changed with the <code>--show</code> option).
 </li><p>

 <li>Threshold: <code>cg_annotate</code> by default omits functions
 that cause very low numbers of misses to avoid drowning you in
 information. In this case, cg_annotate shows summaries for the
 functions that account for 99% of the <code>Ir</code> counts;
 <code>Ir</code> is chosen as the threshold event since it is the
 primary sort event. The threshold can be adjusted with the
 <code>--threshold</code> option.</li><p>

 <li>Chosen for annotation: names of files specified manually for annotation;
 in this case none.</li><p>

 <li>Auto-annotation: whether auto-annotation was requested via the
 <code>--auto=yes</code> option. In this case no.</li><p>
</ul>
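
The interaction between the event sort order and the threshold described in
the list above can be sketched in Python. Both functions are hypothetical and
only approximate what <code>cg_annotate</code> does; in particular, treating
a "." entry as zero and cutting off as soon as the running total reaches the
threshold are assumptions of this sketch.<p>

<pre>
```python
def sort_functions(counts, sort_events):
    """Sort function names by the given events, highest counts first.

    counts maps "file:function" to a dict of event name -> count.
    Events a function never performs (shown as "." in the output) are
    treated as 0 here, which is an assumption of this sketch.
    """
    def key(name):
        return tuple(counts[name].get(event, 0) for event in sort_events)
    return sorted(counts, key=key, reverse=True)


def apply_threshold(names, counts, event, threshold_pct):
    """Keep leading functions until they cover threshold_pct of event."""
    total = sum(c.get(event, 0) for c in counts.values())
    shown, covered = [], 0
    for name in names:
        if covered * 100 >= threshold_pct * total:
            break
        shown.append(name)
        covered += counts[name].get(event, 0)
    return shown
```
</pre>

In the sample output, <code>__flockfile</code> and <code>__funlockfile</code>
have identical <code>Ir</code> counts (598,068), so the second sort event,
<code>I1mr</code>, decides that <code>__flockfile</code> (1 miss) is listed
before <code>__funlockfile</code> (0 misses).<p>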

Then follow summary statistics for the whole program. These are similar
to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>

Then follow function-by-function statistics. Each function is
identified by a <code>file_name:function_name</code> pair. If a column
contains only a dot it means the function never performs
that event (eg. the third row shows that <code>strcmp()</code>
contains no instructions that write to memory). The name
<code>???</code> is used if the file name and/or function name
could not be determined from debugging information. If most of the
entries have the form <code>???:???</code> the program probably wasn't
compiled with <code>-g</code>. If any code was invalidated (either due to
self-modifying code or unloading of shared objects) its counts are aggregated
into a single cost centre written as <code>(discarded):(discarded)</code>.<p>

It is worth noting that functions will come from three types of source files:
<ol>
 <li>From the profiled program (<code>concord.c</code> in this example).</li>
 <li>From libraries (eg. <code>getc.c</code>)</li>
 <li>From Valgrind's implementation of some libc functions (eg.
 <code>vg_clientmalloc.c:malloc</code>). These are recognisable because
 the filename begins with <code>vg_</code>, and is probably one of
 <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
 <code>vg_mylibc.c</code>.
 </li>
</ol>

There are two ways to annotate source files -- by choosing them
manually, or with the <code>--auto=yes</code> option. To do it
manually, just specify the filenames as arguments to
<code>cg_annotate</code>. For example, running
<code>cg_annotate --<i>pid</i> concord.c</code> for our example produces
the same output as above, followed by an annotated version of
<code>concord.c</code>, a section of which looks like:

<pre>
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

[snip]

 . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
 3 1 1 . . . 1 0 0 {
 . . . . . . . . . FILE *file_ptr;
 . . . . . . . . . Word_Info *data;
 1 0 0 . . . 1 1 1 int line = 1, i;
 . . . . . . . . .
 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
 . . . . . . . . .
 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
 . . . . . . . . .
 . . . . . . . . . /* Open file, check it. */
 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
 2 0 0 1 0 0 . . . if (!(file_ptr)) {
 . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
 1 1 1 . . . . . . exit(EXIT_FAILURE);
 . . . . . . . . . }
 . . . . . . . . .
 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
 . . . . . . . . .
 4 0 0 1 0 0 2 0 0 free(data);
 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
 3 0 0 2 0 0 . . . }
</pre>

(Although column widths are automatically minimised, a wide terminal is clearly
useful.)<p>

Each source file is clearly marked (<code>User-annotated source</code>) as
having been chosen manually for annotation. If the file was found in one of
the directories specified with the <code>-I</code>/<code>--include</code>
option, the directory and file are both given.<p>

Each line is annotated with its event counts. Events not applicable for a line
are represented by a `.'; this is useful for distinguishing between an event
which cannot happen, and one which can but did not.<p>

Sometimes only a small section of a source file is executed. To minimise
uninteresting output, Valgrind only shows annotated lines and lines within a
small distance of annotated lines. Gaps are marked with the line numbers so
you know which part of a file the shown code comes from, eg:

<pre>
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)
</pre>

The amount of context to show around annotated lines is controlled by the
<code>--context</code> option.<p>
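
The effect of the context setting can be pictured with a short Python sketch.
The functions <code>lines_shown</code> and <code>gaps</code> are hypothetical
names for this example; the sketch assumes that a line is printed when it lies
within the given number of lines of any annotated line, which is a plausible
reading of the description above rather than a statement of how
<code>cg_annotate</code> is implemented.<p>

<pre>
```python
def lines_shown(annotated_lines, context, last_line):
    """Return the sorted line numbers printed for one source file.

    annotated_lines: line numbers that have event counts.
    context: lines of context before and after each annotated line.
    last_line: number of lines in the file.
    """
    shown = set()
    for line in annotated_lines:
        first = max(1, line - context)
        final = min(last_line, line + context)
        shown.update(range(first, final + 1))
    return sorted(shown)


def gaps(shown):
    """Return (previous, next) pairs where consecutive shown lines skip."""
    return [(a, b) for a, b in zip(shown, shown[1:]) if b - a > 1]
```
</pre>

With 8 lines of context and annotated lines 696 and 886, lines 688 to 704 and
878 to 894 are shown, leaving a single gap between lines 704 and 878 like the
one in the example above.<p>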

To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
cg_annotate will automatically annotate every source file it can find that is
mentioned in the function-by-function summary. Therefore, the files chosen for
auto-annotation are affected by the <code>--sort</code> and
<code>--threshold</code> options. Each source file is clearly marked
(<code>Auto-annotated source</code>) as being chosen automatically. Any files
that could not be found are mentioned at the end of the output, eg:

<pre>
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
 getc.c
 ctype.c
 ../sysdeps/generic/lockfile.c
</pre>

This is quite common for library files, since libraries are usually compiled
with debugging information, but the source files are often not present on a
system. If a file is chosen for annotation <b>both</b> manually and
automatically, it is marked as <code>User-annotated source</code>.

Use the <code>-I/--include</code> option to tell Valgrind where to look for
source files if the filenames found from the debugging information aren't
specific enough.

Beware that cg_annotate can take some time to digest large
<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
beware that auto-annotation can produce a lot of output if your program is
large!


<h3>1.7&nbsp; Annotating assembler programs</h3>

Valgrind can annotate assembler programs too, or annotate the
assembler generated for your C program. Sometimes this is useful for
understanding what is really happening when an interesting line of C
code is translated into multiple instructions.<p>

To do this, you just need to assemble your <code>.s</code> files with
assembler-level debug information. gcc doesn't do this, but you can
use the GNU assembler with the <code>--gstabs</code> option to
generate object files with this information, eg:

<blockquote><code>as --gstabs foo.s</code></blockquote>

You can then profile and annotate source files in the same way as for C/C++
programs.


<h3>1.8&nbsp; <code>cg_annotate</code> options</h3>
<ul>
 <li><code>--<i>pid</i></code></li><p>

 Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
 Not actually an option -- it is required.

 <li><code>-h, --help</code></li><p>
 <li><code>-v, --version</code><p>

 Help and version, as usual.</li>

 <li><code>--sort=A,B,C</code> [default: order in
 <code>cachegrind.out.<i>pid</i></code>]<p>
 Specifies the events upon which the sorting of the function-by-function
 entries will be based. Useful if you want to concentrate on eg. I cache
 misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
 (<code>--sort=D1mr,D2mr</code>), or L2 misses
 (<code>--sort=D2mr,I2mr</code>).</li><p>

 <li><code>--show=A,B,C</code> [default: all, using order in
 <code>cachegrind.out.<i>pid</i></code>]<p>
 Specifies which events to show (and the column order). Default is to use
 all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
 the order in the file).</li><p>

 <li><code>--threshold=X</code> [default: 99%] <p>
 Sets the threshold for the function-by-function summary. Functions are
 shown that account for more than X% of the primary sort event. If
 auto-annotating, also affects which files are annotated.

 Note: thresholds can be set for more than one of the events by following
 any event in the <code>--sort</code> option with a colon and a number
 (no spaces, though). E.g. if you want to see the functions that cover
 99% of L2 read misses and 99% of L2 write misses, use this option:

 <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
 </li><p>

 <li><code>--auto=no</code> [default]<br>
 <code>--auto=yes</code> <p>
 When enabled, automatically annotates every file that is mentioned in the
 function-by-function summary that can be found. Also gives a list of
 those that couldn't be found.

 <li><code>--context=N</code> [default: 8]<p>
 Print N lines of context before and after each annotated line. Avoids
 printing large sections of source files that were not executed. Use a
 large number (eg. 10,000) to show all source lines.
 </li><p>

 <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
 [default: empty string]<p>
 Adds a directory to the list in which to search for files. Multiple
 -I/--include options can be given to add multiple directories.
</ul>
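
The <code>event:number</code> form accepted by <code>--sort</code> in the
list above can be taken apart as in this Python sketch.
<code>parse_sort_spec</code> is a hypothetical helper written for this
manual, not code from <code>cg_annotate</code>; events given without a number
are returned with <code>None</code>, since how defaults are applied to them
is not described here.<p>

<pre>
```python
def parse_sort_spec(spec):
    """Split a --sort value such as "D2mr:99,D2mw:99" into pairs.

    Returns a list of (event, threshold) tuples, where threshold is an
    int, or None when the event was given without ":number".
    """
    pairs = []
    for item in spec.split(","):
        event, sep, number = item.partition(":")
        if not event or (sep and not number.isdigit()):
            raise ValueError(f"bad sort item: {item!r}")
        pairs.append((event, int(number) if sep else None))
    return pairs
```
</pre>

For example, <code>parse_sort_spec("D2mr:99,D2mw:99")</code> gives
<code>[("D2mr", 99), ("D2mw", 99)]</code>.<p>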


<h3>1.9&nbsp; Warnings</h3>
There are a couple of situations in which cg_annotate issues warnings.

<ul>
 <li>If a source file is more recent than the
 <code>cachegrind.out.<i>pid</i></code> file. This is because the
 information in <code>cachegrind.out.<i>pid</i></code> is only recorded
 with line numbers, so if the line numbers change at all in the source
 (eg. lines added, deleted, swapped), any annotations will be
 incorrect.<p>

 <li>If information is recorded about line numbers past the end of a file.
 This can be caused by the above problem, ie. shortening the source file
 while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
 happens, the figures for the bogus lines are printed anyway (clearly
 marked as bogus) in case they are important.</li><p>
</ul>


<h3>1.10&nbsp; Things to watch out for</h3>
Some odd things that can occur during annotation:

<ul>
 <li>If annotating at the assembler level, you might see something like this:

 <pre>
 1 0 0 . . . . . . leal -12(%ebp),%eax
 1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
 . . . . . . . . . .align 4,0x90
 1 0 0 . . . . . . movl $.LnrB,%eax
 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
 </pre>

 How can the third instruction be executed twice when the others are
 executed only once? As it turns out, it isn't. Here's a dump of the
 executable, using <code>objdump -d</code>:

 <pre>
 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
 8048f28: 89 43 54 mov %eax,0x54(%ebx)
 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
 8048f32: 89 f6 mov %esi,%esi
 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
 </pre>

 Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
 come from? The GNU assembler inserted it to serve as the two bytes of
 padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
 a four-byte boundary, but pretended it didn't exist when adding debug
 information. Thus when Valgrind reads the debug info it thinks that the
 <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
 range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
 <code>mov %esi,%esi</code> to it.<p>
 </li>

 <li>Inlined functions can cause strange results in the function-by-function
 summary. If a function <code>inline_me()</code> is defined in
 <code>foo.h</code> and inlined in the functions <code>f1()</code>,
 <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
 not be a <code>foo.h:inline_me()</code> function entry. Instead, there
 will be separate function entries for each inlining site, ie.
 <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
 <code>foo.h:f3()</code>. To find the total counts for
 <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>

 The reason for this is that although the debug info output by gcc
 indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
 doesn't indicate the name of the function in <code>foo.h</code>, so
 Valgrind keeps using the old one.<p>

 <li>Sometimes, the same filename might be represented with a relative name
 and with an absolute name in different parts of the debug info, eg:
 <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
 case, if you use auto-annotation, the file will be annotated twice with
 the counts split between the two.<p>
 </li>

 <li>Files with more than 65,535 lines cause difficulties for the stabs debug
 info reader. This is because the line number in the <code>struct
 nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
 value. Valgrind can handle some files with more than 65,535 lines
 correctly by making some guesses to identify line number overflows. But
 some cases are beyond it, in which case you'll get a warning message
 explaining that annotations for the file might be incorrect.<p>
 </li>

 <li>If you compile some files with <code>-g</code> and some without, some
 events that take place in a file without debug info could be attributed
 to the last line of a file with debug info (whichever one gets placed
 before the non-debug-info file in the executable).<p>
 </li>
</ul>

This list looks long, but these cases should be fairly rare.<p>

Note: stabs is not an easy format to read. If you come across bizarre
annotations that look like they might be caused by a bug in the stabs reader,
please let us know.<p>
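
To make the 16-bit line number problem from the list above concrete, the
following Python sketch shows how a large line number wraps around, along
with one naive way of guessing the original values.
<code>guess_lines</code> is a hypothetical heuristic invented for this
example and is not the guessing that Valgrind actually performs; it assumes
line numbers never decrease, which real debug info does not guarantee, and
that is one reason some cases cannot be recovered.<p>

<pre>
```python
def stored_line(actual_line):
    """Return the value that fits in a 16-bit line-number field."""
    return actual_line & 0xFFFF


def guess_lines(stored):
    """Guess real line numbers from a sequence of 16-bit values.

    Hypothetical heuristic: assume the real numbers never decrease, so
    any drop in the stored value means the field wrapped around once.
    """
    guessed, wraps, previous = [], 0, None
    for value in stored:
        if previous is not None and value < previous:
            wraps += 1
        guessed.append(value + wraps * 0x10000)
        previous = value
    return guessed
```
</pre>

Line 70,000 is stored as 70,000 - 65,536 = 4,464, so without some guesswork
its counts would be attributed to line 4,464 of the same file.<p>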


<h3>1.11&nbsp; Accuracy</h3>
Valgrind's cache profiling has a number of shortcomings:

<ul>
 <li>It doesn't account for kernel activity -- the effect of system calls on
 the cache contents is ignored.</li><p>

 <li>It doesn't account for other process activity (although this is probably
 desirable when considering a single program).</li><p>

 <li>It doesn't account for virtual-to-physical address mappings; hence the
 entire simulation is not a true representation of what's happening in the
 cache.</li><p>

 <li>It doesn't account for cache misses not visible at the instruction level,
 eg. those arising from TLB misses, or speculative execution.</li><p>

 <li>Valgrind's custom <code>malloc()</code> will allocate memory in different
 ways to the standard <code>malloc()</code>, which could warp the results.
 </li><p>

 <li>Valgrind's custom threads implementation will schedule threads
 differently to the standard one. This too could warp the results for
 threaded programs.
 </li><p>

 <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
 will incorrectly be counted as doing a data read if both the arguments
 are registers, eg:

 <blockquote><code>btsl %eax, %edx</code></blockquote>

 This should only happen rarely.
 </li><p>

 <li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
 <code>fsave</code>) are treated as though they only access 16 bytes.
 These instructions seem to be rare so hopefully this won't affect
 accuracy much.
 </li><p>
</ul>

Another thing worth noting is that results are very sensitive. Changing the
size of the <code>valgrind.so</code> file, the size of the program being
profiled, or even the length of its name can perturb the results. Variations
will be small, but don't expect perfectly repeatable results if your program
changes at all.<p>

While these factors mean you shouldn't trust the results to be super-accurate,
hopefully they should be close enough to be useful.<p>


<h3>1.12&nbsp; Todo</h3>
<ul>
 <li>Program start-up/shut-down calls a lot of functions that aren't
 interesting and just complicate the output. Would be nice to exclude
 these somehow.</li>
 <p>
</ul>
<hr width="100%">
</body>
</html>
752