blob: 6449e81e12c31c6951131821977c095f132cd330 [file] [log] [blame]
sewardja9a2dcf2002-11-11 00:20:07 +00001<html>
2 <head>
sewardjf555ac72002-11-18 00:07:28 +00003 <title>Cachegrind: a cache-miss profiler</title>
sewardja9a2dcf2002-11-11 00:20:07 +00004 </head>
5
sewardjf555ac72002-11-18 00:07:28 +00006<body>
7<a name="cg-top"></a>
8<h2>4&nbsp; <b>Cachegrind</b>: a cache-miss profiler</h2>
sewardja9a2dcf2002-11-11 00:20:07 +00009
sewardjf555ac72002-11-18 00:07:28 +000010To use this skin, you must specify <code>--skin=cachegrind</code>
11on the Valgrind command line.
sewardja9a2dcf2002-11-11 00:20:07 +000012
13<p>
sewardjf555ac72002-11-18 00:07:28 +000014Detailed technical documentation on how Cachegrind works is available
15<A HREF="cg_techdocs.html">here</A>. If you want to know how
16to <b>use</b> it, you only need to read this page.
sewardja9a2dcf2002-11-11 00:20:07 +000017
18
19<a name="cache"></a>
sewardjf555ac72002-11-18 00:07:28 +000020<h3>4.1&nbsp; Cache profiling</h3>
21Cachegrind is a tool for doing cache simulations and annotating your source
sewardja9a2dcf2002-11-11 00:20:07 +000022line-by-line with the number of cache misses. In particular, it records:
23<ul>
24 <li>L1 instruction cache reads and misses;
25 <li>L1 data cache reads and read misses, writes and write misses;
26 <li>L2 unified cache reads and read misses, writes and writes misses.
27</ul>
28On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
29and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
30very useful for improving the performance of your program.<p>
31
32Also, since one instruction cache read is performed per instruction executed,
33you can find out how many instructions are executed per line, which can be
34useful for traditional profiling and test coverage.<p>
35
36Any feedback, bug-fixes, suggestions, etc, welcome.
37
38
sewardjf555ac72002-11-18 00:07:28 +000039<h3>4.2&nbsp; Overview</h3>
sewardja9a2dcf2002-11-11 00:20:07 +000040First off, as for normal Valgrind use, you probably want to compile with
41debugging info (the <code>-g</code> flag). But by contrast with normal
42Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
43should profile your program as it will be normally run.
44
45The two steps are:
46<ol>
47 <li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
48 the normal command line invocation. When the program finishes,
sewardjf555ac72002-11-18 00:07:28 +000049 Cachegrind will print summary cache statistics. It also collects
sewardja9a2dcf2002-11-11 00:20:07 +000050 line-by-line information in a file
51 <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
52 is the program's process id.
53 <p>
54 This step should be done every time you want to collect
55 information about a new program, a changed program, or about the
56 same program with different input.
njn3e87f7e2003-04-08 11:08:45 +000057 </li><p>
sewardja9a2dcf2002-11-11 00:20:07 +000058 <li>Generate a function-by-function summary, and possibly annotate
sewardjf555ac72002-11-18 00:07:28 +000059 source files, using the supplied
60 <code>cg_annotate</code> program. Source files to annotate can be
sewardja9a2dcf2002-11-11 00:20:07 +000061 specified manually, or manually on the command line, or
62 "interesting" source files can be annotated automatically with
63 the <code>--auto=yes</code> option. You can annotate C/C++
64 files or assembly language files equally easily.
65 <p>
66 This step can be performed as many times as you like for each
67 Step 2. You may want to do multiple annotations showing
njn3e87f7e2003-04-08 11:08:45 +000068 different information each time.
69 </li><p>
sewardja9a2dcf2002-11-11 00:20:07 +000070</ol>
71
njn3e87f7e2003-04-08 11:08:45 +000072The steps are described in detail in the following sections.
sewardja9a2dcf2002-11-11 00:20:07 +000073
74
njn3e87f7e2003-04-08 11:08:45 +000075<h3>4.3&nbsp; Cache simulation specifics</h3>
sewardja9a2dcf2002-11-11 00:20:07 +000076
77Cachegrind uses a simulation for a machine with a split L1 cache and a unified
78L2 cache. This configuration is used for all (modern) x86-based machines we
79are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
80ancient history now.<p>
81
82The more specific characteristics of the simulation are as follows.
83
84<ul>
85 <li>Write-allocate: when a write miss occurs, the block written to
86 is brought into the D1 cache. Most modern caches have this
njn3e87f7e2003-04-08 11:08:45 +000087 property.<p>
88 </li>
89 <p>
sewardja9a2dcf2002-11-11 00:20:07 +000090 <li>Bit-selection hash function: the line(s) in the cache to which a
91 memory block maps is chosen by the middle bits M--(M+N-1) of the
92 byte address, where:
93 <ul>
94 <li>&nbsp;line size = 2^M bytes&nbsp;</li>
95 <li>(cache size / line size) = 2^N bytes</li>
njn3e87f7e2003-04-08 11:08:45 +000096 </ul>
97 </li>
98 <p>
sewardja9a2dcf2002-11-11 00:20:07 +000099 <li>Inclusive L2 cache: the L2 cache replicates all the entries of
100 the L1 cache. This is standard on Pentium chips, but AMD
101 Athlons use an exclusive L2 cache that only holds blocks evicted
njn3e87f7e2003-04-08 11:08:45 +0000102 from L1. Ditto AMD Durons and most modern VIAs.</li>
sewardja9a2dcf2002-11-11 00:20:07 +0000103</ul>
104
105The cache configuration simulated (cache size, associativity and line size) is
106determined automagically using the CPUID instruction. If you have an old
107machine that (a) doesn't support the CPUID instruction, or (b) supports it in
108an early incarnation that doesn't give any cache information, then Cachegrind
109will fall back to using a default configuration (that of a model 3/4 Athlon).
110Cachegrind will tell you if this happens. You can manually specify one, two or
111all three levels (I1/D1/L2) of the cache from the command line using the
njn3e87f7e2003-04-08 11:08:45 +0000112<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.
sewardja9a2dcf2002-11-11 00:20:07 +0000113
njn3e87f7e2003-04-08 11:08:45 +0000114<p>
sewardja9a2dcf2002-11-11 00:20:07 +0000115Other noteworthy behaviour:
116
117<ul>
118 <li>References that straddle two cache lines are treated as follows:
119 <ul>
120 <li>If both blocks hit --&gt; counted as one hit</li>
121 <li>If one block hits, the other misses --&gt; counted as one miss</li>
122 <li>If both blocks miss --&gt; counted as one miss (not two)</li>
njn3e87f7e2003-04-08 11:08:45 +0000123 </ul>
124 </li>
sewardja9a2dcf2002-11-11 00:20:07 +0000125
126 <li>Instructions that modify a memory location (eg. <code>inc</code> and
127 <code>dec</code>) are counted as doing just a read, ie. a single data
128 reference. This may seem strange, but since the write can never cause a
129 miss (the read guarantees the block is in the cache) it's not very
njn3e87f7e2003-04-08 11:08:45 +0000130 interesting.
131 <p>
sewardja9a2dcf2002-11-11 00:20:07 +0000132 Thus it measures not the number of times the data cache is accessed, but
133 the number of times a data cache miss could occur.<p>
134 </li>
135</ul>
136
137If you are interested in simulating a cache with different properties, it is
138not particularly hard to write your own cache simulator, or to modify the
139existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
140<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
141interested to hear from anyone who does.
142
sewardjf555ac72002-11-18 00:07:28 +0000143
sewardja9a2dcf2002-11-11 00:20:07 +0000144<a name="profile"></a>
sewardjf555ac72002-11-18 00:07:28 +0000145<h3>4.4&nbsp; Profiling programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000146
147Cache profiling is enabled by using the <code>--skin=cachegrind</code>
148option to the <code>valgrind</code> shell script. To gather cache profiling
149information about the program <code>ls -l</code>, type:
150
151<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>
152
153The program will execute (slowly). Upon completion, summary statistics
154that look like this will be printed:
155
156<pre>
157==31751== I refs: 27,742,716
158==31751== I1 misses: 276
159==31751== L2 misses: 275
160==31751== I1 miss rate: 0.0%
161==31751== L2i miss rate: 0.0%
162==31751==
163==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
164==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
165==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
166==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
167==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
168==31751==
169==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
170==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
171</pre>
172
173Cache accesses for instruction fetches are summarised first, giving the
174number of fetches made (this is the number of instructions executed, which
175can be useful to know in its own right), the number of I1 misses, and the
njn3e87f7e2003-04-08 11:08:45 +0000176number of L2 instruction (<code>L2i</code>) misses.
177<p>
sewardja9a2dcf2002-11-11 00:20:07 +0000178Cache accesses for data follow. The information is similar to that of the
179instruction fetches, except that the values are also shown split between reads
180and writes (note each row's <code>rd</code> and <code>wr</code> values add up
njn3e87f7e2003-04-08 11:08:45 +0000181to the row's total).
182<p>
183Combined instruction and data figures for the L2 cache follow that.
sewardja9a2dcf2002-11-11 00:20:07 +0000184
185
sewardjf555ac72002-11-18 00:07:28 +0000186<h3>4.5&nbsp; Output file</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000187
188As well as printing summary information, Cachegrind also writes
189line-by-line cache profiling information to a file named
190<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
191best interpreted by the accompanying program <code>cg_annotate</code>,
192described in the next section.
193<p>
194Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
195<ul>
196 <li>It is written every time <code>valgrind --skin=cachegrind</code>
197 is run, and will overwrite any existing
198 <code>cachegrind.out.<i>pid</i></code> in the current directory (but
199 that won't happen very often because it takes some time for process ids
njn3e87f7e2003-04-08 11:08:45 +0000200 to be recycled).</li><p>
sewardja9a2dcf2002-11-11 00:20:07 +0000201 <li>It can be huge: <code>ls -l</code> generates a file of about
202 350KB. Browsing a few files and web pages with a Konqueror
203 built with full debugging information generates a file
204 of around 15 MB.</li>
205</ul>
206
207Note that older versions of Cachegrind used a log file named
208<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
sewardjf555ac72002-11-18 00:07:28 +0000209The suffix serves two purposes. Firstly, it means you don't have to
210rename old log files that you don't want to overwrite. Secondly, and
211more importantly, it allows correct profiling with the
212<code>--trace-children=yes</code> option of programs that spawn child
213processes.
214
sewardja9a2dcf2002-11-11 00:20:07 +0000215
216<a name="profileflags"></a>
sewardjf555ac72002-11-18 00:07:28 +0000217<h3>4.6&nbsp; Cachegrind options</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000218
sewardjf555ac72002-11-18 00:07:28 +0000219Cache-simulation specific options are:
sewardja9a2dcf2002-11-11 00:20:07 +0000220
221<ul>
222 <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
223 <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
224 <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
225 [default: uses CPUID for automagic cache configuration]<p>
226
227 Manually specifies the I1/D1/L2 cache configuration, where
228 <code>size</code> and <code>line_size</code> are measured in bytes. The
229 three items must be comma-separated, but with no spaces, eg:
230
231 <blockquote>
232 <code>valgrind --skin=cachegrind --I1=65535,2,64</code>
233 </blockquote>
234
235 You can specify one, two or three of the I1/D1/L2 caches. Any level not
236 manually specified will be simulated using the configuration found in the
237 normal way (via the CPUID instruction, or failing that, via defaults).
238</ul>
239
240
241<a name="annotate"></a>
sewardjf555ac72002-11-18 00:07:28 +0000242<h3>4.7&nbsp; Annotating C/C++ programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000243
244Before using <code>cg_annotate</code>, it is worth widening your
245window to be at least 120-characters wide if possible, as the output
246lines can be quite long.
247<p>
248To get a function-by-function summary, run <code>cg_annotate
249--<i>pid</i></code> in a directory containing a
250<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
251is required so that <code>cg_annotate</code> knows which log file to use when
252several are present.
253<p>
254The output looks like this:
255
256<pre>
257--------------------------------------------------------------------------------
258I1 cache: 65536 B, 64 B, 2-way associative
259D1 cache: 65536 B, 64 B, 2-way associative
260L2 cache: 262144 B, 64 B, 8-way associative
261Command: concord vg_to_ucode.c
262Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
263Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
264Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
265Threshold: 99%
266Chosen for annotation:
267Auto-annotation: on
268
269--------------------------------------------------------------------------------
270Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
271--------------------------------------------------------------------------------
27227,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
273
274--------------------------------------------------------------------------------
275Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
276--------------------------------------------------------------------------------
2778,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
2785,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
2792,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
2802,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
2812,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
2821,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
283 897,991 51 51 897,831 95 30 62 1 1 ???:???
284 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
285 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
286 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
287 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
288 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
289 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
290 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
291 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
292 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
293 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
294 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
295</pre>
296
297First up is a summary of the annotation options:
298
299<ul>
300 <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the
301 configuration with which these results were obtained.</li><p>
302
303 <li>Command: the command line invocation of the program under
304 examination.</li><p>
305
306 <li>Events recorded: event abbreviations are:<p>
307 <ul>
308 <li><code>Ir </code>: I cache reads (ie. instructions executed)</li>
309 <li><code>I1mr</code>: I1 cache read misses</li>
310 <li><code>I2mr</code>: L2 cache instruction read misses</li>
311 <li><code>Dr </code>: D cache reads (ie. memory reads)</li>
312 <li><code>D1mr</code>: D1 cache read misses</li>
313 <li><code>D2mr</code>: L2 cache data read misses</li>
314 <li><code>Dw </code>: D cache writes (ie. memory writes)</li>
315 <li><code>D1mw</code>: D1 cache write misses</li>
316 <li><code>D2mw</code>: L2 cache data write misses</li>
317 </ul><p>
318 Note that D1 total accesses is given by <code>D1mr</code> +
319 <code>D1mw</code>, and that L2 total accesses is given by
320 <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>
321
322 <li>Events shown: the events shown (a subset of events gathered). This can
323 be adjusted with the <code>--show</code> option.</li><p>
324
325 <li>Event sort order: the sort order in which functions are shown. For
326 example, in this case the functions are sorted from highest
327 <code>Ir</code> counts to lowest. If two functions have identical
328 <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
329 counts, and so on. This order can be adjusted with the
330 <code>--sort</code> option.<p>
331
332 Note that this dictates the order the functions appear. It is <b>not</b>
333 the order in which the columns appear; that is dictated by the "events
334 shown" line (and can be changed with the <code>--show</code> option).
335 </li><p>
336
337 <li>Threshold: <code>cg_annotate</code> by default omits functions
338 that cause very low numbers of misses to avoid drowning you in
339 information. In this case, cg_annotate shows summaries the
340 functions that account for 99% of the <code>Ir</code> counts;
341 <code>Ir</code> is chosen as the threshold event since it is the
342 primary sort event. The threshold can be adjusted with the
343 <code>--threshold</code> option.</li><p>
344
345 <li>Chosen for annotation: names of files specified manually for annotation;
346 in this case none.</li><p>
347
348 <li>Auto-annotation: whether auto-annotation was requested via the
349 <code>--auto=yes</code> option. In this case no.</li><p>
350</ul>
351
352Then follows summary statistics for the whole program. These are similar
353to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>
354
355Then follows function-by-function statistics. Each function is
356identified by a <code>file_name:function_name</code> pair. If a column
357contains only a dot it means the function never performs
358that event (eg. the third row shows that <code>strcmp()</code>
359contains no instructions that write to memory). The name
360<code>???</code> is used if the the file name and/or function name
361could not be determined from debugging information. If most of the
362entries have the form <code>???:???</code> the program probably wasn't
363compiled with <code>-g</code>. If any code was invalidated (either due to
364self-modifying code or unloading of shared objects) its counts are aggregated
365into a single cost centre written as <code>(discarded):(discarded)</code>.<p>
366
367It is worth noting that functions will come from three types of source files:
368<ol>
369 <li> From the profiled program (<code>concord.c</code> in this example).</li>
370 <li>From libraries (eg. <code>getc.c</code>)</li>
371 <li>From Valgrind's implementation of some libc functions (eg.
372 <code>vg_clientmalloc.c:malloc</code>). These are recognisable because
373 the filename begins with <code>vg_</code>, and is probably one of
374 <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
375 <code>vg_mylibc.c</code>.
376 </li>
377</ol>
378
379There are two ways to annotate source files -- by choosing them
380manually, or with the <code>--auto=yes</code> option. To do it
381manually, just specify the filenames as arguments to
382<code>cg_annotate</code>. For example, the output from running
383<code>cg_annotate concord.c</code> for our example produces the same
384output as above followed by an annotated version of
385<code>concord.c</code>, a section of which looks like:
386
387<pre>
388--------------------------------------------------------------------------------
389-- User-annotated source: concord.c
390--------------------------------------------------------------------------------
391Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
392
393[snip]
394
395 . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
396 3 1 1 . . . 1 0 0 {
397 . . . . . . . . . FILE *file_ptr;
398 . . . . . . . . . Word_Info *data;
399 1 0 0 . . . 1 1 1 int line = 1, i;
400 . . . . . . . . .
401 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
402 . . . . . . . . .
403 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
404 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
405 . . . . . . . . .
406 . . . . . . . . . /* Open file, check it. */
407 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
408 2 0 0 1 0 0 . . . if (!(file_ptr)) {
409 . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
410 1 1 1 . . . . . . exit(EXIT_FAILURE);
411 . . . . . . . . . }
412 . . . . . . . . .
413 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
414 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table);
415 . . . . . . . . .
416 4 0 0 1 0 0 2 0 0 free(data);
417 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
418 3 0 0 2 0 0 . . . }
419</pre>
420
421(Although column widths are automatically minimised, a wide terminal is clearly
422useful.)<p>
423
424Each source file is clearly marked (<code>User-annotated source</code>) as
425having been chosen manually for annotation. If the file was found in one of
426the directories specified with the <code>-I</code>/<code>--include</code>
427option, the directory and file are both given.<p>
428
429Each line is annotated with its event counts. Events not applicable for a line
430are represented by a `.'; this is useful for distinguishing between an event
431which cannot happen, and one which can but did not.<p>
432
433Sometimes only a small section of a source file is executed. To minimise
434uninteresting output, Valgrind only shows annotated lines and lines within a
435small distance of annotated lines. Gaps are marked with the line numbers so
436you know which part of a file the shown code comes from, eg:
437
438<pre>
439(figures and code for line 704)
440-- line 704 ----------------------------------------
441-- line 878 ----------------------------------------
442(figures and code for line 878)
443</pre>
444
445The amount of context to show around annotated lines is controlled by the
446<code>--context</code> option.<p>
447
448To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
449cg_annotate will automatically annotate every source file it can find that is
450mentioned in the function-by-function summary. Therefore, the files chosen for
451auto-annotation are affected by the <code>--sort</code> and
452<code>--threshold</code> options. Each source file is clearly marked
453(<code>Auto-annotated source</code>) as being chosen automatically. Any files
454that could not be found are mentioned at the end of the output, eg:
455
456<pre>
457--------------------------------------------------------------------------------
458The following files chosen for auto-annotation could not be found:
459--------------------------------------------------------------------------------
460 getc.c
461 ctype.c
462 ../sysdeps/generic/lockfile.c
463</pre>
464
465This is quite common for library files, since libraries are usually compiled
466with debugging information, but the source files are often not present on a
467system. If a file is chosen for annotation <b>both</b> manually and
468automatically, it is marked as <code>User-annotated source</code>.
469
470Use the <code>-I/--include</code> option to tell Valgrind where to look for
471source files if the filenames found from the debugging information aren't
472specific enough.
473
474Beware that cg_annotate can take some time to digest large
475<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
476beware that auto-annotation can produce a lot of output if your program is
477large!
478
479
sewardjf555ac72002-11-18 00:07:28 +0000480<h3>4.8&nbsp; Annotating assembler programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000481
482Valgrind can annotate assembler programs too, or annotate the
483assembler generated for your C program. Sometimes this is useful for
484understanding what is really happening when an interesting line of C
485code is translated into multiple instructions.<p>
486
487To do this, you just need to assemble your <code>.s</code> files with
488assembler-level debug information. gcc doesn't do this, but you can
489use the GNU assembler with the <code>--gstabs</code> option to
490generate object files with this information, eg:
491
492<blockquote><code>as --gstabs foo.s</code></blockquote>
493
494You can then profile and annotate source files in the same way as for C/C++
495programs.
496
497
sewardjf555ac72002-11-18 00:07:28 +0000498<h3>4.9&nbsp; <code>cg_annotate</code> options</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000499<ul>
500 <li><code>--<i>pid</i></code></li><p>
501
502 Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
503 Not actually an option -- it is required.
504
505 <li><code>-h, --help</code></li><p>
506 <li><code>-v, --version</code><p>
507
508 Help and version, as usual.</li>
509
510 <li><code>--sort=A,B,C</code> [default: order in
511 <code>cachegrind.out.<i>pid</i></code>]<p>
512 Specifies the events upon which the sorting of the function-by-function
513 entries will be based. Useful if you want to concentrate on eg. I cache
514 misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
515 (<code>--sort=D1mr,D2mr</code>), or L2 misses
516 (<code>--sort=D2mr,I2mr</code>).</li><p>
517
518 <li><code>--show=A,B,C</code> [default: all, using order in
519 <code>cachegrind.out.<i>pid</i></code>]<p>
520 Specifies which events to show (and the column order). Default is to use
521 all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
522 the order in the file).</li><p>
523
524 <li><code>--threshold=X</code> [default: 99%] <p>
525 Sets the threshold for the function-by-function summary. Functions are
526 shown that account for more than X% of the primary sort event. If
527 auto-annotating, also affects which files are annotated.
528
529 Note: thresholds can be set for more than one of the events by appending
530 any events for the <code>--sort</code> option with a colon and a number
531 (no spaces, though). E.g. if you want to see the functions that cover
532 99% of L2 read misses and 99% of L2 write misses, use this option:
533
534 <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
535 </li><p>
536
537 <li><code>--auto=no</code> [default]<br>
538 <code>--auto=yes</code> <p>
539 When enabled, automatically annotates every file that is mentioned in the
540 function-by-function summary that can be found. Also gives a list of
541 those that couldn't be found.
542
543 <li><code>--context=N</code> [default: 8]<p>
544 Print N lines of context before and after each annotated line. Avoids
545 printing large sections of source files that were not executed. Use a
546 large number (eg. 10,000) to show all source lines.
547 </li><p>
548
549 <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
550 [default: empty string]<p>
551 Adds a directory to the list in which to search for files. Multiple
552 -I/--include options can be given to add multiple directories.
553</ul>
554
555
sewardjf555ac72002-11-18 00:07:28 +0000556<h3>4.10&nbsp; Warnings</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000557There are a couple of situations in which cg_annotate issues warnings.
558
559<ul>
560 <li>If a source file is more recent than the
561 <code>cachegrind.out.<i>pid</i></code> file. This is because the
562 information in <code>cachegrind.out.<i>pid</i></code> is only recorded
563 with line numbers, so if the line numbers change at all in the source
564 (eg. lines added, deleted, swapped), any annotations will be
565 incorrect.<p>
566
567 <li>If information is recorded about line numbers past the end of a file.
568 This can be caused by the above problem, ie. shortening the source file
569 while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
570 happens, the figures for the bogus lines are printed anyway (clearly
571 marked as bogus) in case they are important.</li><p>
572</ul>
573
574
sewardjf555ac72002-11-18 00:07:28 +0000575<h3>4.11&nbsp; Things to watch out for</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000576Some odd things that can occur during annotation:
577
578<ul>
579 <li>If annotating at the assembler level, you might see something like this:
580
581 <pre>
582 1 0 0 . . . . . . leal -12(%ebp),%eax
583 1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
584 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
585 . . . . . . . . . .align 4,0x90
586 1 0 0 . . . . . . movl $.LnrB,%eax
587 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
588 </pre>
589
590 How can the third instruction be executed twice when the others are
591 executed only once? As it turns out, it isn't. Here's a dump of the
592 executable, using <code>objdump -d</code>:
593
594 <pre>
595 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
596 8048f28: 89 43 54 mov %eax,0x54(%ebx)
597 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
598 8048f32: 89 f6 mov %esi,%esi
599 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
600 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
601 </pre>
602
603 Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
604 come from? The GNU assembler inserted it to serve as the two bytes of
605 padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
606 a four-byte boundary, but pretended it didn't exist when adding debug
607 information. Thus when Valgrind reads the debug info it thinks that the
608 <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
609 range 0x8048f2b--0x804833 by itself, and attributes the counts for the
610 <code>mov %esi,%esi</code> to it.<p>
611 </li>
612
613 <li>Inlined functions can cause strange results in the function-by-function
614 summary. If a function <code>inline_me()</code> is defined in
615 <code>foo.h</code> and inlined in the functions <code>f1()</code>,
616 <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
617 not be a <code>foo.h:inline_me()</code> function entry. Instead, there
618 will be separate function entries for each inlining site, ie.
619 <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
620 <code>foo.h:f3()</code>. To find the total counts for
621 <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>
622
623 The reason for this is that although the debug info output by gcc
624 indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
625 doesn't indicate the name of the function in <code>foo.h</code>, so
626 Valgrind keeps using the old one.<p>
627
628 <li>Sometimes, the same filename might be represented with a relative name
629 and with an absolute name in different parts of the debug info, eg:
630 <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
631 case, if you use auto-annotation, the file will be annotated twice with
632 the counts split between the two.<p>
633 </li>
634
635 <li>Files with more than 65,535 lines cause difficulties for the stabs debug
636 info reader. This is because the line number in the <code>struct
637 nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
638 value. Valgrind can handle some files with more than 65,535 lines
639 correctly by making some guesses to identify line number overflows. But
640 some cases are beyond it, in which case you'll get a warning message
641 explaining that annotations for the file might be incorrect.<p>
642 </li>
643
644 <li>If you compile some files with <code>-g</code> and some without, some
645 events that take place in a file without debug info could be attributed
646 to the last line of a file with debug info (whichever one gets placed
647 before the non-debug-info file in the executable).<p>
648 </li>
649</ul>
650
651This list looks long, but these cases should be fairly rare.<p>
652
653Note: stabs is not an easy format to read. If you come across bizarre
654annotations that look like might be caused by a bug in the stabs reader,
655please let us know.<p>
656
657
sewardjf555ac72002-11-18 00:07:28 +0000658<h3>4.12&nbsp; Accuracy</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000659Valgrind's cache profiling has a number of shortcomings:
660
661<ul>
662 <li>It doesn't account for kernel activity -- the effect of system calls on
663 the cache contents is ignored.</li><p>
664
665 <li>It doesn't account for other process activity (although this is probably
666 desirable when considering a single program).</li><p>
667
668 <li>It doesn't account for virtual-to-physical address mappings; hence the
669 entire simulation is not a true representation of what's happening in the
670 cache.</li><p>
671
672 <li>It doesn't account for cache misses not visible at the instruction level,
673 eg. those arising from TLB misses, or speculative execution.</li><p>
674
sewardja9a2dcf2002-11-11 00:20:07 +0000675 <li>Valgrind's custom threads implementation will schedule threads
njn3e884182003-04-15 13:03:23 +0000676 differently to the standard one. This could warp the results for
sewardja9a2dcf2002-11-11 00:20:07 +0000677 threaded programs.
678 </li><p>
679
680 <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
681 will incorrectly be counted as doing a data read if both the arguments
682 are registers, eg:
683
684 <blockquote><code>btsl %eax, %edx</code></blockquote>
685
686 This should only happen rarely.
687 </li><p>
688
689 <li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
690 <code>fsave</code>) are treated as though they only access 16 bytes.
691 These instructions seem to be rare so hopefully this won't affect
692 accuracy much.
693 </li><p>
694</ul>
695
696Another thing worth nothing is that results are very sensitive. Changing the
697size of the <code>valgrind.so</code> file, the size of the program being
698profiled, or even the length of its name can perturb the results. Variations
699will be small, but don't expect perfectly repeatable results if your program
700changes at all.<p>
701
702While these factors mean you shouldn't trust the results to be super-accurate,
703hopefully they should be close enough to be useful.<p>
704
705
sewardjf555ac72002-11-18 00:07:28 +0000706<h3>4.13&nbsp; Todo</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000707<ul>
708 <li>Program start-up/shut-down calls a lot of functions that aren't
709 interesting and just complicate the output. Would be nice to exclude
710 these somehow.</li>
711 <p>
712</ul>
sewardja9a2dcf2002-11-11 00:20:07 +0000713</body>
714</html>
715