blob: 545748ae9cc41465252cae65036a6b02f7a800c6 [file] [log] [blame]
sewardja9a2dcf2002-11-11 00:20:07 +00001<html>
2 <head>
sewardjf555ac72002-11-18 00:07:28 +00003 <title>Cachegrind: a cache-miss profiler</title>
sewardja9a2dcf2002-11-11 00:20:07 +00004 </head>
5
sewardjf555ac72002-11-18 00:07:28 +00006<body>
7<a name="cg-top"></a>
8<h2>4&nbsp; <b>Cachegrind</b>: a cache-miss profiler</h2>
sewardja9a2dcf2002-11-11 00:20:07 +00009
nethercote137bc552003-11-14 17:47:54 +000010To use this tool, you must specify <code>--tool=cachegrind</code>
sewardjf555ac72002-11-18 00:07:28 +000011on the Valgrind command line.
sewardja9a2dcf2002-11-11 00:20:07 +000012
13<p>
sewardjf555ac72002-11-18 00:07:28 +000014Detailed technical documentation on how Cachegrind works is available
15<A HREF="cg_techdocs.html">here</A>. If you want to know how
16to <b>use</b> it, you only need to read this page.
sewardja9a2dcf2002-11-11 00:20:07 +000017
18
19<a name="cache"></a>
sewardjf555ac72002-11-18 00:07:28 +000020<h3>4.1&nbsp; Cache profiling</h3>
21Cachegrind is a tool for doing cache simulations and annotating your source
sewardja9a2dcf2002-11-11 00:20:07 +000022line-by-line with the number of cache misses. In particular, it records:
23<ul>
24 <li>L1 instruction cache reads and misses;
25 <li>L1 data cache reads and read misses, writes and write misses;
26 <li>L2 unified cache reads and read misses, writes and writes misses.
27</ul>
28On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
29and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
30very useful for improving the performance of your program.<p>
31
32Also, since one instruction cache read is performed per instruction executed,
33you can find out how many instructions are executed per line, which can be
34useful for traditional profiling and test coverage.<p>
35
36Any feedback, bug-fixes, suggestions, etc, welcome.
37
38
sewardjf555ac72002-11-18 00:07:28 +000039<h3>4.2&nbsp; Overview</h3>
sewardja9a2dcf2002-11-11 00:20:07 +000040First off, as for normal Valgrind use, you probably want to compile with
41debugging info (the <code>-g</code> flag). But by contrast with normal
42Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
43should profile your program as it will be normally run.
44
45The two steps are:
46<ol>
nethercote137bc552003-11-14 17:47:54 +000047 <li>Run your program with <code>valgrind --tool=cachegrind</code> in front of
sewardja9a2dcf2002-11-11 00:20:07 +000048 the normal command line invocation. When the program finishes,
sewardjf555ac72002-11-18 00:07:28 +000049 Cachegrind will print summary cache statistics. It also collects
sewardja9a2dcf2002-11-11 00:20:07 +000050 line-by-line information in a file
51 <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
52 is the program's process id.
53 <p>
54 This step should be done every time you want to collect
55 information about a new program, a changed program, or about the
56 same program with different input.
njn3e87f7e2003-04-08 11:08:45 +000057 </li><p>
sewardja9a2dcf2002-11-11 00:20:07 +000058 <li>Generate a function-by-function summary, and possibly annotate
sewardjf555ac72002-11-18 00:07:28 +000059 source files, using the supplied
60 <code>cg_annotate</code> program. Source files to annotate can be
sewardja9a2dcf2002-11-11 00:20:07 +000061 specified manually, or manually on the command line, or
62 "interesting" source files can be annotated automatically with
63 the <code>--auto=yes</code> option. You can annotate C/C++
64 files or assembly language files equally easily.
65 <p>
66 This step can be performed as many times as you like for each
67 Step 2. You may want to do multiple annotations showing
njn3e87f7e2003-04-08 11:08:45 +000068 different information each time.
69 </li><p>
sewardja9a2dcf2002-11-11 00:20:07 +000070</ol>
71
njn3e87f7e2003-04-08 11:08:45 +000072The steps are described in detail in the following sections.
sewardja9a2dcf2002-11-11 00:20:07 +000073
74
njn3e87f7e2003-04-08 11:08:45 +000075<h3>4.3&nbsp; Cache simulation specifics</h3>
sewardja9a2dcf2002-11-11 00:20:07 +000076
77Cachegrind uses a simulation for a machine with a split L1 cache and a unified
78L2 cache. This configuration is used for all (modern) x86-based machines we
79are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
80ancient history now.<p>
81
82The more specific characteristics of the simulation are as follows.
83
84<ul>
85 <li>Write-allocate: when a write miss occurs, the block written to
86 is brought into the D1 cache. Most modern caches have this
njn3e87f7e2003-04-08 11:08:45 +000087 property.<p>
88 </li>
89 <p>
sewardja9a2dcf2002-11-11 00:20:07 +000090 <li>Bit-selection hash function: the line(s) in the cache to which a
91 memory block maps is chosen by the middle bits M--(M+N-1) of the
92 byte address, where:
93 <ul>
94 <li>&nbsp;line size = 2^M bytes&nbsp;</li>
95 <li>(cache size / line size) = 2^N bytes</li>
njn3e87f7e2003-04-08 11:08:45 +000096 </ul>
97 </li>
98 <p>
sewardja9a2dcf2002-11-11 00:20:07 +000099 <li>Inclusive L2 cache: the L2 cache replicates all the entries of
100 the L1 cache. This is standard on Pentium chips, but AMD
101 Athlons use an exclusive L2 cache that only holds blocks evicted
njn3e87f7e2003-04-08 11:08:45 +0000102 from L1. Ditto AMD Durons and most modern VIAs.</li>
sewardja9a2dcf2002-11-11 00:20:07 +0000103</ul>
104
105The cache configuration simulated (cache size, associativity and line size) is
106determined automagically using the CPUID instruction. If you have an old
107machine that (a) doesn't support the CPUID instruction, or (b) supports it in
108an early incarnation that doesn't give any cache information, then Cachegrind
109will fall back to using a default configuration (that of a model 3/4 Athlon).
110Cachegrind will tell you if this happens. You can manually specify one, two or
111all three levels (I1/D1/L2) of the cache from the command line using the
njn3e87f7e2003-04-08 11:08:45 +0000112<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.
sewardja9a2dcf2002-11-11 00:20:07 +0000113
njn3e87f7e2003-04-08 11:08:45 +0000114<p>
sewardja9a2dcf2002-11-11 00:20:07 +0000115Other noteworthy behaviour:
116
117<ul>
118 <li>References that straddle two cache lines are treated as follows:
119 <ul>
120 <li>If both blocks hit --&gt; counted as one hit</li>
121 <li>If one block hits, the other misses --&gt; counted as one miss</li>
122 <li>If both blocks miss --&gt; counted as one miss (not two)</li>
njn3e87f7e2003-04-08 11:08:45 +0000123 </ul>
124 </li>
sewardja9a2dcf2002-11-11 00:20:07 +0000125
126 <li>Instructions that modify a memory location (eg. <code>inc</code> and
127 <code>dec</code>) are counted as doing just a read, ie. a single data
128 reference. This may seem strange, but since the write can never cause a
129 miss (the read guarantees the block is in the cache) it's not very
njn3e87f7e2003-04-08 11:08:45 +0000130 interesting.
131 <p>
sewardja9a2dcf2002-11-11 00:20:07 +0000132 Thus it measures not the number of times the data cache is accessed, but
133 the number of times a data cache miss could occur.<p>
134 </li>
135</ul>
136
137If you are interested in simulating a cache with different properties, it is
138not particularly hard to write your own cache simulator, or to modify the
139existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
140<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
141interested to hear from anyone who does.
142
sewardjf555ac72002-11-18 00:07:28 +0000143
sewardja9a2dcf2002-11-11 00:20:07 +0000144<a name="profile"></a>
sewardjf555ac72002-11-18 00:07:28 +0000145<h3>4.4&nbsp; Profiling programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000146
nethercote137bc552003-11-14 17:47:54 +0000147To gather cache profiling information about the program <code>ls -l</code>,
148invoke Cachegrind like this:
sewardja9a2dcf2002-11-11 00:20:07 +0000149
nethercote137bc552003-11-14 17:47:54 +0000150<blockquote><code>valgrind --tool=cachegrind ls -l</code></blockquote>
sewardja9a2dcf2002-11-11 00:20:07 +0000151
152The program will execute (slowly). Upon completion, summary statistics
153that look like this will be printed:
154
155<pre>
156==31751== I refs: 27,742,716
157==31751== I1 misses: 276
158==31751== L2 misses: 275
159==31751== I1 miss rate: 0.0%
160==31751== L2i miss rate: 0.0%
161==31751==
162==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
163==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
164==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
165==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
166==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
167==31751==
168==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
169==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
170</pre>
171
172Cache accesses for instruction fetches are summarised first, giving the
173number of fetches made (this is the number of instructions executed, which
174can be useful to know in its own right), the number of I1 misses, and the
njn3e87f7e2003-04-08 11:08:45 +0000175number of L2 instruction (<code>L2i</code>) misses.
176<p>
sewardja9a2dcf2002-11-11 00:20:07 +0000177Cache accesses for data follow. The information is similar to that of the
178instruction fetches, except that the values are also shown split between reads
179and writes (note each row's <code>rd</code> and <code>wr</code> values add up
njn3e87f7e2003-04-08 11:08:45 +0000180to the row's total).
181<p>
182Combined instruction and data figures for the L2 cache follow that.
sewardja9a2dcf2002-11-11 00:20:07 +0000183
184
sewardjf555ac72002-11-18 00:07:28 +0000185<h3>4.5&nbsp; Output file</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000186
187As well as printing summary information, Cachegrind also writes
188line-by-line cache profiling information to a file named
189<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
190best interpreted by the accompanying program <code>cg_annotate</code>,
191described in the next section.
192<p>
193Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
194<ul>
nethercote137bc552003-11-14 17:47:54 +0000195 <li>It is written every time Cachegrind
sewardja9a2dcf2002-11-11 00:20:07 +0000196 is run, and will overwrite any existing
197 <code>cachegrind.out.<i>pid</i></code> in the current directory (but
198 that won't happen very often because it takes some time for process ids
njn3e87f7e2003-04-08 11:08:45 +0000199 to be recycled).</li><p>
sewardja9a2dcf2002-11-11 00:20:07 +0000200 <li>It can be huge: <code>ls -l</code> generates a file of about
201 350KB. Browsing a few files and web pages with a Konqueror
202 built with full debugging information generates a file
203 of around 15 MB.</li>
204</ul>
205
206Note that older versions of Cachegrind used a log file named
207<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
sewardjf555ac72002-11-18 00:07:28 +0000208The suffix serves two purposes. Firstly, it means you don't have to
209rename old log files that you don't want to overwrite. Secondly, and
210more importantly, it allows correct profiling with the
211<code>--trace-children=yes</code> option of programs that spawn child
212processes.
213
sewardja9a2dcf2002-11-11 00:20:07 +0000214
215<a name="profileflags"></a>
sewardjf555ac72002-11-18 00:07:28 +0000216<h3>4.6&nbsp; Cachegrind options</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000217
sewardjf555ac72002-11-18 00:07:28 +0000218Cache-simulation specific options are:
sewardja9a2dcf2002-11-11 00:20:07 +0000219
220<ul>
221 <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
222 <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
223 <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
224 [default: uses CPUID for automagic cache configuration]<p>
225
226 Manually specifies the I1/D1/L2 cache configuration, where
227 <code>size</code> and <code>line_size</code> are measured in bytes. The
228 three items must be comma-separated, but with no spaces, eg:
229
230 <blockquote>
nethercote137bc552003-11-14 17:47:54 +0000231 <code>valgrind --tool=cachegrind --I1=65535,2,64</code>
sewardja9a2dcf2002-11-11 00:20:07 +0000232 </blockquote>
233
234 You can specify one, two or three of the I1/D1/L2 caches. Any level not
235 manually specified will be simulated using the configuration found in the
236 normal way (via the CPUID instruction, or failing that, via defaults).
237</ul>
238
239
240<a name="annotate"></a>
sewardjf555ac72002-11-18 00:07:28 +0000241<h3>4.7&nbsp; Annotating C/C++ programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000242
243Before using <code>cg_annotate</code>, it is worth widening your
244window to be at least 120-characters wide if possible, as the output
245lines can be quite long.
246<p>
247To get a function-by-function summary, run <code>cg_annotate
248--<i>pid</i></code> in a directory containing a
249<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
250is required so that <code>cg_annotate</code> knows which log file to use when
251several are present.
252<p>
253The output looks like this:
254
255<pre>
256--------------------------------------------------------------------------------
257I1 cache: 65536 B, 64 B, 2-way associative
258D1 cache: 65536 B, 64 B, 2-way associative
259L2 cache: 262144 B, 64 B, 8-way associative
260Command: concord vg_to_ucode.c
261Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
262Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
263Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
264Threshold: 99%
265Chosen for annotation:
266Auto-annotation: on
267
268--------------------------------------------------------------------------------
269Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
270--------------------------------------------------------------------------------
27127,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
272
273--------------------------------------------------------------------------------
274Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
275--------------------------------------------------------------------------------
2768,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
2775,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
2782,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
2792,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
2802,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
2811,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
282 897,991 51 51 897,831 95 30 62 1 1 ???:???
283 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
284 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
285 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
286 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
287 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
288 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
289 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
290 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
291 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
292 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
293 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
294</pre>
295
296First up is a summary of the annotation options:
297
298<ul>
299 <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the
300 configuration with which these results were obtained.</li><p>
301
302 <li>Command: the command line invocation of the program under
303 examination.</li><p>
304
305 <li>Events recorded: event abbreviations are:<p>
306 <ul>
307 <li><code>Ir </code>: I cache reads (ie. instructions executed)</li>
308 <li><code>I1mr</code>: I1 cache read misses</li>
309 <li><code>I2mr</code>: L2 cache instruction read misses</li>
310 <li><code>Dr </code>: D cache reads (ie. memory reads)</li>
311 <li><code>D1mr</code>: D1 cache read misses</li>
312 <li><code>D2mr</code>: L2 cache data read misses</li>
313 <li><code>Dw </code>: D cache writes (ie. memory writes)</li>
314 <li><code>D1mw</code>: D1 cache write misses</li>
315 <li><code>D2mw</code>: L2 cache data write misses</li>
316 </ul><p>
317 Note that D1 total accesses is given by <code>D1mr</code> +
318 <code>D1mw</code>, and that L2 total accesses is given by
319 <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>
320
321 <li>Events shown: the events shown (a subset of events gathered). This can
322 be adjusted with the <code>--show</code> option.</li><p>
323
324 <li>Event sort order: the sort order in which functions are shown. For
325 example, in this case the functions are sorted from highest
326 <code>Ir</code> counts to lowest. If two functions have identical
327 <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
328 counts, and so on. This order can be adjusted with the
329 <code>--sort</code> option.<p>
330
331 Note that this dictates the order the functions appear. It is <b>not</b>
332 the order in which the columns appear; that is dictated by the "events
333 shown" line (and can be changed with the <code>--show</code> option).
334 </li><p>
335
336 <li>Threshold: <code>cg_annotate</code> by default omits functions
337 that cause very low numbers of misses to avoid drowning you in
338 information. In this case, cg_annotate shows summaries the
339 functions that account for 99% of the <code>Ir</code> counts;
340 <code>Ir</code> is chosen as the threshold event since it is the
341 primary sort event. The threshold can be adjusted with the
342 <code>--threshold</code> option.</li><p>
343
344 <li>Chosen for annotation: names of files specified manually for annotation;
345 in this case none.</li><p>
346
347 <li>Auto-annotation: whether auto-annotation was requested via the
348 <code>--auto=yes</code> option. In this case no.</li><p>
349</ul>
350
351Then follows summary statistics for the whole program. These are similar
nethercote137bc552003-11-14 17:47:54 +0000352to the summary provided when running <code>valgrind --tool=cachegrind</code>.<p>
sewardja9a2dcf2002-11-11 00:20:07 +0000353
354Then follows function-by-function statistics. Each function is
355identified by a <code>file_name:function_name</code> pair. If a column
356contains only a dot it means the function never performs
357that event (eg. the third row shows that <code>strcmp()</code>
358contains no instructions that write to memory). The name
359<code>???</code> is used if the the file name and/or function name
360could not be determined from debugging information. If most of the
361entries have the form <code>???:???</code> the program probably wasn't
362compiled with <code>-g</code>. If any code was invalidated (either due to
363self-modifying code or unloading of shared objects) its counts are aggregated
364into a single cost centre written as <code>(discarded):(discarded)</code>.<p>
365
366It is worth noting that functions will come from three types of source files:
367<ol>
368 <li> From the profiled program (<code>concord.c</code> in this example).</li>
369 <li>From libraries (eg. <code>getc.c</code>)</li>
370 <li>From Valgrind's implementation of some libc functions (eg.
371 <code>vg_clientmalloc.c:malloc</code>). These are recognisable because
372 the filename begins with <code>vg_</code>, and is probably one of
373 <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
374 <code>vg_mylibc.c</code>.
375 </li>
376</ol>
377
378There are two ways to annotate source files -- by choosing them
379manually, or with the <code>--auto=yes</code> option. To do it
380manually, just specify the filenames as arguments to
381<code>cg_annotate</code>. For example, the output from running
382<code>cg_annotate concord.c</code> for our example produces the same
383output as above followed by an annotated version of
384<code>concord.c</code>, a section of which looks like:
385
386<pre>
387--------------------------------------------------------------------------------
388-- User-annotated source: concord.c
389--------------------------------------------------------------------------------
390Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
391
392[snip]
393
394 . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
395 3 1 1 . . . 1 0 0 {
396 . . . . . . . . . FILE *file_ptr;
397 . . . . . . . . . Word_Info *data;
398 1 0 0 . . . 1 1 1 int line = 1, i;
399 . . . . . . . . .
400 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
401 . . . . . . . . .
402 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
403 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
404 . . . . . . . . .
405 . . . . . . . . . /* Open file, check it. */
406 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
407 2 0 0 1 0 0 . . . if (!(file_ptr)) {
408 . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
409 1 1 1 . . . . . . exit(EXIT_FAILURE);
410 . . . . . . . . . }
411 . . . . . . . . .
412 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
413 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table);
414 . . . . . . . . .
415 4 0 0 1 0 0 2 0 0 free(data);
416 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
417 3 0 0 2 0 0 . . . }
418</pre>
419
420(Although column widths are automatically minimised, a wide terminal is clearly
421useful.)<p>
422
423Each source file is clearly marked (<code>User-annotated source</code>) as
424having been chosen manually for annotation. If the file was found in one of
425the directories specified with the <code>-I</code>/<code>--include</code>
426option, the directory and file are both given.<p>
427
428Each line is annotated with its event counts. Events not applicable for a line
429are represented by a `.'; this is useful for distinguishing between an event
430which cannot happen, and one which can but did not.<p>
431
432Sometimes only a small section of a source file is executed. To minimise
433uninteresting output, Valgrind only shows annotated lines and lines within a
434small distance of annotated lines. Gaps are marked with the line numbers so
435you know which part of a file the shown code comes from, eg:
436
437<pre>
438(figures and code for line 704)
439-- line 704 ----------------------------------------
440-- line 878 ----------------------------------------
441(figures and code for line 878)
442</pre>
443
444The amount of context to show around annotated lines is controlled by the
445<code>--context</code> option.<p>
446
447To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
448cg_annotate will automatically annotate every source file it can find that is
449mentioned in the function-by-function summary. Therefore, the files chosen for
450auto-annotation are affected by the <code>--sort</code> and
451<code>--threshold</code> options. Each source file is clearly marked
452(<code>Auto-annotated source</code>) as being chosen automatically. Any files
453that could not be found are mentioned at the end of the output, eg:
454
455<pre>
456--------------------------------------------------------------------------------
457The following files chosen for auto-annotation could not be found:
458--------------------------------------------------------------------------------
459 getc.c
460 ctype.c
461 ../sysdeps/generic/lockfile.c
462</pre>
463
464This is quite common for library files, since libraries are usually compiled
465with debugging information, but the source files are often not present on a
466system. If a file is chosen for annotation <b>both</b> manually and
467automatically, it is marked as <code>User-annotated source</code>.
468
469Use the <code>-I/--include</code> option to tell Valgrind where to look for
470source files if the filenames found from the debugging information aren't
471specific enough.
472
473Beware that cg_annotate can take some time to digest large
474<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
475beware that auto-annotation can produce a lot of output if your program is
476large!
477
478
sewardjf555ac72002-11-18 00:07:28 +0000479<h3>4.8&nbsp; Annotating assembler programs</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000480
481Valgrind can annotate assembler programs too, or annotate the
482assembler generated for your C program. Sometimes this is useful for
483understanding what is really happening when an interesting line of C
484code is translated into multiple instructions.<p>
485
486To do this, you just need to assemble your <code>.s</code> files with
487assembler-level debug information. gcc doesn't do this, but you can
488use the GNU assembler with the <code>--gstabs</code> option to
489generate object files with this information, eg:
490
491<blockquote><code>as --gstabs foo.s</code></blockquote>
492
493You can then profile and annotate source files in the same way as for C/C++
494programs.
495
496
sewardjf555ac72002-11-18 00:07:28 +0000497<h3>4.9&nbsp; <code>cg_annotate</code> options</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000498<ul>
499 <li><code>--<i>pid</i></code></li><p>
500
501 Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
502 Not actually an option -- it is required.
503
504 <li><code>-h, --help</code></li><p>
505 <li><code>-v, --version</code><p>
506
507 Help and version, as usual.</li>
508
509 <li><code>--sort=A,B,C</code> [default: order in
510 <code>cachegrind.out.<i>pid</i></code>]<p>
511 Specifies the events upon which the sorting of the function-by-function
512 entries will be based. Useful if you want to concentrate on eg. I cache
513 misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
514 (<code>--sort=D1mr,D2mr</code>), or L2 misses
515 (<code>--sort=D2mr,I2mr</code>).</li><p>
516
517 <li><code>--show=A,B,C</code> [default: all, using order in
518 <code>cachegrind.out.<i>pid</i></code>]<p>
519 Specifies which events to show (and the column order). Default is to use
520 all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
521 the order in the file).</li><p>
522
523 <li><code>--threshold=X</code> [default: 99%] <p>
524 Sets the threshold for the function-by-function summary. Functions are
525 shown that account for more than X% of the primary sort event. If
526 auto-annotating, also affects which files are annotated.
527
528 Note: thresholds can be set for more than one of the events by appending
529 any events for the <code>--sort</code> option with a colon and a number
530 (no spaces, though). E.g. if you want to see the functions that cover
531 99% of L2 read misses and 99% of L2 write misses, use this option:
532
533 <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
534 </li><p>
535
536 <li><code>--auto=no</code> [default]<br>
537 <code>--auto=yes</code> <p>
538 When enabled, automatically annotates every file that is mentioned in the
539 function-by-function summary that can be found. Also gives a list of
540 those that couldn't be found.
541
542 <li><code>--context=N</code> [default: 8]<p>
543 Print N lines of context before and after each annotated line. Avoids
544 printing large sections of source files that were not executed. Use a
545 large number (eg. 10,000) to show all source lines.
546 </li><p>
547
548 <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
549 [default: empty string]<p>
550 Adds a directory to the list in which to search for files. Multiple
551 -I/--include options can be given to add multiple directories.
552</ul>
553
554
sewardjf555ac72002-11-18 00:07:28 +0000555<h3>4.10&nbsp; Warnings</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000556There are a couple of situations in which cg_annotate issues warnings.
557
558<ul>
559 <li>If a source file is more recent than the
560 <code>cachegrind.out.<i>pid</i></code> file. This is because the
561 information in <code>cachegrind.out.<i>pid</i></code> is only recorded
562 with line numbers, so if the line numbers change at all in the source
563 (eg. lines added, deleted, swapped), any annotations will be
564 incorrect.<p>
565
566 <li>If information is recorded about line numbers past the end of a file.
567 This can be caused by the above problem, ie. shortening the source file
568 while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
569 happens, the figures for the bogus lines are printed anyway (clearly
570 marked as bogus) in case they are important.</li><p>
571</ul>
572
573
sewardjf555ac72002-11-18 00:07:28 +0000574<h3>4.11&nbsp; Things to watch out for</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000575Some odd things that can occur during annotation:
576
577<ul>
578 <li>If annotating at the assembler level, you might see something like this:
579
580 <pre>
581 1 0 0 . . . . . . leal -12(%ebp),%eax
582 1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
583 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
584 . . . . . . . . . .align 4,0x90
585 1 0 0 . . . . . . movl $.LnrB,%eax
586 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
587 </pre>
588
589 How can the third instruction be executed twice when the others are
590 executed only once? As it turns out, it isn't. Here's a dump of the
591 executable, using <code>objdump -d</code>:
592
593 <pre>
594 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
595 8048f28: 89 43 54 mov %eax,0x54(%ebx)
596 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
597 8048f32: 89 f6 mov %esi,%esi
598 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
599 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
600 </pre>
601
602 Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
603 come from? The GNU assembler inserted it to serve as the two bytes of
604 padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
605 a four-byte boundary, but pretended it didn't exist when adding debug
606 information. Thus when Valgrind reads the debug info it thinks that the
607 <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
608 range 0x8048f2b--0x804833 by itself, and attributes the counts for the
609 <code>mov %esi,%esi</code> to it.<p>
610 </li>
611
612 <li>Inlined functions can cause strange results in the function-by-function
613 summary. If a function <code>inline_me()</code> is defined in
614 <code>foo.h</code> and inlined in the functions <code>f1()</code>,
615 <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
616 not be a <code>foo.h:inline_me()</code> function entry. Instead, there
617 will be separate function entries for each inlining site, ie.
618 <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
619 <code>foo.h:f3()</code>. To find the total counts for
620 <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>
621
622 The reason for this is that although the debug info output by gcc
623 indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
624 doesn't indicate the name of the function in <code>foo.h</code>, so
625 Valgrind keeps using the old one.<p>
626
627 <li>Sometimes, the same filename might be represented with a relative name
628 and with an absolute name in different parts of the debug info, eg:
629 <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
630 case, if you use auto-annotation, the file will be annotated twice with
631 the counts split between the two.<p>
632 </li>
633
634 <li>Files with more than 65,535 lines cause difficulties for the stabs debug
635 info reader. This is because the line number in the <code>struct
636 nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
637 value. Valgrind can handle some files with more than 65,535 lines
638 correctly by making some guesses to identify line number overflows. But
639 some cases are beyond it, in which case you'll get a warning message
640 explaining that annotations for the file might be incorrect.<p>
641 </li>
642
643 <li>If you compile some files with <code>-g</code> and some without, some
644 events that take place in a file without debug info could be attributed
645 to the last line of a file with debug info (whichever one gets placed
646 before the non-debug-info file in the executable).<p>
647 </li>
648</ul>
649
650This list looks long, but these cases should be fairly rare.<p>
651
652Note: stabs is not an easy format to read. If you come across bizarre
653annotations that look like might be caused by a bug in the stabs reader,
654please let us know.<p>
655
656
sewardjf555ac72002-11-18 00:07:28 +0000657<h3>4.12&nbsp; Accuracy</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000658Valgrind's cache profiling has a number of shortcomings:
659
660<ul>
661 <li>It doesn't account for kernel activity -- the effect of system calls on
662 the cache contents is ignored.</li><p>
663
664 <li>It doesn't account for other process activity (although this is probably
665 desirable when considering a single program).</li><p>
666
667 <li>It doesn't account for virtual-to-physical address mappings; hence the
668 entire simulation is not a true representation of what's happening in the
669 cache.</li><p>
670
671 <li>It doesn't account for cache misses not visible at the instruction level,
672 eg. those arising from TLB misses, or speculative execution.</li><p>
673
sewardja9a2dcf2002-11-11 00:20:07 +0000674 <li>Valgrind's custom threads implementation will schedule threads
njn3e884182003-04-15 13:03:23 +0000675 differently to the standard one. This could warp the results for
sewardja9a2dcf2002-11-11 00:20:07 +0000676 threaded programs.
677 </li><p>
678
679 <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
680 will incorrectly be counted as doing a data read if both the arguments
681 are registers, eg:
682
683 <blockquote><code>btsl %eax, %edx</code></blockquote>
684
685 This should only happen rarely.
686 </li><p>
687
688 <li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
689 <code>fsave</code>) are treated as though they only access 16 bytes.
690 These instructions seem to be rare so hopefully this won't affect
691 accuracy much.
692 </li><p>
693</ul>
694
695Another thing worth nothing is that results are very sensitive. Changing the
696size of the <code>valgrind.so</code> file, the size of the program being
697profiled, or even the length of its name can perturb the results. Variations
698will be small, but don't expect perfectly repeatable results if your program
699changes at all.<p>
700
701While these factors mean you shouldn't trust the results to be super-accurate,
702hopefully they should be close enough to be useful.<p>
703
704
sewardjf555ac72002-11-18 00:07:28 +0000705<h3>4.13&nbsp; Todo</h3>
sewardja9a2dcf2002-11-11 00:20:07 +0000706<ul>
707 <li>Program start-up/shut-down calls a lot of functions that aren't
708 interesting and just complicate the output. Would be nice to exclude
709 these somehow.</li>
710 <p>
711</ul>
sewardja9a2dcf2002-11-11 00:20:07 +0000712</body>
713</html>
714