<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="cg-manual" xreflabel="Cachegrind: a cache-miss profiler">
<title>Cachegrind: a cache profiler</title>
9
<sect1 id="cg-manual.cache" xreflabel="Cache profiling">
<title>Cache profiling</title>

<para>To use this tool, you must specify
<computeroutput>--tool=cachegrind</computeroutput> on the
Valgrind command line.</para>

<para>Cachegrind is a tool for doing cache simulations and
annotating your source line-by-line with the number of cache
misses.  In particular, it records:</para>
<itemizedlist>
  <listitem>
    <para>L1 instruction cache reads and misses;</para>
  </listitem>
  <listitem>
    <para>L1 data cache reads and read misses, writes and write
    misses;</para>
  </listitem>
  <listitem>
    <para>L2 unified cache reads and read misses, writes and
    write misses.</para>
  </listitem>
</itemizedlist>
33
<para>On a modern machine, an L1 miss will typically cost
around 10 cycles, and an L2 miss can cost as much as 200
cycles.  Detailed cache profiling can be very useful for improving
the performance of your program.</para>

<para>Also, since one instruction cache read is performed per
instruction executed, you can find out how many instructions are
executed per line, which can be useful for traditional profiling
and test coverage.</para>

<para>Feedback, bug fixes and suggestions are welcome.</para>
45
46
47
48<sect2 id="cg-manual.overview" xreflabel="Overview">
49<title>Overview</title>
50
<para>First off, as for normal Valgrind use, you probably want to
compile with debugging info (the
<computeroutput>-g</computeroutput> flag).  But by contrast with
normal Valgrind use, you probably <command>do</command> want to turn
optimisation on, since you should profile your program as it will
normally be run.</para>
57
58<para>The two steps are:</para>
59<orderedlist>
60 <listitem>
61 <para>Run your program with <computeroutput>valgrind
62 --tool=cachegrind</computeroutput> in front of the normal
63 command line invocation. When the program finishes,
64 Cachegrind will print summary cache statistics. It also
65 collects line-by-line information in a file
66 <computeroutput>cachegrind.out.pid</computeroutput>, where
67 <computeroutput>pid</computeroutput> is the program's process
68 id.</para>
69
70 <para>This step should be done every time you want to collect
71 information about a new program, a changed program, or about
72 the same program with different input.</para>
73 </listitem>
74
75 <listitem>
    <para>Generate a function-by-function summary, and possibly
    annotate source files, using the supplied
    <computeroutput>cg_annotate</computeroutput> program.  Source
    files to annotate can be specified manually on the command
    line, or "interesting" source files can be annotated
    automatically with the
    <computeroutput>--auto=yes</computeroutput> option.  You can
    annotate C/C++ files or assembly language files equally
    easily.</para>

    <para>This step can be performed as many times as you like
    for each Step 1.  You may want to do multiple annotations
    showing different information each time.</para>
89 </listitem>
90
91</orderedlist>
92
93<para>The steps are described in detail in the following
94sections.</para>
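
<para>As a rough sketch of the two steps, assuming a hypothetical
program <filename>myprog.c</filename> and a process id of 12345
(both are placeholders, not taken from a real run):</para>

<programlisting><![CDATA[
# Build with debugging info and optimisation on (file names are hypothetical)
gcc -g -O2 -o myprog myprog.c

# Step 1: run under Cachegrind; writes cachegrind.out.<pid>
valgrind --tool=cachegrind ./myprog

# Step 2: summarise, and annotate myprog.c (12345 stands in for the real pid)
cg_annotate --cachegrind-out-file=cachegrind.out.12345 myprog.c]]></programlisting>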
95
96</sect2>
97
98
<sect2 id="cache-sim" xreflabel="Cache simulation specifics">
<title>Cache simulation specifics</title>

<para>Cachegrind uses a simulation for a machine with a split L1
cache and a unified L2 cache.  This configuration is used for all
(modern) x86-based machines we are aware of.  Old Cyrix CPUs had
a unified I and D L1 cache, but they are ancient history
now.</para>
107
108<para>The more specific characteristics of the simulation are as
109follows.</para>
110
111<itemizedlist>
112
113 <listitem>
114 <para>Write-allocate: when a write miss occurs, the block
115 written to is brought into the D1 cache. Most modern caches
116 have this property.</para>
117 </listitem>
118
119 <listitem>
    <para>Bit-selection hash function: the line(s) in the cache
    to which a memory block maps is chosen by the middle bits
    M--(M+N-1) of the byte address, where:</para>
    <itemizedlist>
      <listitem>
        <para>line size = 2^M bytes</para>
      </listitem>
      <listitem>
        <para>(cache size / line size) = 2^N</para>
      </listitem>
    </itemizedlist>
131 </listitem>
132
133 <listitem>
134 <para>Inclusive L2 cache: the L2 cache replicates all the
135 entries of the L1 cache. This is standard on Pentium chips,
136 but AMD Athlons use an exclusive L2 cache that only holds
137 blocks evicted from L1. Ditto AMD Durons and most modern
138 VIAs.</para>
139 </listitem>
140
141</itemizedlist>
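
<para>The bit-selection rule above can be illustrated with a little
shell arithmetic.  This is only a sketch of the formula as stated
(it ignores associativity), and the function name
<computeroutput>line_index</computeroutput> is made up for the
example:</para>

<programlisting><![CDATA[
# line_index ADDR M N: extract bits M..(M+N-1) of a byte address,
# where line size = 2^M bytes and (cache size / line size) = 2^N.
line_index() {
  echo $(( ($1 >> $2) & ((1 << $3) - 1) ))
}

# 64-byte lines (M=6) in a 65536-byte cache (N=10)
line_index 0x12345 6 10    # prints 141]]></programlisting>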
142
143<para>The cache configuration simulated (cache size,
144associativity and line size) is determined automagically using
145the CPUID instruction. If you have an old machine that (a)
146doesn't support the CPUID instruction, or (b) supports it in an
147early incarnation that doesn't give any cache information, then
148Cachegrind will fall back to using a default configuration (that
149of a model 3/4 Athlon). Cachegrind will tell you if this
150happens. You can manually specify one, two or all three levels
151(I1/D1/L2) of the cache from the command line using the
152<computeroutput>--I1</computeroutput>,
153<computeroutput>--D1</computeroutput> and
154<computeroutput>--L2</computeroutput> options.</para>
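
<para>For instance, to override all three levels at once one might
write something like the following.  The sizes are simply those of
the example configuration that appears in the
<computeroutput>cg_annotate</computeroutput> output in this chapter,
and <filename>./myprog</filename> is a placeholder:</para>

<programlisting><![CDATA[
# size,associativity,line size for each level; no spaces around the commas
valgrind --tool=cachegrind \
  --I1=65536,2,64 --D1=65536,2,64 --L2=262144,8,64 ./myprog]]></programlisting>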
155
156
157<para>Other noteworthy behaviour:</para>
158
159<itemizedlist>
160 <listitem>
161 <para>References that straddle two cache lines are treated as
162 follows:</para>
163 <itemizedlist>
164 <listitem>
165 <para>If both blocks hit --&gt; counted as one hit</para>
166 </listitem>
167 <listitem>
168 <para>If one block hits, the other misses --&gt; counted
169 as one miss.</para>
170 </listitem>
171 <listitem>
172 <para>If both blocks miss --&gt; counted as one miss (not
173 two)</para>
174 </listitem>
175 </itemizedlist>
176 </listitem>
177
178 <listitem>
    <para>Instructions that modify a memory location
    (eg. <computeroutput>inc</computeroutput> and
    <computeroutput>dec</computeroutput>) are counted as doing
    just a read, ie. a single data reference.  This may seem
    strange, but since the write can never cause a miss (the read
    guarantees the block is in the cache) it's not very
    interesting.</para>

    <para>Thus Cachegrind measures not the number of times the data
    cache is accessed, but the number of times a data cache miss
    could occur.</para>
190 </listitem>
191
192</itemizedlist>
193
194<para>If you are interested in simulating a cache with different
195properties, it is not particularly hard to write your own cache
196simulator, or to modify the existing ones in
197<computeroutput>vg_cachesim_I1.c</computeroutput>,
198<computeroutput>vg_cachesim_D1.c</computeroutput>,
199<computeroutput>vg_cachesim_L2.c</computeroutput> and
200<computeroutput>vg_cachesim_gen.c</computeroutput>. We'd be
201interested to hear from anyone who does.</para>
202
203</sect2>
204
205</sect1>
206
207
208
209<sect1 id="cg-manual.profile" xreflabel="Profiling programs">
210<title>Profiling programs</title>
211
212<para>To gather cache profiling information about the program
213<computeroutput>ls -l</computeroutput>, invoke Cachegrind like
214this:</para>
215
216<programlisting><![CDATA[
217valgrind --tool=cachegrind ls -l]]></programlisting>
218
219<para>The program will execute (slowly). Upon completion,
220summary statistics that look like this will be printed:</para>
221
222<programlisting><![CDATA[
223==31751== I refs: 27,742,716
224==31751== I1 misses: 276
225==31751== L2 misses: 275
226==31751== I1 miss rate: 0.0%
227==31751== L2i miss rate: 0.0%
228==31751==
229==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
230==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
231==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
232==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
233==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
234==31751==
235==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
236==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)]]></programlisting>
237
238<para>Cache accesses for instruction fetches are summarised
239first, giving the number of fetches made (this is the number of
240instructions executed, which can be useful to know in its own
241right), the number of I1 misses, and the number of L2 instruction
242(<computeroutput>L2i</computeroutput>) misses.</para>
243
244<para>Cache accesses for data follow. The information is similar
245to that of the instruction fetches, except that the values are
246also shown split between reads and writes (note each row's
247<computeroutput>rd</computeroutput> and
248<computeroutput>wr</computeroutput> values add up to the row's
249total).</para>
250
251<para>Combined instruction and data figures for the L2 cache
252follow that.</para>
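
<para>As a quick check on how the figures in the sample summary fit
together, the sums can be reproduced with shell arithmetic (the
numbers are copied from the output shown above):</para>

<programlisting><![CDATA[
# D1 misses: read misses + write misses
echo $((21905 + 19280))    # prints 41185

# Combined L2 misses: L2 instruction misses + L2 data misses
echo $((275 + 23085))      # prints 23360]]></programlisting>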
253
254
255
256<sect2 id="cg-manual.outputfile" xreflabel="Output file">
257<title>Output file</title>
258
<para>As well as printing summary information, Cachegrind also
writes line-by-line cache profiling information to a user-specified
file.  By default this file is named
<computeroutput>cachegrind.out.pid</computeroutput>.  This file
is human-readable, but is intended to be interpreted by the accompanying
program <computeroutput>cg_annotate</computeroutput>, described
in the next section.</para>
266
267<para>Things to note about the
268<computeroutput>cachegrind.out.pid</computeroutput>
269file:</para>
270
271<itemizedlist>
272 <listitem>
    <para>It is written every time Cachegrind is run, and will
    overwrite any existing
    <computeroutput>cachegrind.out.pid</computeroutput>
    in the current directory (but that won't happen very often
    because it takes some time for process ids to be
    recycled).</para>
    <para>
    To use a basename other than the default
    <computeroutput>cachegrind.out</computeroutput>,
    use the <computeroutput>--cachegrind-out-file</computeroutput>
    switch.</para>
    <para>
    To add further qualifiers to the output filename you can use
    the core's <computeroutput>--log-file-qualifier</computeroutput>
    flag.  This extends the file name further with the text
    <computeroutput>.lfq.</computeroutput> followed by the
    contents of the environment variable specified by
    <computeroutput>--log-file-qualifier</computeroutput>.
    </para>
  </listitem>
293 <listitem>
294 <para>It can be huge: <computeroutput>ls -l</computeroutput>
295 generates a file of about 350KB. Browsing a few files and
296 web pages with a Konqueror built with full debugging
297 information generates a file of around 15 MB.</para>
298 </listitem>
299</itemizedlist>
300
<para>The <computeroutput>.pid</computeroutput> suffix
on the output file name serves two purposes.  Firstly, it means you
don't have to rename old log files that you don't want to overwrite.
Secondly, and more importantly, it allows correct profiling with the
<computeroutput>--trace-children=yes</computeroutput> option of
programs that spawn child processes.</para>
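
<para>A sketch of profiling a program together with its children;
<filename>./myprog</filename> is a placeholder, and how many files
appear depends on how many processes the program spawns:</para>

<programlisting><![CDATA[
valgrind --tool=cachegrind --trace-children=yes ./myprog

# One output file per process, told apart by the .pid suffix
ls cachegrind.out.*]]></programlisting>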
307
308</sect2>
309
310
311
312<sect2 id="cg-manual.cgopts" xreflabel="Cachegrind options">
313<title>Cachegrind options</title>
314
<!-- start of xi:include in the manpage -->
<para id="cg.opts.para">Manually specifies the I1/D1/L2 cache
configuration, where <varname>size</varname> and
<varname>line_size</varname> are measured in bytes.  The three items
must be comma-separated, but with no spaces, eg:
<literallayout>    valgrind --tool=cachegrind --I1=65536,2,64</literallayout>

You can specify one, two or three of the I1/D1/L2 caches.  Any level not
manually specified will be simulated using the configuration found in
the normal way (via the CPUID instruction for automagic cache
configuration, or failing that, via defaults).</para>
326
<para>Cache-simulation specific options are:</para>

<variablelist id="cg.opts.list">

  <varlistentry id="opt.I1" xreflabel="--I1">
    <term>
      <option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 1
      instruction cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.D1" xreflabel="--D1">
    <term>
      <option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 1
      data cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.L2" xreflabel="--L2">
    <term>
      <option><![CDATA[--L2=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 2
      cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
    <term>
      <option><![CDATA[--cachegrind-out-file=<basename> ]]></option>
    </term>
    <listitem>
      <para>Write the profile data to
            <computeroutput>basename.pid</computeroutput>
            rather than to the default output file,
            <computeroutput>cachegrind.out.pid</computeroutput>.
      </para>
    </listitem>
  </varlistentry>

</variablelist>
<!-- end of xi:include in the manpage -->

377</sect2>
378
379
380
381<sect2 id="cg-manual.annotate" xreflabel="Annotating C/C++ programs">
382<title>Annotating C/C++ programs</title>
383
<para>Before using <computeroutput>cg_annotate</computeroutput>,
it is worth widening your window to be at least 120 characters
wide if possible, as the output lines can be quite long.</para>

<para>To get a function-by-function summary, run
<computeroutput>cg_annotate --pid</computeroutput> in a directory
containing a <filename>cachegrind.out.pid</filename> file.  The
<emphasis>--pid</emphasis> is required so that
<computeroutput>cg_annotate</computeroutput> knows which log file to use
when several are present.</para>

395<para>The output looks like this:</para>
396
397<programlisting><![CDATA[
398--------------------------------------------------------------------------------
399I1 cache: 65536 B, 64 B, 2-way associative
400D1 cache: 65536 B, 64 B, 2-way associative
401L2 cache: 262144 B, 64 B, 8-way associative
402Command: concord vg_to_ucode.c
403Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
404Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
405Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
406Threshold: 99%
407Chosen for annotation:
408Auto-annotation: on
409
410--------------------------------------------------------------------------------
411Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
412--------------------------------------------------------------------------------
41327,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
414
415--------------------------------------------------------------------------------
416Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
417--------------------------------------------------------------------------------
4188,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
4195,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
4202,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
4212,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
4222,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
4231,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
424 897,991 51 51 897,831 95 30 62 1 1 ???:???
425 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
426 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
427 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
428 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
429 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
430 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
431 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
432 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
433 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
434 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
435 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue]]></programlisting>
436
437
438<para>First up is a summary of the annotation options:</para>
439
440<itemizedlist>
441
442 <listitem>
443 <para>I1 cache, D1 cache, L2 cache: cache configuration. So
444 you know the configuration with which these results were
445 obtained.</para>
446 </listitem>
447
448 <listitem>
449 <para>Command: the command line invocation of the program
450 under examination.</para>
451 </listitem>
452
453 <listitem>
454 <para>Events recorded: event abbreviations are:</para>
455 <itemizedlist>
456 <listitem>
457 <para><computeroutput>Ir </computeroutput>: I cache reads
458 (ie. instructions executed)</para>
459 </listitem>
460 <listitem>
461 <para><computeroutput>I1mr</computeroutput>: I1 cache read
462 misses</para>
463 </listitem>
464 <listitem>
465 <para><computeroutput>I2mr</computeroutput>: L2 cache
466 instruction read misses</para>
467 </listitem>
468 <listitem>
469 <para><computeroutput>Dr </computeroutput>: D cache reads
470 (ie. memory reads)</para>
471 </listitem>
472 <listitem>
473 <para><computeroutput>D1mr</computeroutput>: D1 cache read
474 misses</para>
475 </listitem>
476 <listitem>
477 <para><computeroutput>D2mr</computeroutput>: L2 cache data
478 read misses</para>
479 </listitem>
480 <listitem>
481 <para><computeroutput>Dw </computeroutput>: D cache writes
482 (ie. memory writes)</para>
483 </listitem>
484 <listitem>
485 <para><computeroutput>D1mw</computeroutput>: D1 cache write
486 misses</para>
487 </listitem>
488 <listitem>
489 <para><computeroutput>D2mw</computeroutput>: L2 cache data
490 write misses</para>
491 </listitem>
492 </itemizedlist>
493
    <para>Note that D1 total misses is given by
    <computeroutput>D1mr</computeroutput> +
    <computeroutput>D1mw</computeroutput>, and that L2 total
    misses is given by <computeroutput>I2mr</computeroutput> +
    <computeroutput>D2mr</computeroutput> +
    <computeroutput>D2mw</computeroutput>.</para>
500 </listitem>
501
502 <listitem>
503 <para>Events shown: the events shown (a subset of events
504 gathered). This can be adjusted with the
505 <computeroutput>--show</computeroutput> option.</para>
506 </listitem>
507
508 <listitem>
509 <para>Event sort order: the sort order in which functions are
510 shown. For example, in this case the functions are sorted
511 from highest <computeroutput>Ir</computeroutput> counts to
512 lowest. If two functions have identical
513 <computeroutput>Ir</computeroutput> counts, they will then be
514 sorted by <computeroutput>I1mr</computeroutput> counts, and
515 so on. This order can be adjusted with the
516 <computeroutput>--sort</computeroutput> option.</para>
517
518 <para>Note that this dictates the order the functions appear.
519 It is <command>not</command> the order in which the columns
520 appear; that is dictated by the "events shown" line (and can
521 be changed with the <computeroutput>--show</computeroutput>
522 option).</para>
523 </listitem>
524
525 <listitem>
    <para>Threshold: <computeroutput>cg_annotate</computeroutput>
    by default omits functions that cause very low numbers of
    misses to avoid drowning you in information.  In this case,
    cg_annotate shows summaries for the functions that account for
    99% of the <computeroutput>Ir</computeroutput> counts;
    <computeroutput>Ir</computeroutput> is chosen as the
    threshold event since it is the primary sort event.  The
    threshold can be adjusted with the
    <computeroutput>--threshold</computeroutput>
    option.</para>
536 </listitem>
537
538 <listitem>
539 <para>Chosen for annotation: names of files specified
540 manually for annotation; in this case none.</para>
541 </listitem>
542
543 <listitem>
    <para>Auto-annotation: whether auto-annotation was requested
    via the <computeroutput>--auto=yes</computeroutput>
    option.  In this case it was, as the
    <computeroutput>on</computeroutput> in the header shows.</para>
547 </listitem>
548
549</itemizedlist>
550
<para>Then follow summary statistics for the whole
program.  These are similar to the summary provided when running
<computeroutput>valgrind --tool=cachegrind</computeroutput>.</para>

<para>Then follow function-by-function statistics.  Each function
is identified by a
<computeroutput>file_name:function_name</computeroutput> pair.  If
a column contains only a dot it means the function never performs
that event (eg. the third row shows that
<computeroutput>strcmp()</computeroutput> contains no
instructions that write to memory).  The name
<computeroutput>???</computeroutput> is used if the file name
and/or function name could not be determined from debugging
information.  If most of the entries have the form
<computeroutput>???:???</computeroutput> the program probably
wasn't compiled with <computeroutput>-g</computeroutput>.  If any
code was invalidated (either due to self-modifying code or
unloading of shared objects) its counts are aggregated into a
single cost centre written as
<computeroutput>(discarded):(discarded)</computeroutput>.</para>
571
572<para>It is worth noting that functions will come from three
573types of source files:</para>
574
575<orderedlist>
576 <listitem>
577 <para>From the profiled program
578 (<filename>concord.c</filename> in this example).</para>
579 </listitem>
580 <listitem>
581 <para>From libraries (eg. <filename>getc.c</filename>)</para>
582 </listitem>
583 <listitem>
584 <para>From Valgrind's implementation of some libc functions
585 (eg. <computeroutput>vg_clientmalloc.c:malloc</computeroutput>).
586 These are recognisable because the filename begins with
587 <computeroutput>vg_</computeroutput>, and is probably one of
588 <filename>vg_main.c</filename>,
589 <filename>vg_clientmalloc.c</filename> or
590 <filename>vg_mylibc.c</filename>.</para>
591 </listitem>
592
593</orderedlist>
594
<para>There are two ways to annotate source files -- by choosing
them manually, or with the
<computeroutput>--auto=yes</computeroutput> option.  To do it
manually, just specify the filenames as arguments to
<computeroutput>cg_annotate</computeroutput>.  For example,
running <filename>cg_annotate concord.c</filename>
for our example produces the same output as above followed by an
annotated version of <filename>concord.c</filename>, a section of
which looks like:</para>
604
<programlisting><![CDATA[
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

[snip]

 . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
 3 1 1 . . . 1 0 0 {
 . . . . . . . . . FILE *file_ptr;
 . . . . . . . . . Word_Info *data;
 1 0 0 . . . 1 1 1 int line = 1, i;
 . . . . . . . . .
 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
 . . . . . . . . .
 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
 . . . . . . . . .
 . . . . . . . . . /* Open file, check it. */
 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
 2 0 0 1 0 0 . . . if (!(file_ptr)) {
 . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
 1 1 1 . . . . . . exit(EXIT_FAILURE);
 . . . . . . . . . }
 . . . . . . . . .
 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
 . . . . . . . . .
 4 0 0 1 0 0 2 0 0 free(data);
 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
 3 0 0 2 0 0 . . . }]]></programlisting>
637
638<para>(Although column widths are automatically minimised, a wide
639terminal is clearly useful.)</para>
640
641<para>Each source file is clearly marked
642(<computeroutput>User-annotated source</computeroutput>) as
643having been chosen manually for annotation. If the file was
644found in one of the directories specified with the
645<computeroutput>-I / --include</computeroutput> option, the directory
646and file are both given.</para>
647
648<para>Each line is annotated with its event counts. Events not
649applicable for a line are represented by a `.'; this is useful
650for distinguishing between an event which cannot happen, and one
651which can but did not.</para>
652
<para>Sometimes only a small section of a source file is
executed.  To minimise uninteresting output, Cachegrind only shows
annotated lines and lines within a small distance of annotated
lines.  Gaps are marked with the line numbers so you know which
part of a file the shown code comes from, eg:</para>
658
659<programlisting><![CDATA[
660(figures and code for line 704)
661-- line 704 ----------------------------------------
662-- line 878 ----------------------------------------
663(figures and code for line 878)]]></programlisting>
664
665<para>The amount of context to show around annotated lines is
666controlled by the <computeroutput>--context</computeroutput>
667option.</para>
668
669<para>To get automatic annotation, run
670<computeroutput>cg_annotate --auto=yes</computeroutput>.
671cg_annotate will automatically annotate every source file it can
672find that is mentioned in the function-by-function summary.
673Therefore, the files chosen for auto-annotation are affected by
674the <computeroutput>--sort</computeroutput> and
675<computeroutput>--threshold</computeroutput> options. Each
676source file is clearly marked (<computeroutput>Auto-annotated
677source</computeroutput>) as being chosen automatically. Any
678files that could not be found are mentioned at the end of the
679output, eg:</para>
680
681<programlisting><![CDATA[
682------------------------------------------------------------------
683The following files chosen for auto-annotation could not be found:
684------------------------------------------------------------------
685 getc.c
686 ctype.c
687 ../sysdeps/generic/lockfile.c]]></programlisting>
688
<para>This is quite common for library files, since libraries are
usually compiled with debugging information, but the source files
are often not present on a system.  If a file is chosen for
annotation <command>both</command> manually and automatically, it
is marked as <computeroutput>User-annotated
source</computeroutput>.  Use the <computeroutput>-I /
--include</computeroutput> option to tell cg_annotate where to look
for source files if the filenames found from the debugging
information aren't specific enough.</para>
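
<para>A sketch of auto-annotation with an extra source directory.
The pid 12345 and the directory
<filename>/path/to/library/src</filename> are placeholders, and the
<computeroutput>--include=dir</computeroutput> spelling is an
assumption about how the
<computeroutput>-I / --include</computeroutput> option takes its
argument:</para>

<programlisting><![CDATA[
# Annotate every source file that can be found, also searching an extra directory
cg_annotate --cachegrind-out-file=cachegrind.out.12345 --auto=yes \
  --include=/path/to/library/src]]></programlisting>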
698
699<para>Beware that cg_annotate can take some time to digest large
700<computeroutput>cachegrind.out.pid</computeroutput> files,
701e.g. 30 seconds or more. Also beware that auto-annotation can
702produce a lot of output if your program is large!</para>
703
704</sect2>
705
706
707<sect2 id="cg-manual.assembler" xreflabel="Annotating assembler programs">
708<title>Annotating assembler programs</title>
709
710<para>Valgrind can annotate assembler programs too, or annotate
711the assembler generated for your C program. Sometimes this is
712useful for understanding what is really happening when an
713interesting line of C code is translated into multiple
714instructions.</para>
715
716<para>To do this, you just need to assemble your
717<computeroutput>.s</computeroutput> files with assembler-level
718debug information. gcc doesn't do this, but you can use the GNU
719assembler with the <computeroutput>--gstabs</computeroutput>
720option to generate object files with this information, eg:</para>
721
722<programlisting><![CDATA[
723as --gstabs foo.s]]></programlisting>
724
725<para>You can then profile and annotate source files in the same
726way as for C/C++ programs.</para>
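
<para>Put together, the sequence might look like the following
sketch, which starts from a hypothetical C file
<filename>foo.c</filename> and uses a placeholder pid of 12345:</para>

<programlisting><![CDATA[
# Generate assembler from the C source, then assemble it with debug info
gcc -S -O2 foo.c                 # writes foo.s
as --gstabs foo.s -o foo.o
gcc foo.o -o foo

# Profile, then annotate the assembler file itself
valgrind --tool=cachegrind ./foo
cg_annotate --cachegrind-out-file=cachegrind.out.12345 foo.s]]></programlisting>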
727
728</sect2>
729
730</sect1>
731
732
733<sect1 id="cg-manual.annopts" xreflabel="cg_annotate options">
734<title><computeroutput>cg_annotate</computeroutput> options</title>
735
736<itemizedlist>
737
  <listitem id="pid">
    <para><computeroutput>--pid</computeroutput></para>
    <para>Indicates that profile data should be read from
    the file
    <computeroutput>cachegrind.out.pid</computeroutput>.
    Note that you must specify either
    <computeroutput>--pid</computeroutput>
    or <computeroutput>--cachegrind-out-file=filename</computeroutput>
    exactly once.
    </para>
  </listitem>

  <listitem id="cachegrind-out-file">
    <para><computeroutput>--cachegrind-out-file=filename</computeroutput></para>
    <para>Indicates that profile data
    should be read from <computeroutput>filename</computeroutput>.
    Note that you must specify either
    <computeroutput>--pid</computeroutput>
    or <computeroutput>--cachegrind-out-file=filename</computeroutput>
    exactly once.
    </para>
  </listitem>
761
762 <listitem>
763 <para><computeroutput>-h, --help</computeroutput></para>
764 <para><computeroutput>-v, --version</computeroutput></para>
765 <para>Help and version, as usual.</para>
766 </listitem>
767
  <listitem id="sort">
    <para><computeroutput>--sort=A,B,C</computeroutput> [default:
    order in
    <computeroutput>cachegrind.out.pid</computeroutput>]</para>
    <para>Specifies the events upon which the sorting of the
    function-by-function entries will be based. Useful if you
    want to concentrate on eg. I cache misses
    (<computeroutput>--sort=I1mr,I2mr</computeroutput>), or D
    cache misses
    (<computeroutput>--sort=D1mr,D2mr</computeroutput>), or L2
    misses
    (<computeroutput>--sort=D2mr,I2mr</computeroutput>).</para>
  </listitem>

  <listitem id="show">
    <para><computeroutput>--show=A,B,C</computeroutput> [default:
    all, using order in
    <computeroutput>cachegrind.out.pid</computeroutput>]</para>
    <para>Specifies which events to show (and the column
    order). Default is to use all present in the
    <computeroutput>cachegrind.out.pid</computeroutput> file (and
    use the order in the file).</para>
  </listitem>

  <listitem id="threshold">
    <para><computeroutput>--threshold=X</computeroutput>
    [default: 99%]</para>
    <para>Sets the threshold for the function-by-function
    summary. Functions are shown that account for more than X%
    of the primary sort event. If auto-annotating, also affects
    which files are annotated.</para>

    <para>Note: thresholds can be set for more than one of the
    events by appending any events for the
    <computeroutput>--sort</computeroutput> option with a colon
    and a number (no spaces, though). E.g. if you want to see
    the functions that cover 99% of L2 read misses and 99% of L2
    write misses, use this option:</para>
    <para><computeroutput>--sort=D2mr:99,D2mw:99</computeroutput></para>
  </listitem>

  <listitem id="auto">
    <para><computeroutput>--auto=no</computeroutput> [default]</para>
    <para><computeroutput>--auto=yes</computeroutput></para>
    <para>When enabled, automatically annotates every file that
    is mentioned in the function-by-function summary that can be
    found. Also gives a list of those that couldn't be found.</para>
  </listitem>

  <listitem id="context">
    <para><computeroutput>--context=N</computeroutput> [default:
    8]</para>
    <para>Print N lines of context before and after each
    annotated line. Avoids printing large sections of source
    files that were not executed. Use a large number
    (eg. 10,000) to show all source lines.</para>
  </listitem>

  <listitem id="include">
    <para><computeroutput>-I&lt;dir&gt;,
    --include=&lt;dir&gt;</computeroutput> [default: empty
    string]</para>
    <para>Adds a directory to the list in which to search for
    files. Multiple -I/--include options can be given to add
    multiple directories.</para>
  </listitem>

</itemizedlist>
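
<para>As a rough illustration of how
<computeroutput>--sort</computeroutput> and
<computeroutput>--threshold</computeroutput> interact, the following
Python sketch orders functions by a single event and keeps the leading
ones until their cumulative share of that event reaches the threshold.
It is a sketch under stated assumptions, not
<computeroutput>cg_annotate</computeroutput>'s own code: the helper name
<computeroutput>functions_to_show</computeroutput> and the counts are
made up, and the cumulative reading of the threshold is an
assumption.</para>
<programlisting><![CDATA[
```python
def functions_to_show(counts, threshold_pct):
    """Hypothetical helper: return the leading functions whose cumulative
    share of one event first reaches threshold_pct percent."""
    total = sum(counts.values())
    shown, running = [], 0
    # Largest counts first, as in a function-by-function summary.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ordered:
        # Stop once the functions already listed cover the threshold.
        if total == 0 or running * 100 >= threshold_pct * total:
            break
        shown.append(name)
        running += count
    return shown


# Made-up I1mr counts for four functions.
example = {"main": 700, "parse": 200, "emit": 90, "cleanup": 10}
print(functions_to_show(example, 99))
print(functions_to_show(example, 50))
```
]]></programlisting>
<para>With these made-up counts, a 99% threshold lists
<computeroutput>main</computeroutput>,
<computeroutput>parse</computeroutput> and
<computeroutput>emit</computeroutput>, while a 50% threshold lists only
<computeroutput>main</computeroutput>.</para>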



<sect2>
<title>Warnings</title>

<para>There are a couple of situations in which
<computeroutput>cg_annotate</computeroutput> issues
warnings.</para>

<itemizedlist>
  <listitem>
    <para>If a source file is more recent than the
    <computeroutput>cachegrind.out.pid</computeroutput> file.
    This is because the information in
    <computeroutput>cachegrind.out.pid</computeroutput> is only
    recorded with line numbers, so if the line numbers change at
    all in the source (eg. lines added, deleted, swapped), any
    annotations will be incorrect.</para>
  </listitem>
  <listitem>
    <para>If information is recorded about line numbers past the
    end of a file. This can be caused by the above problem,
    ie. shortening the source file while using an old
    <computeroutput>cachegrind.out.pid</computeroutput> file. If
    this happens, the figures for the bogus lines are printed
    anyway (clearly marked as bogus) in case they are
    important.</para>
  </listitem>
</itemizedlist>

</sect2>



<sect2>
<title>Things to watch out for</title>

<para>Some odd things that can occur during annotation:</para>

<itemizedlist>
  <listitem>
    <para>If annotating at the assembler level, you might see
    something like this:</para>
<programlisting><![CDATA[
 1  0  0  .  .  .  .  .  .         leal -12(%ebp),%eax
 1  0  0  .  .  .  1  0  0         movl %eax,84(%ebx)
 2  0  0  0  0  0  1  0  0         movl $1,-20(%ebp)
 .  .  .  .  .  .  .  .  .         .align 4,0x90
 1  0  0  .  .  .  .  .  .         movl $.LnrB,%eax
 1  0  0  .  .  .  1  0  0         movl %eax,-16(%ebp)]]></programlisting>

    <para>How can the third instruction be executed twice when
    the others are executed only once? As it turns out, it
    isn't. Here's a dump of the executable, using
    <computeroutput>objdump -d</computeroutput>:</para>
<programlisting><![CDATA[
 8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
 8048f28:       89 43 54                mov    %eax,0x54(%ebx)
 8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
 8048f32:       89 f6                   mov    %esi,%esi
 8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
 8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)]]></programlisting>

    <para>Notice the extra <computeroutput>mov
    %esi,%esi</computeroutput> instruction. Where did this come
    from? The GNU assembler inserted it to serve as the two
    bytes of padding needed to align the <computeroutput>movl
    $.LnrB,%eax</computeroutput> instruction on a four-byte
    boundary, but pretended it didn't exist when adding debug
    information. Thus when Valgrind reads the debug info it
    thinks that the <computeroutput>movl
    $0x1,0xffffffec(%ebp)</computeroutput> instruction covers the
    address range 0x8048f2b--0x8048f33 by itself, and attributes
    the counts for the <computeroutput>mov
    %esi,%esi</computeroutput> to it.</para>
  </listitem>

  <listitem>
    <para>Inlined functions can cause strange results in the
    function-by-function summary. If a function
    <computeroutput>inline_me()</computeroutput> is defined in
    <filename>foo.h</filename> and inlined in the functions
    <computeroutput>f1()</computeroutput>,
    <computeroutput>f2()</computeroutput> and
    <computeroutput>f3()</computeroutput> in
    <filename>bar.c</filename>, there will not be a
    <computeroutput>foo.h:inline_me()</computeroutput> function
    entry. Instead, there will be separate function entries for
    each inlining site, ie.
    <computeroutput>foo.h:f1()</computeroutput>,
    <computeroutput>foo.h:f2()</computeroutput> and
    <computeroutput>foo.h:f3()</computeroutput>. To find the
    total counts for
    <computeroutput>foo.h:inline_me()</computeroutput>, add up
    the counts from each entry.</para>

    <para>The reason for this is that although the debug info
    output by gcc indicates the switch from
    <filename>bar.c</filename> to <filename>foo.h</filename>, it
    doesn't indicate the name of the function in
    <filename>foo.h</filename>, so Valgrind keeps using the old
    one.</para>
  </listitem>

  <listitem>
    <para>Sometimes, the same filename might be represented with
    a relative name and with an absolute name in different parts
    of the debug info, eg:
    <filename>/home/user/proj/proj.h</filename> and
    <filename>../proj.h</filename>. In this case, if you use
    auto-annotation, the file will be annotated twice with the
    counts split between the two.</para>
  </listitem>

  <listitem>
    <para>Files with more than 65,535 lines cause difficulties
    for the stabs debug info reader. This is because the line
    number in the <computeroutput>struct nlist</computeroutput>
    defined in <filename>a.out.h</filename> under Linux is only a
    16-bit value. Valgrind can handle some files with more than
    65,535 lines correctly by making some guesses to identify
    line number overflows. But some cases are beyond it, in
    which case you'll get a warning message explaining that
    annotations for the file might be incorrect.</para>
  </listitem>

  <listitem>
    <para>If you compile some files with
    <computeroutput>-g</computeroutput> and some without, some
    events that take place in a file without debug info could be
    attributed to the last line of a file with debug info
    (whichever one gets placed before the non-debug-info file in
    the executable).</para>
  </listitem>

</itemizedlist>

<para>This list looks long, but these cases should be fairly
rare.</para>

<formalpara>
  <title>Note:</title>
  <para><computeroutput>stabs</computeroutput> is not an easy
  format to read. If you come across bizarre annotations that
  look like they might be caused by a bug in the stabs reader, please
  let us know.</para>
</formalpara>
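
<para>The 16-bit line-number limit mentioned in the list above can be
illustrated with a little arithmetic. The Python sketch below shows how
a line number past 65,535 wraps around in a 16-bit field, and one
possible way of guessing where a wrap occurred. The
<computeroutput>unwrap</computeroutput> heuristic is hypothetical and is
not claimed to be what Valgrind's stabs reader does; it also shows why
such guesses can fail, since it assumes entries arrive in roughly
increasing line order.</para>
<programlisting><![CDATA[
```python
LINE_FIELD_BITS = 16  # width of the line-number field described above


def stored_line(true_line):
    """Return the value a 16-bit unsigned field would hold for true_line."""
    return true_line & ((1 << LINE_FIELD_BITS) - 1)


def unwrap(stored_lines):
    """Hypothetical recovery heuristic (not Valgrind's actual algorithm):
    treat a large backwards jump between consecutive entries as an overflow."""
    half_range = 1 << (LINE_FIELD_BITS - 1)
    result, base, prev = [], 0, 0
    for value in stored_lines:
        if prev - value > half_range:
            base += 1 << LINE_FIELD_BITS
        result.append(base + value)
        prev = value
    return result


print(stored_line(65535))             # 65535: still fits
print(stored_line(70000))             # 4464: wrapped around
print(unwrap([65530, 65535, 4, 10]))  # [65530, 65535, 65540, 65546]
```
]]></programlisting>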

</sect2>



<sect2>
<title>Accuracy</title>

<para>Valgrind's cache profiling has a number of
shortcomings:</para>

<itemizedlist>
  <listitem>
    <para>It doesn't account for kernel activity -- the effect of
    system calls on the cache contents is ignored.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for other process activity (although
    this is probably desirable when considering a single
    program).</para>
  </listitem>

  <listitem>
    <para>It doesn't account for virtual-to-physical address
    mappings; hence the entire simulation is not a true
    representation of what's happening in the
    cache.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for cache misses not visible at the
    instruction level, eg. those arising from TLB misses, or
    speculative execution.</para>
  </listitem>

  <listitem>
    <para>Valgrind will schedule
    threads differently from how they would be when running natively.
    This could warp the results for threaded programs.</para>
  </listitem>

  <listitem>
    <para>The x86/amd64 instructions <computeroutput>bts</computeroutput>,
    <computeroutput>btr</computeroutput> and
    <computeroutput>btc</computeroutput> will incorrectly be
    counted as doing a data read if both the arguments are
    registers, eg:</para>
<programlisting><![CDATA[
    btsl %eax, %edx]]></programlisting>

    <para>This should only happen rarely.</para>
  </listitem>

  <listitem>
    <para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
    (e.g. <computeroutput>fsave</computeroutput>) are treated as
    though they only access 16 bytes. These instructions seem to
    be rare so hopefully this won't affect accuracy much.</para>
  </listitem>

</itemizedlist>

<para>Another thing worth noting is that results are very
sensitive to small changes. Changing the size of the
<filename>valgrind.so</filename> file, the size of the program
being profiled, or even the length of its name can perturb the
results. Variations will be small, but don't expect perfectly
repeatable results if your program changes at all.</para>

<para>While these factors mean you shouldn't trust the results to
be super-accurate, hopefully they should be close enough to be
useful.</para>

</sect2>

</sect1>

<sect1>
<title>Implementation details</title>
<para>This section covers details that you don't need to know in order to
use Cachegrind, but which may be of interest to some people.</para>

<sect2>
<title>How Cachegrind works</title>
<para>The best reference for understanding how Cachegrind works is chapter 3 of
"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
is available on the <ulink url="&vg-pubs;">Valgrind publications
page</ulink>.</para>
</sect2>

<sect2>
<title>Cachegrind output file format</title>
<para>The file format is fairly straightforward, basically giving the
cost centre for every line, grouped by files and
functions. Total counts (eg. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra adds can make a
sizeable difference.</para>

<para>The file format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
  <listitem>
    <para><computeroutput>non_nl_string</computeroutput> is any
    string not containing a newline.</para>
  </listitem>
  <listitem>
    <para><computeroutput>cmd</computeroutput> is a string holding the
    command line of the profiled program.</para>
  </listitem>
  <listitem>
    <para><computeroutput>event</computeroutput> is a string containing
    no whitespace.</para>
  </listitem>
  <listitem>
    <para><computeroutput>filename</computeroutput> and
    <computeroutput>fn_name</computeroutput> are strings.</para>
  </listitem>
  <listitem>
    <para><computeroutput>num</computeroutput> and
    <computeroutput>line_num</computeroutput> are decimal
    numbers.</para>
  </listitem>
  <listitem>
    <para><computeroutput>ws</computeroutput> is whitespace.</para>
  </listitem>
</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing simulation
specific information, eg. for giving the cache configuration for
cache simulation.</para>

<para>More than one line of info can be presented for each file/fn/line number.
In such cases, the counts for the named events will be accumulated.</para>

<para>Counts can be "." to represent zero. This makes the files easier to
read.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If the number in
a <computeroutput>count_line</computeroutput> is less, cg_annotate
treats the missing counts as though they were "." entries.</para>

<para>A <computeroutput>file_line</computeroutput> changes the
current file name. A <computeroutput>fn_line</computeroutput>
changes the current function name. A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. A
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> will normally be
immediately followed by a <computeroutput>fn_line</computeroutput>. But it
doesn't have to be.</para>
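
<para>To make the rules above concrete, here is a minimal Python sketch
of a reader for this format. It is illustrative only and is not how
<computeroutput>cg_annotate</computeroutput> is implemented: the
function name <computeroutput>parse_cachegrind</computeroutput>, the
returned dictionary layout and the sample data are all made up for the
example. It treats "." and missing trailing counts as zero, and
accumulates repeated entries for the same file/fn/line number.</para>
<programlisting><![CDATA[
```python
def parse_cachegrind(lines):
    """Parse cachegrind output lines into a small dict (illustrative only)."""
    desc, cmd, events, summary = [], None, [], None
    counts = {}  # (file, function, line number) -> list of counts
    cur_file = cur_fn = None

    def to_counts(fields):
        # "." stands for zero; missing trailing counts are treated the same way.
        nums = [0 if f == "." else int(f) for f in fields]
        return nums + [0] * (len(events) - len(nums))

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("desc:"):
            desc.append(line[len("desc:"):].strip())
        elif line.startswith("cmd:"):
            cmd = line[len("cmd:"):].strip()
        elif line.startswith("events:"):
            events = line[len("events:"):].split()
        elif line.startswith("summary:"):
            summary = to_counts(line[len("summary:"):].split())
        elif line.startswith("fl="):
            cur_file = line[3:]   # file_line: changes the current file name
        elif line.startswith("fn="):
            cur_fn = line[3:]     # fn_line: changes the current function name
        else:
            # count_line: a line number followed by one count per event.
            fields = line.split()
            key = (cur_file, cur_fn, int(fields[0]))
            new = to_counts(fields[1:])
            old = counts.get(key, [0] * len(events))
            # Repeated entries for the same file/fn/line are accumulated.
            counts[key] = [a + b for a, b in zip(old, new)]
    return {"desc": desc, "cmd": cmd, "events": events,
            "counts": counts, "summary": summary}


# Made-up sample data that follows the grammar above.
sample = """\
desc: made-up cache configuration
cmd: ./hello
events: Ir I1mr I2mr
fl=hello.c
fn=main
3 5 1 1
4 2 . .
4 1
summary: 8 1 1
""".splitlines()

result = parse_cachegrind(sample)
print(result["counts"][("hello.c", "main", 4)])  # [3, 0, 0]
```
]]></programlisting>
<para>In the sample, line 4 of <computeroutput>main</computeroutput>
appears twice, once with "." entries and once with a missing count, so
its counts are padded with zeros and then added together.</para>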


</sect2>

</sect1>
</chapter>