<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="cg-manual" xreflabel="Cachegrind: a cache-miss profiler">
<title>Cachegrind: a cache and branch profiler</title>

<sect1 id="cg-manual.cache" xreflabel="Cache profiling">
<title>Cache and branch profiling</title>

<para>To use this tool, you must specify
<computeroutput>--tool=cachegrind</computeroutput> on the
Valgrind command line.</para>

<para>Cachegrind is a tool for finding places where programs
interact badly with typical modern superscalar processors
and run slowly as a result.
In particular, it will do a cache simulation of your program,
and optionally a branch-predictor simulation, and can
then annotate your source line-by-line with the number of cache
misses and branch mispredictions. The following statistics are
collected:</para>
<itemizedlist>
  <listitem>
    <para>L1 instruction cache reads and misses;</para>
  </listitem>
  <listitem>
    <para>L1 data cache reads and read misses, writes and write
    misses;</para>
  </listitem>
  <listitem>
    <para>L2 unified cache reads and read misses, writes and
    write misses.</para>
  </listitem>
  <listitem>
    <para>Conditional branches and mispredicted conditional branches.</para>
  </listitem>
  <listitem>
    <para>Indirect branches and mispredicted indirect branches. An
    indirect branch is a jump or call to a destination only known at
    run time.</para>
  </listitem>
</itemizedlist>

<para>On a modern machine, an L1 miss will typically cost
around 10 cycles, an L2 miss can cost as much as 200
cycles, and a mispredicted branch costs in the region of 10
to 30 cycles. Detailed cache and branch profiling can be very useful
for improving the performance of your program.</para>

<para>Also, since one instruction cache read is performed per
instruction executed, you can find out how many instructions are
executed per line, which can be useful for traditional profiling
and test coverage.</para>

<para>Branch profiling is not enabled by default. To use it, you must
additionally specify <computeroutput>--branch-sim=yes</computeroutput>
on the command line.</para>


<sect2 id="cg-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>First off, as for normal Valgrind use, you probably want to
compile with debugging info (the
<computeroutput>-g</computeroutput> flag). But by contrast with
normal Valgrind use, you probably <command>do</command> want to turn
optimisation on, since you should profile your program as it will
be normally run.</para>

<para>The two steps are:</para>
<orderedlist>
  <listitem>
    <para>Run your program with <computeroutput>valgrind
    --tool=cachegrind</computeroutput> in front of the normal
    command line invocation. When the program finishes,
    Cachegrind will print summary cache statistics. It also
    collects line-by-line information in a file
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>, where
    <computeroutput>&lt;pid&gt;</computeroutput> is the program's process
    ID.</para>

    <para>Branch prediction statistics are not collected by default.
    To do so, add the flag
    <computeroutput>--branch-sim=yes</computeroutput>.
    </para>

    <para>This step should be done every time you want to collect
    information about a new program, a changed program, or about
    the same program with different input.</para>
  </listitem>

  <listitem>
    <para>Generate a function-by-function summary, and possibly
    annotate source files, using the supplied
    cg_annotate program. Source
    files to annotate can be specified manually on
    the command line, or "interesting" source files can be
    annotated automatically with the
    <computeroutput>--auto=yes</computeroutput> option. You can
    annotate C/C++ files or assembly language files equally
    easily.</para>

    <para>This step can be performed as many times as you like
    for each Step 1. You may want to do multiple annotations
    showing different information each time.</para>
  </listitem>

</orderedlist>

<para>As an optional intermediate step, you can use the supplied
cg_merge program to sum together the
outputs of multiple Cachegrind runs, into a single file which you then
use as the input for cg_annotate.</para>

<para>These steps are described in detail in the following
sections.</para>

</sect2>


<sect2 id="cache-sim" xreflabel="Cache simulation specifics">
<title>Cache simulation specifics</title>

<para>Cachegrind simulates a machine with independent
first level instruction and data caches (I1 and D1), backed by a
unified second level cache (L2). This configuration is used by almost
all modern machines. Some old Cyrix CPUs had a unified I and D L1
cache, but they are ancient history now.</para>

<para>Specific characteristics of the simulation are as
follows:</para>

<itemizedlist>

  <listitem>
    <para>Write-allocate: when a write miss occurs, the block
    written to is brought into the D1 cache. Most modern caches
    have this property.</para>
  </listitem>

  <listitem>
    <para>Bit-selection hash function: the line(s) in the cache
    to which a memory block maps are chosen by the middle bits
    M--(M+N-1) of the byte address, where:</para>
    <itemizedlist>
      <listitem>
        <para>line size = 2^M bytes</para>
      </listitem>
      <listitem>
        <para>(cache size / line size) = 2^N lines</para>
      </listitem>
    </itemizedlist>
  </listitem>

  <listitem>
    <para>Inclusive L2 cache: the L2 cache replicates all the
    entries of the L1 caches. This is standard on Pentium chips,
    but AMD Opterons, Athlons and Durons
    use an exclusive L2 cache that only holds
    blocks evicted from L1. Ditto most modern VIA CPUs.</para>
  </listitem>

</itemizedlist>
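
<para>The bit-selection rule can be sketched in a few lines of Python.
This is only an illustration of the formula as stated in the list above
(like that formula, it ignores associativity); it is not taken from
Cachegrind's source, and the function name
<computeroutput>cache_line_index</computeroutput> is hypothetical.</para>

<programlisting><![CDATA[
```python
def is_power_of_two(x: int) -> bool:
    """True if x is a positive power of two."""
    return x > 0 and (x & (x - 1)) == 0


def cache_line_index(addr: int, cache_size: int, line_size: int) -> int:
    """Return the index selected by bits M..(M+N-1) of a byte address.

    line_size = 2^M bytes and cache_size / line_size = 2^N, as in the text.
    """
    if not (is_power_of_two(cache_size) and is_power_of_two(line_size)):
        raise ValueError("cache size and line size must be powers of two")
    m = line_size.bit_length() - 1                   # log2(line size)
    n = (cache_size // line_size).bit_length() - 1   # log2(number of lines)
    return (addr >> m) & ((1 << n) - 1)


if __name__ == "__main__":
    # Two addresses exactly one cache size apart select the same index.
    print(cache_line_index(0x12345, 65536, 64))
    print(cache_line_index(0x12345 + 65536, 65536, 64))
```
]]></programlisting>

<para>Addresses that differ only in bits above M+N-1 compete for the
same index, which is why data structures spaced a multiple of the cache
size apart can evict each other.</para>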

<para>The cache configuration simulated (cache size,
associativity and line size) is determined automagically using
the CPUID instruction. If you have an old machine that (a)
doesn't support the CPUID instruction, or (b) supports it in an
early incarnation that doesn't give any cache information, then
Cachegrind will fall back to using a default configuration (that
of a model 3/4 Athlon). Cachegrind will tell you if this
happens. You can manually specify one, two or all three levels
(I1/D1/L2) of the cache from the command line using the
<computeroutput>--I1</computeroutput>,
<computeroutput>--D1</computeroutput> and
<computeroutput>--L2</computeroutput> options.</para>

<para>On PowerPC platforms Cachegrind cannot automatically
determine the cache configuration, so you will
need to specify it with the
<computeroutput>--I1</computeroutput>,
<computeroutput>--D1</computeroutput> and
<computeroutput>--L2</computeroutput> options.</para>


<para>Other noteworthy behaviour:</para>

<itemizedlist>
  <listitem>
    <para>References that straddle two cache lines are treated as
    follows:</para>
    <itemizedlist>
      <listitem>
        <para>If both blocks hit --&gt; counted as one hit</para>
      </listitem>
      <listitem>
        <para>If one block hits, the other misses --&gt; counted
        as one miss.</para>
      </listitem>
      <listitem>
        <para>If both blocks miss --&gt; counted as one miss (not
        two)</para>
      </listitem>
    </itemizedlist>
  </listitem>

  <listitem>
    <para>Instructions that modify a memory location
    (eg. <computeroutput>inc</computeroutput> and
    <computeroutput>dec</computeroutput>) are counted as doing
    just a read, ie. a single data reference. This may seem
    strange, but since the write can never cause a miss (the read
    guarantees the block is in the cache) it's not very
    interesting.</para>

    <para>Thus it measures not the number of times the data cache
    is accessed, but the number of times a data cache miss could
    occur.</para>
  </listitem>

</itemizedlist>
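
<para>The counting rule for straddling references can be written down
as a tiny Python function. This is a sketch of the rule as described
above, not Cachegrind's code; the name
<computeroutput>classify_reference</computeroutput> is hypothetical.</para>

<programlisting><![CDATA[
```python
from typing import Sequence


def classify_reference(block_hits: Sequence[bool]) -> str:
    """Classify one memory reference as a single "hit" or a single "miss".

    block_hits holds one flag per cache line touched by the reference:
    one entry for an ordinary access, two for an access that straddles
    two lines. Any missing block makes the whole reference one miss.
    """
    if not block_hits:
        raise ValueError("a reference touches at least one cache line")
    return "hit" if all(block_hits) else "miss"


if __name__ == "__main__":
    for flags in ([True, True], [True, False], [False, False]):
        print(flags, classify_reference(flags))
```
]]></programlisting>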

<para>If you are interested in simulating a cache with different
properties, it is not particularly hard to write your own cache
simulator, or to modify the existing ones in
<computeroutput>vg_cachesim_I1.c</computeroutput>,
<computeroutput>vg_cachesim_D1.c</computeroutput>,
<computeroutput>vg_cachesim_L2.c</computeroutput> and
<computeroutput>vg_cachesim_gen.c</computeroutput>. We'd be
interested to hear from anyone who does.</para>

</sect2>


<sect2 id="branch-sim" xreflabel="Branch simulation specifics">
<title>Branch simulation specifics</title>

<para>Cachegrind simulates branch predictors intended to be
typical of mainstream desktop/server processors of around 2004.</para>

<para>Conditional branches are predicted using an array of 16384 2-bit
saturating counters. The array index used for a branch instruction is
computed partly from the low-order bits of the branch instruction's
address and partly using the taken/not-taken behaviour of the last few
conditional branches. As a result the predictions for any specific
branch depend both on its own history and the behaviour of previous
branches. This is a standard technique for improving prediction
accuracy.</para>
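
<para>A predictor of this general shape can be sketched in Python as
follows. Only the table size (16384 two-bit saturating counters) comes
from the description above. The number of history bits, the XOR used to
combine address and history, and the initial counter value are
assumptions made for illustration, and the class name
<computeroutput>ConditionalPredictor</computeroutput> is hypothetical;
Cachegrind's actual index computation may differ.</para>

<programlisting><![CDATA[
```python
N_COUNTERS = 16384   # from the text: 16384 two-bit saturating counters
HISTORY_BITS = 4     # assumption: the text only says "the last few" branches


class ConditionalPredictor:
    """Sketch of a two-bit-counter predictor indexed by address and history."""

    def __init__(self) -> None:
        # Assumption: counters start at 1 ("weakly not taken").
        self.counters = [1] * N_COUNTERS
        self.history = 0  # recent taken/not-taken outcomes, newest in bit 0

    def _index(self, branch_addr: int) -> int:
        # Assumption: address bits and history are combined with XOR.
        return (branch_addr ^ self.history) & (N_COUNTERS - 1)

    def predict_and_update(self, branch_addr: int, taken: bool) -> bool:
        """Return True if this branch was mispredicted, then train."""
        i = self._index(branch_addr)
        predicted_taken = self.counters[i] >= 2
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
        return predicted_taken != taken


if __name__ == "__main__":
    predictor = ConditionalPredictor()
    misses = sum(predictor.predict_and_update(0x400, True) for _ in range(100))
    print("mispredictions for an always-taken branch:", misses)
```
]]></programlisting>

<para>Because the history takes part in the index, one branch can occupy
several counters, one per recent-history pattern, which is how the
behaviour of earlier branches influences the prediction.</para>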

<para>For indirect branches (that is, jumps to unknown destinations)
Cachegrind uses a simple branch target address predictor. Targets are
predicted using an array of 512 entries indexed by the low order 9
bits of the branch instruction's address. Each branch is predicted to
jump to the same address it did last time. Any other behaviour causes
a mispredict.</para>
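
<para>The indirect-branch scheme is simple enough to sketch directly in
Python. The table size and indexing follow the description above; the
empty initial state of the table and the class name
<computeroutput>IndirectPredictor</computeroutput> are assumptions for
illustration only.</para>

<programlisting><![CDATA[
```python
N_ENTRIES = 512  # from the text: 512 entries, indexed by the low 9 bits


class IndirectPredictor:
    """Sketch of a last-target predictor for indirect branches."""

    def __init__(self) -> None:
        # Assumption: entries start empty, so a first use always mispredicts.
        self.targets = [None] * N_ENTRIES

    def predict_and_update(self, branch_addr: int, actual_target: int) -> bool:
        """Return True if the branch was mispredicted, then record the target."""
        i = branch_addr & (N_ENTRIES - 1)   # low-order 9 bits of the address
        mispredicted = self.targets[i] != actual_target
        self.targets[i] = actual_target     # predict "same as last time"
        return mispredicted


if __name__ == "__main__":
    predictor = IndirectPredictor()
    for target in (0xA, 0xA, 0xB, 0xB, 0xA):
        print(hex(target), predictor.predict_and_update(0x1000, target))
```
]]></programlisting>

<para>Two branches whose addresses share the same low 9 bits use the
same entry, so in this sketch they can evict each other's
predictions.</para>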

<para>More recent processors have better branch predictors, in
particular better indirect branch predictors. Cachegrind's predictor
design is deliberately conservative so as to be representative of the
large installed base of processors which pre-date widespread
deployment of more sophisticated indirect branch predictors. In
particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
2 have more sophisticated indirect branch predictors than modelled by
Cachegrind.</para>

<para>Cachegrind does not simulate a return stack predictor. It
assumes that processors perfectly predict function return addresses,
an assumption which is probably close to being true.</para>

<para>See Hennessy and Patterson's classic text "Computer
Architecture: A Quantitative Approach", 4th edition (2007), Section
2.3 (pages 80-89) for background on modern branch predictors.</para>

</sect2>


</sect1>



<sect1 id="cg-manual.profile" xreflabel="Profiling programs">
<title>Profiling programs</title>

<para>To gather cache profiling information about the program
<computeroutput>ls -l</computeroutput>, invoke Cachegrind like
this:</para>

<programlisting><![CDATA[
valgrind --tool=cachegrind ls -l]]></programlisting>

<para>The program will execute (slowly). Upon completion,
summary statistics that look like this will be printed:</para>

<programlisting><![CDATA[
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751==
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)]]></programlisting>

<para>Cache accesses for instruction fetches are summarised
first, giving the number of fetches made (this is the number of
instructions executed, which can be useful to know in its own
right), the number of I1 misses, and the number of L2 instruction
(<computeroutput>L2i</computeroutput>) misses.</para>

<para>Cache accesses for data follow. The information is similar
to that of the instruction fetches, except that the values are
also shown split between reads and writes (note each row's
<computeroutput>rd</computeroutput> and
<computeroutput>wr</computeroutput> values add up to the row's
total).</para>

<para>Combined instruction and data figures for the L2 cache
follow that.</para>
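
<para>As a worked check, the D1 figures in the sample summary can be
reproduced with a few lines of Python. The helper names are
hypothetical. The percentages in the sample agree with truncating to
one decimal place rather than rounding (21,905 / 10,955,517 is just
under 0.2% and is shown as 0.1%); that is an observation about these
figures, not a statement about how Cachegrind computes them.</para>

<programlisting><![CDATA[
```python
def tenths_of_percent(misses: int, refs: int) -> int:
    """Miss rate in tenths of a percent, truncated (integer arithmetic)."""
    return misses * 1000 // refs


def format_rate(misses: int, refs: int) -> str:
    """Format a miss rate with one decimal place, e.g. '0.2%'."""
    tenths = tenths_of_percent(misses, refs)
    return f"{tenths // 10}.{tenths % 10}%"


# Figures copied from the sample summary above.
D_READS, D_WRITES = 10_955_517, 4_474_773
D1_READ_MISSES, D1_WRITE_MISSES = 21_905, 19_280

if __name__ == "__main__":
    d_refs = D_READS + D_WRITES                      # the "D refs" total
    d1_misses = D1_READ_MISSES + D1_WRITE_MISSES     # the "D1 misses" total
    print(d_refs, d1_misses)
    print(format_rate(d1_misses, d_refs),
          format_rate(D1_READ_MISSES, D_READS),
          format_rate(D1_WRITE_MISSES, D_WRITES))
```
]]></programlisting>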



<sect2 id="cg-manual.outputfile" xreflabel="Output file">
<title>Output file</title>

<para>As well as printing summary information, Cachegrind also
writes line-by-line cache profiling information to a user-specified
file. By default this file is named
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>. This file
is human-readable, but is intended to be interpreted by the accompanying
program cg_annotate, described in the next section.</para>

<para>Things to note about the
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>
file:</para>

<itemizedlist>
  <listitem>
    <para>It is written every time Cachegrind is run, and will
    overwrite any existing
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>
    in the current directory (but that won't happen very often
    because it takes some time for process ids to be
    recycled).</para>
  </listitem>
  <listitem>
    <para>To use an output file name other than the default
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>,
    use the <computeroutput>--cachegrind-out-file</computeroutput>
    switch.</para>
  </listitem>
  <listitem>
    <para>It can be big: <computeroutput>ls -l</computeroutput>
    generates a file of about 350 KB. Browsing a few files and
    web pages with a Konqueror built with full debugging
    information generates a file of around 15 MB.</para>
  </listitem>
</itemizedlist>

<para>The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix
on the output file name serves two purposes. Firstly, it means you
don't have to rename old log files that you don't want to overwrite.
Secondly, and more importantly, it allows correct profiling with the
<computeroutput>--trace-children=yes</computeroutput> option of
programs that spawn child processes.</para>

</sect2>



<sect2 id="cg-manual.cgopts" xreflabel="Cachegrind options">
<title>Cachegrind options</title>

<!-- start of xi:include in the manpage -->
<para id="cg.opts.para">Using command line options, you can
manually specify the I1/D1/L2 cache
configuration to simulate. For each cache, you can specify the
size, associativity and line size. The size and line size
are measured in bytes. The three items
must be comma-separated, but with no spaces, eg:
<literallayout>    valgrind --tool=cachegrind --I1=65536,2,64</literallayout>

You can specify one, two or three of the I1/D1/L2 caches. Any level not
manually specified will be simulated using the configuration found in
the normal way (via the CPUID instruction for automagic cache
configuration, or failing that, via defaults).</para>

<para>Cache-simulation specific options are:</para>

<variablelist id="cg.opts.list">

  <varlistentry id="opt.I1" xreflabel="--I1">
    <term>
      <option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 1
      instruction cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.D1" xreflabel="--D1">
    <term>
      <option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 1
      data cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.L2" xreflabel="--L2">
    <term>
      <option><![CDATA[--L2=<size>,<associativity>,<line size> ]]></option>
    </term>
    <listitem>
      <para>Specify the size, associativity and line size of the level 2
      cache.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
    <term>
      <option><![CDATA[--cachegrind-out-file=<file> ]]></option>
    </term>
    <listitem>
      <para>Write the profile data to
      <computeroutput>file</computeroutput> rather than to the default
      output file,
      <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>. The
      <option>%p</option> and <option>%q</option> format specifiers
      can be used to embed the process ID and/or the contents of an
      environment variable in the name, as is the case for the core
      option <option>--log-file</option>. See <link
      linkend="manual-core.basicopts">here</link> for details.
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.cache-sim" xreflabel="--cache-sim">
    <term>
      <option><![CDATA[--cache-sim=no|yes [yes] ]]></option>
    </term>
    <listitem>
      <para>Enables or disables collection of cache access and miss
      counts.</para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.branch-sim" xreflabel="--branch-sim">
    <term>
      <option><![CDATA[--branch-sim=no|yes [no] ]]></option>
    </term>
    <listitem>
      <para>Enables or disables collection of branch instruction and
      misprediction counts. By default this is disabled as it
      slows Cachegrind down by approximately 25%. Note that you
      cannot specify <computeroutput>--cache-sim=no</computeroutput>
      and <computeroutput>--branch-sim=no</computeroutput>
      together, as that would leave Cachegrind with no
      information to collect.</para>
    </listitem>
  </varlistentry>

</variablelist>
<!-- end of xi:include in the manpage -->

</sect2>



<sect2 id="cg-manual.annotate" xreflabel="Annotating C/C++ programs">
<title>Annotating C/C++ programs</title>

<para>Before using cg_annotate,
it is worth widening your window to be at least 120 characters
wide if possible, as the output lines can be quite long.</para>

<para>To get a function-by-function summary, run <computeroutput>cg_annotate
&lt;filename&gt;</computeroutput> on a Cachegrind output file.</para>

<para>The output looks like this:</para>

<programlisting><![CDATA[
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
L2 cache: 262144 B, 64 B, 8-way associative
Command: concord vg_to_ucode.c
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold: 99%
Chosen for annotation:
Auto-annotation: on

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw    file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0  getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1  concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      .  vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0  concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0  ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0  concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1  ???:???
   598,068    1    1    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0  vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957  concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53  concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1  concord.c:create
   149,518    0    0    149,516      0     0         1      0      0  ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0  ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150  concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue]]></programlisting>


<para>First up is a summary of the annotation options:</para>

<itemizedlist>

  <listitem>
    <para>I1 cache, D1 cache, L2 cache: cache configuration. So
    you know the configuration with which these results were
    obtained.</para>
  </listitem>

  <listitem>
    <para>Command: the command line invocation of the program
    under examination.</para>
  </listitem>

  <listitem>
    <para>Events recorded: event abbreviations are:</para>
    <itemizedlist>
      <listitem>
        <para><computeroutput>Ir</computeroutput>: I cache reads
        (ie. instructions executed)</para>
      </listitem>
      <listitem>
        <para><computeroutput>I1mr</computeroutput>: I1 cache read
        misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>I2mr</computeroutput>: L2 cache
        instruction read misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>Dr</computeroutput>: D cache reads
        (ie. memory reads)</para>
      </listitem>
      <listitem>
        <para><computeroutput>D1mr</computeroutput>: D1 cache read
        misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>D2mr</computeroutput>: L2 cache data
        read misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>Dw</computeroutput>: D cache writes
        (ie. memory writes)</para>
      </listitem>
      <listitem>
        <para><computeroutput>D1mw</computeroutput>: D1 cache write
        misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>D2mw</computeroutput>: L2 cache data
        write misses</para>
      </listitem>
      <listitem>
        <para><computeroutput>Bc</computeroutput>: Conditional branches
        executed</para>
      </listitem>
      <listitem>
        <para><computeroutput>Bcm</computeroutput>: Conditional branches
        mispredicted</para>
      </listitem>
      <listitem>
        <para><computeroutput>Bi</computeroutput>: Indirect branches
        executed</para>
      </listitem>
      <listitem>
        <para><computeroutput>Bim</computeroutput>: Indirect branches
        mispredicted</para>
      </listitem>
    </itemizedlist>

    <para>Note that D1 total misses is given by
    <computeroutput>D1mr</computeroutput> +
    <computeroutput>D1mw</computeroutput>, and that L2 total
    misses is given by <computeroutput>I2mr</computeroutput> +
    <computeroutput>D2mr</computeroutput> +
    <computeroutput>D2mw</computeroutput>.</para>
  </listitem>

  <listitem>
    <para>Events shown: the events shown, which is a subset of the events
    gathered. This can be adjusted with the
    <computeroutput>--show</computeroutput> option.</para>
  </listitem>

  <listitem>
    <para>Event sort order: the sort order in which functions are
    shown. For example, in this case the functions are sorted
    from highest <computeroutput>Ir</computeroutput> counts to
    lowest. If two functions have identical
    <computeroutput>Ir</computeroutput> counts, they will then be
    sorted by <computeroutput>I1mr</computeroutput> counts, and
    so on. This order can be adjusted with the
    <computeroutput>--sort</computeroutput> option.</para>

    <para>Note that this dictates the order the functions appear.
    It is <command>not</command> the order in which the columns
    appear; that is dictated by the "events shown" line (and can
    be changed with the <computeroutput>--show</computeroutput>
    option).</para>
  </listitem>

  <listitem>
    <para>Threshold: cg_annotate
    by default omits functions that cause very low counts
    to avoid drowning you in information. In this case,
    cg_annotate shows summaries of the functions that account for
    99% of the <computeroutput>Ir</computeroutput> counts;
    <computeroutput>Ir</computeroutput> is chosen as the
    threshold event since it is the primary sort event. The
    threshold can be adjusted with the
    <computeroutput>--threshold</computeroutput>
    option.</para>
  </listitem>

  <listitem>
    <para>Chosen for annotation: names of files specified
    manually for annotation; in this case none.</para>
  </listitem>

  <listitem>
    <para>Auto-annotation: whether auto-annotation was requested
    via the <computeroutput>--auto=yes</computeroutput>
    option. In this case no.</para>
  </listitem>

</itemizedlist>
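
<para>One way to picture the threshold described in the list above is
the following Python sketch. It is one plausible reading of "the
functions that account for 99% of the counts" and is not taken from
cg_annotate, whose selection logic may differ in detail. The function
name <computeroutput>functions_above_threshold</computeroutput> and the
sample data are hypothetical.</para>

<programlisting><![CDATA[
```python
from typing import List, Tuple


def functions_above_threshold(counts: List[Tuple[str, int]],
                              threshold_percent: int) -> List[str]:
    """Return names, largest count first, until the threshold is covered."""
    total = sum(count for _, count in counts)
    shown: List[str] = []
    running = 0
    for name, count in sorted(counts, key=lambda item: item[1], reverse=True):
        # Stop once the functions already shown cover the requested share.
        if running * 100 >= threshold_percent * total:
            break
        shown.append(name)
        running += count
    return shown


if __name__ == "__main__":
    # Hypothetical counts for the primary sort event.
    sample = [("hot_loop", 90), ("helper", 9), ("rarely_called", 1)]
    print(functions_above_threshold(sample, 99))
```
]]></programlisting>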

<para>Then follow summary statistics for the whole
program. These are similar to the summary provided when running
<computeroutput>valgrind --tool=cachegrind</computeroutput>.</para>

<para>Then follow function-by-function statistics. Each function
is identified by a
<computeroutput>file_name:function_name</computeroutput> pair. If
a column contains only a dot it means the function never performs
that event (eg. the third row shows that
<computeroutput>strcmp()</computeroutput> contains no
instructions that write to memory). The name
<computeroutput>???</computeroutput> is used if the file name
and/or function name could not be determined from debugging
information. If most of the entries have the form
<computeroutput>???:???</computeroutput> the program probably
wasn't compiled with <computeroutput>-g</computeroutput>. If any
code was invalidated (either due to self-modifying code or
unloading of shared objects) its counts are aggregated into a
single cost centre written as
<computeroutput>(discarded):(discarded)</computeroutput>.</para>

<para>It is worth noting that functions will come both from
the profiled program (eg. <filename>concord.c</filename>)
and from libraries (eg. <filename>getc.c</filename>).</para>

<para>There are two ways to annotate source files -- by choosing
them manually, or with the
<computeroutput>--auto=yes</computeroutput> option. To do it
manually, just specify the filenames as additional arguments to
cg_annotate. For example, running
<filename>cg_annotate &lt;filename&gt;
concord.c</filename> for our example produces the same output as above
followed by an annotated version of <filename>concord.c</filename>, a
section of which looks like:</para>

<programlisting><![CDATA[
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir      I1mr I2mr Dr     D1mr D2mr Dw     D1mw D2mw

[snip]

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i < TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data->word, data->line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }]]></programlisting>

<para>(Although column widths are automatically minimised, a wide
terminal is clearly useful.)</para>

<para>Each source file is clearly marked
(<computeroutput>User-annotated source</computeroutput>) as
having been chosen manually for annotation. If the file was
found in one of the directories specified with the
<computeroutput>-I / --include</computeroutput> option, the directory
and file are both given.</para>

<para>Each line is annotated with its event counts. Events not
applicable for a line are represented by a dot. This is useful
for distinguishing between an event which cannot happen, and one
which can but did not.</para>

<para>Sometimes only a small section of a source file is
executed. To minimise uninteresting output, cg_annotate only shows
annotated lines and lines within a small distance of annotated
lines. Gaps are marked with the line numbers so you know which
part of a file the shown code comes from, e.g.:</para>

<programlisting><![CDATA[
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)]]></programlisting>

<para>The amount of context to show around annotated lines is
controlled by the <computeroutput>--context</computeroutput>
option.</para>
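
<para>How a context setting turns into shown lines and marked gaps can be
sketched as follows. This is an illustrative Python sketch, not
cg_annotate's actual implementation; the function names
<computeroutput>lines_to_show</computeroutput> and
<computeroutput>gaps</computeroutput>, the annotated line numbers and the
file length are all assumptions chosen for the example.</para>

<programlisting><![CDATA[
def lines_to_show(annotated, context, last_line):
    """Return the sorted line numbers within `context` lines of any annotated line."""
    shown = set()
    for n in annotated:
        first = max(1, n - context)
        last = min(last_line, n + context)
        shown.update(range(first, last + 1))
    return sorted(shown)


def gaps(shown):
    """Return (end of one block, start of the next) pairs where shown lines are not contiguous."""
    return [(a, b) for a, b in zip(shown, shown[1:]) if b != a + 1]


# Hypothetical input: lines 700 and 882 have counts, in a 1000-line file.
shown = lines_to_show(annotated=[700, 882], context=4, last_line=1000)
print(gaps(shown))
]]></programlisting>

<para>With a context of 4, the two annotated lines produce a single gap,
between line 704 and line 878, which is the situation the gap markers
describe.</para>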

<para>To get automatic annotation, run
<computeroutput>cg_annotate &lt;filename&gt; --auto=yes</computeroutput>.
cg_annotate will automatically annotate every source file it can
find that is mentioned in the function-by-function summary.
Therefore, the files chosen for auto-annotation are affected by
the <computeroutput>--sort</computeroutput> and
<computeroutput>--threshold</computeroutput> options. Each
source file is clearly marked (<computeroutput>Auto-annotated
source</computeroutput>) as being chosen automatically. Any
files that could not be found are mentioned at the end of the
output, e.g.:</para>

<programlisting><![CDATA[
------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
------------------------------------------------------------------
  getc.c
  ctype.c
  ../sysdeps/generic/lockfile.c]]></programlisting>

<para>This is quite common for library files, since libraries are
usually compiled with debugging information, but the source files
are often not present on a system. If a file is chosen for
annotation <command>both</command> manually and automatically, it
is marked as <computeroutput>User-annotated
source</computeroutput>. Use the <computeroutput>-I /
--include</computeroutput> option to tell cg_annotate where to look
for source files if the filenames found from the debugging
information aren't specific enough.</para>

<para>Beware that cg_annotate can take some time to digest large
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> files,
e.g. 30 seconds or more. Also beware that auto-annotation can
produce a lot of output if your program is large!</para>

</sect2>


<sect2 id="cg-manual.assembler" xreflabel="Annotating assembler programs">
<title>Annotating assembly code programs</title>

<para>Valgrind can annotate assembly code programs too, or annotate
the assembly code generated for your C program. Sometimes this is
useful for understanding what is really happening when an
interesting line of C code is translated into multiple
instructions.</para>

<para>To do this, you just need to assemble your
<computeroutput>.s</computeroutput> files with assembly-level debug
information. You can use <computeroutput>gcc
-S</computeroutput> to compile C/C++ programs to assembly code, and then
<computeroutput>gcc -g</computeroutput> on the assembly code files to
achieve this. You can then profile and annotate the assembly code source
files in the same way as C/C++ source files.</para>

</sect2>

<sect2 id="ms-manual.forkingprograms" xreflabel="Forking Programs">
<title>Forking Programs</title>
<para>If your program forks, the child will inherit all the profiling data that
has been gathered for the parent.</para>

<para>If the output file format string (controlled by
<option>--cachegrind-out-file</option>) does not contain <option>%p</option>,
then the outputs from the parent and child will be intermingled in a single
output file, which will almost certainly make it unreadable by
cg_annotate.</para>
</sect2>


</sect1>


<sect1 id="cg-manual.annopts" xreflabel="cg_annotate options">
<title>cg_annotate options</title>

<itemizedlist>

  <listitem>
    <para><computeroutput>-h, --help</computeroutput></para>
    <para><computeroutput>-v, --version</computeroutput></para>
    <para>Help and version, as usual.</para>
  </listitem>

  <listitem id="sort">
    <para><computeroutput>--sort=A,B,C</computeroutput> [default:
    order in
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>]</para>
    <para>Specifies the events upon which the sorting of the
    function-by-function entries will be based. Useful if you
    want to concentrate on e.g. I cache misses
    (<computeroutput>--sort=I1mr,I2mr</computeroutput>), or D
    cache misses
    (<computeroutput>--sort=D1mr,D2mr</computeroutput>), or L2
    misses
    (<computeroutput>--sort=D2mr,I2mr</computeroutput>).</para>
  </listitem>

  <listitem id="show">
    <para><computeroutput>--show=A,B,C</computeroutput> [default:
    all, using order in
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>]</para>
    <para>Specifies which events to show (and the column
    order). Default is to use all present in the
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file (and
    use the order in the file).</para>
  </listitem>

  <listitem id="threshold">
    <para><computeroutput>--threshold=X</computeroutput>
    [default: 99%]</para>
    <para>Sets the threshold for the function-by-function
    summary. Functions are shown that account for more than X%
    of the primary sort event. If auto-annotating, also affects
    which files are annotated.</para>

    <para>Note: thresholds can be set for more than one of the
    events by following any event given to the
    <computeroutput>--sort</computeroutput> option with a colon
    and a number (no spaces, though). E.g. if you want to see
    the functions that cover 99% of L2 read misses and 99% of L2
    write misses, use this option:</para>
    <para><computeroutput>--sort=D2mr:99,D2mw:99</computeroutput></para>
  </listitem>

  <listitem id="auto">
    <para><computeroutput>--auto=no</computeroutput> [default]</para>
    <para><computeroutput>--auto=yes</computeroutput></para>
    <para>When enabled, automatically annotates every file that
    is mentioned in the function-by-function summary that can be
    found. Also gives a list of those that couldn't be found.</para>
  </listitem>

  <listitem id="context">
    <para><computeroutput>--context=N</computeroutput> [default:
    8]</para>
    <para>Print N lines of context before and after each
    annotated line. Avoids printing large sections of source
    files that were not executed. Use a large number
    (e.g. 10,000) to show all source lines.</para>
  </listitem>

  <listitem id="include">
    <para><computeroutput>-I&lt;dir&gt;,
    --include=&lt;dir&gt;</computeroutput> [default: empty
    string]</para>
    <para>Adds a directory to the list in which to search for
    files. Multiple -I/--include options can be given to add
    multiple directories.</para>
  </listitem>

</itemizedlist>
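
<para>How a <computeroutput>--sort</computeroutput> specification with
optional per-event thresholds might be read, and how it could order the
function-by-function entries, is sketched below in Python. This is an
illustration only: the helper names, the sample counts, and the assumption
that ties on the first event are broken by the later events are not taken
from cg_annotate's source.</para>

<programlisting><![CDATA[
def parse_sort_spec(spec, default_threshold=99.0):
    """Split a spec such as 'D2mr:99,D2mw' into (event, threshold) pairs."""
    pairs = []
    for item in spec.split(","):
        event, _, threshold = item.partition(":")
        pairs.append((event, float(threshold) if threshold else default_threshold))
    return pairs


def sort_functions(counts, spec):
    """Order function names by the sort events, largest counts first."""
    events = [event for event, _ in parse_sort_spec(spec)]
    return sorted(counts,
                  key=lambda fn: [counts[fn].get(e, 0) for e in events],
                  reverse=True)


# Hypothetical per-function counts, invented for illustration.
counts = {
    "concord.c:insert": {"D2mr": 10, "D2mw": 3},
    "concord.c:get_word": {"D2mr": 10, "D2mw": 7},
    "getc.c:getc": {"D2mr": 25, "D2mw": 0},
}
print(parse_sort_spec("D2mr:99,D2mw:99"))
print(sort_functions(counts, "D2mr,D2mw"))
]]></programlisting>

<para>In this sketch an event given without a colon falls back to the
default threshold of 99, mirroring the default of
<computeroutput>--threshold</computeroutput>.</para>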



<sect2 id="cg-manual.annopts.warnings" xreflabel="Warnings">
<title>Warnings</title>

<para>There are a couple of situations in which
cg_annotate issues warnings.</para>

<itemizedlist>
  <listitem>
    <para>If a source file is more recent than the
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file.
    This is because the information in
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> is only
    recorded with line numbers, so if the line numbers change at
    all in the source (e.g. lines added, deleted, swapped), any
    annotations will be incorrect.</para>
  </listitem>
  <listitem>
    <para>If information is recorded about line numbers past the
    end of a file. This can be caused by the above problem,
    i.e. shortening the source file while using an old
    <computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file. If
    this happens, the figures for the bogus lines are printed
    anyway (clearly marked as bogus) in case they are
    important.</para>
  </listitem>
</itemizedlist>

</sect2>



<sect2 id="cg-manual.annopts.things-to-watch-out-for"
       xreflabel="Things to watch out for">
<title>Things to watch out for</title>

<para>Some odd things that can occur during annotation:</para>

<itemizedlist>
  <listitem>
    <para>If annotating at the assembler level, you might see
    something like this:</para>
<programlisting><![CDATA[
  1  0  0  .  .  .  .  .  .          leal -12(%ebp),%eax
  1  0  0  .  .  .  1  0  0          movl %eax,84(%ebx)
  2  0  0  0  0  0  1  0  0          movl $1,-20(%ebp)
  .  .  .  .  .  .  .  .  .          .align 4,0x90
  1  0  0  .  .  .  .  .  .          movl $.LnrB,%eax
  1  0  0  .  .  .  1  0  0          movl %eax,-16(%ebp)]]></programlisting>

    <para>How can the third instruction be executed twice when
    the others are executed only once? As it turns out, it
    isn't. Here's a dump of the executable, using
    <computeroutput>objdump -d</computeroutput>:</para>
<programlisting><![CDATA[
  8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
  8048f28:       89 43 54                mov    %eax,0x54(%ebx)
  8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
  8048f32:       89 f6                   mov    %esi,%esi
  8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
  8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)]]></programlisting>

    <para>Notice the extra <computeroutput>mov
    %esi,%esi</computeroutput> instruction. Where did this come
    from? The GNU assembler inserted it to serve as the two
    bytes of padding needed to align the <computeroutput>movl
    $.LnrB,%eax</computeroutput> instruction on a four-byte
    boundary, but pretended it didn't exist when adding debug
    information. Thus when Valgrind reads the debug info it
    thinks that the <computeroutput>movl
    $0x1,0xffffffec(%ebp)</computeroutput> instruction covers the
    address range 0x8048f2b--0x8048f33 by itself, and attributes
    the counts for the <computeroutput>mov
    %esi,%esi</computeroutput> to it.</para>
  </listitem>

  <listitem>
    <para>Inlined functions can cause strange results in the
    function-by-function summary. If a function
    <computeroutput>inline_me()</computeroutput> is defined in
    <filename>foo.h</filename> and inlined in the functions
    <computeroutput>f1()</computeroutput>,
    <computeroutput>f2()</computeroutput> and
    <computeroutput>f3()</computeroutput> in
    <filename>bar.c</filename>, there will not be a
    <computeroutput>foo.h:inline_me()</computeroutput> function
    entry. Instead, there will be separate function entries for
    each inlining site, i.e.
    <computeroutput>foo.h:f1()</computeroutput>,
    <computeroutput>foo.h:f2()</computeroutput> and
    <computeroutput>foo.h:f3()</computeroutput>. To find the
    total counts for
    <computeroutput>foo.h:inline_me()</computeroutput>, add up
    the counts from each entry.</para>

    <para>The reason for this is that although the debug info
    output by gcc indicates the switch from
    <filename>bar.c</filename> to <filename>foo.h</filename>, it
    doesn't indicate the name of the function in
    <filename>foo.h</filename>, so Valgrind keeps using the old
    one.</para>
  </listitem>

  <listitem>
    <para>Sometimes, the same filename might be represented with
    a relative name and with an absolute name in different parts
    of the debug info, e.g.:
    <filename>/home/user/proj/proj.h</filename> and
    <filename>../proj.h</filename>. In this case, if you use
    auto-annotation, the file will be annotated twice with the
    counts split between the two.</para>
  </listitem>

  <listitem>
    <para>Files with more than 65,535 lines cause difficulties
    for the Stabs-format debug info reader. This is because the line
    number in the <computeroutput>struct nlist</computeroutput>
    defined in <filename>a.out.h</filename> under Linux is only a
    16-bit value. Valgrind can handle some files with more than
    65,535 lines correctly by making some guesses to identify
    line number overflows. But some cases are beyond it, in
    which case you'll get a warning message explaining that
    annotations for the file might be incorrect.</para>

    <para>If you are using gcc 3.1 or later, this is most likely
    irrelevant, since gcc switched to using the more modern DWARF2
    format by default at version 3.1. DWARF2 does not have any such
    limitations on line numbers.</para>
  </listitem>

  <listitem>
    <para>If you compile some files with
    <computeroutput>-g</computeroutput> and some without, some
    events that take place in a file without debug info could be
    attributed to the last line of a file with debug info
    (whichever one gets placed before the non-debug-info file in
    the executable).</para>
  </listitem>

</itemizedlist>

<para>This list looks long, but these cases should be fairly
rare.</para>
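
<para>The alignment-padding case in the first item can be reproduced with a
small model of address-to-line lookup. The Python sketch below is an
illustration, not Valgrind's debug info reader: the line table layout, the
source line numbers 1 to 6 and the helper name
<computeroutput>line_for</computeroutput> are assumptions; only the
addresses come from the <computeroutput>objdump</computeroutput> output
above.</para>

<programlisting><![CDATA[
import bisect

# (start address, source line) pairs, as a line table might record them.
# Source lines 1..6 are hypothetical numbers for the six assembler lines.
# The padding instruction at 0x8048f32 has no entry of its own.
LINE_TABLE = [
    (0x8048f25, 1),   # leal -12(%ebp),%eax
    (0x8048f28, 2),   # movl %eax,84(%ebx)
    (0x8048f2b, 3),   # movl $1,-20(%ebp)
    (0x8048f34, 5),   # movl $.LnrB,%eax   (line 4 is the .align directive)
    (0x8048f39, 6),   # movl %eax,-16(%ebp)
]
STARTS = [start for start, _ in LINE_TABLE]


def line_for(addr):
    """Return the source line whose address range contains addr."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        raise ValueError("address precedes the line table")
    return LINE_TABLE[i][1]


# Every instruction, including the padding mov at 0x8048f32, runs once.
executed = [0x8048f25, 0x8048f28, 0x8048f2b, 0x8048f32, 0x8048f34, 0x8048f39]
ir = {}
for addr in executed:
    line = line_for(addr)
    ir[line] = ir.get(line, 0) + 1
print(ir)
]]></programlisting>

<para>Line 3 ends up with an instruction count of 2 and line 4 with none,
matching the <computeroutput>Ir</computeroutput> column in the annotated
assembler listing.</para>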

</sect2>



<sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
<title>Accuracy</title>

<para>Valgrind's cache profiling has a number of
shortcomings:</para>

<itemizedlist>
  <listitem>
    <para>It doesn't account for kernel activity -- the effect of
    system calls on the cache contents is ignored.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for other process activity.
    This is probably desirable when considering a single
    program.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for virtual-to-physical address
    mappings. Hence the simulation is not a true
    representation of what's happening in the
    cache. Most caches are physically indexed, but Cachegrind
    simulates caches using virtual addresses.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for cache misses not visible at the
    instruction level, e.g. those arising from TLB misses, or
    speculative execution.</para>
  </listitem>

  <listitem>
    <para>Valgrind will schedule
    threads differently from how they would be scheduled when running natively.
    This could warp the results for threaded programs.</para>
  </listitem>

  <listitem>
    <para>The x86/amd64 instructions <computeroutput>bts</computeroutput>,
    <computeroutput>btr</computeroutput> and
    <computeroutput>btc</computeroutput> will incorrectly be
    counted as doing a data read if both the arguments are
    registers, e.g.:</para>
<programlisting><![CDATA[
    btsl %eax, %edx]]></programlisting>

    <para>This should only happen rarely.</para>
  </listitem>

  <listitem>
    <para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
    (e.g. <computeroutput>fsave</computeroutput>) are treated as
    though they only access 16 bytes. These instructions seem to
    be rare so hopefully this won't affect accuracy much.</para>
  </listitem>

</itemizedlist>
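
<para>To see why simulating with virtual addresses can differ from a
physically indexed cache, consider how a set index is derived from an
address. The Python sketch below is a toy model, not Cachegrind's
simulator: the cache geometry, the page size and the physical frame
address are all assumed values picked for illustration.</para>

<programlisting><![CDATA[
LINE_SIZE = 64      # bytes per cache line (assumed)
NUM_SETS = 512      # e.g. a 256KB, 8-way cache (assumed)
PAGE_SIZE = 4096    # bytes per page (assumed)


def set_index(addr):
    """Return the cache set an address maps to in this toy model."""
    return (addr // LINE_SIZE) % NUM_SETS


virtual = 0x0804A040
# Hypothetical mapping: the virtual page is backed by the physical frame
# at 0x3F7C5000; the offset within the page is unchanged.
physical = 0x3F7C5000 + (virtual % PAGE_SIZE)

print(set_index(virtual), set_index(physical))
]]></programlisting>

<para>Here the index bits extend above the page offset, so the same byte
maps to set 129 when indexed by its virtual address and to set 321 when
indexed by its physical address. A simulation driven by virtual addresses
therefore sees different conflicts from the real cache.</para>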

<para>Another thing worth noting is that results are very sensitive.
Changing the size of the executable being profiled, or the sizes
of any of the shared libraries it uses, or even the length of their
file names, can perturb the results. Variations will be small, but
don't expect perfectly repeatable results if your program changes at
all.</para>

<para>More recent GNU/Linux distributions do address space
randomisation, in which identical runs of the same program have their
shared libraries loaded at different locations, as a security measure.
This also perturbs the results.</para>

<para>While these factors mean you shouldn't trust the results to
be super-accurate, hopefully they should be close enough to be
useful.</para>

</sect2>

</sect1>



<sect1 id="cg-manual.cg_merge" xreflabel="cg_merge">
<title>Merging profiles with cg_merge</title>

<para>
cg_merge is a simple program which
reads multiple profile files, as created by Cachegrind, merges them
together, and writes the results into another file in the same format.
You can then examine the merged results using
<computeroutput>cg_annotate &lt;filename&gt;</computeroutput>, as
described above. The merging functionality might be useful if you
want to aggregate costs over multiple runs of the same program, or
from a single parallel run with multiple instances of the same
program.</para>

<para>
cg_merge is invoked as follows:
</para>

<programlisting><![CDATA[
cg_merge -o outputfile file1 file2 file3 ...]]></programlisting>

<para>
It reads and checks <computeroutput>file1</computeroutput>, then reads
and checks <computeroutput>file2</computeroutput> and merges it into
the running totals, then the same with
<computeroutput>file3</computeroutput>, etc. The final results are
written to <computeroutput>outputfile</computeroutput>, or to standard
output if no output file is specified.</para>

<para>
Costs are summed on a per-function, per-line and per-instruction
basis. Because of this, the order in which the input files are given
does not matter, although you should take care to only mention each
file once, since any file mentioned twice will be added in twice.</para>

<para>
cg_merge does not attempt to check
that the input files come from runs of the same executable. It will
happily merge together profile files from completely unrelated
programs. It does however check that the
<computeroutput>Events:</computeroutput> lines of all the inputs are
identical, so as to ensure that the addition of costs makes sense.
For example, it would be nonsensical for it to add a number indicating
D1 read references to a number from a different file indicating L2
write misses.</para>

<para>
A number of other syntax and sanity checks are done whilst reading the
inputs. cg_merge will stop and
attempt to print a helpful error message if any of the input files
fail these checks.</para>
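
<para>The merging rule described above can be sketched in Python as
follows. This is a minimal illustration rather than cg_merge's actual code:
the in-memory profile layout (a dict with
<computeroutput>"events"</computeroutput> and
<computeroutput>"costs"</computeroutput> keys), the function name
<computeroutput>merge_profiles</computeroutput> and the sample counts are
all assumptions.</para>

<programlisting><![CDATA[
from collections import defaultdict


def merge_profiles(profiles):
    """Sum counts keyed by (file, function, line) across several profiles.

    Each profile is a dict with "events" (a list of event names) and
    "costs" (a mapping from (file, function, line) to a list of counts).
    """
    events = profiles[0]["events"]
    totals = defaultdict(lambda: [0] * len(events))
    for profile in profiles:
        if profile["events"] != events:
            raise ValueError("events lines differ; refusing to merge")
        for key, counts in profile["costs"].items():
            for i, count in enumerate(counts):
                totals[key][i] += count
    return {"events": events, "costs": dict(totals)}


# Hypothetical profiles from two runs, invented for illustration.
run1 = {"events": ["Ir", "Dr"],
        "costs": {("concord.c", "insert", 725): [10, 4]}}
run2 = {"events": ["Ir", "Dr"],
        "costs": {("concord.c", "insert", 725): [5, 1],
                  ("concord.c", "main", 40): [2, 0]}}

merged = merge_profiles([run1, run2])
print(merged["costs"][("concord.c", "insert", 725)])
]]></programlisting>

<para>Because the merge is a plain sum, passing the runs in the opposite
order gives the same totals, and passing the same run twice counts it
twice.</para>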

</sect1>


<sect1 id="cg-manual.acting-on"
       xreflabel="Acting on Cachegrind's information">
<title>Acting on Cachegrind's information</title>
<para>
So, you've managed to profile your program with Cachegrind. Now what?
What's the best way to actually act on the information it provides to speed
up your program? Here are some rules of thumb that we have found to be
useful.</para>

<para>
First of all, the global hit/miss rate numbers are not that useful. If you
have multiple programs or multiple runs of a program, comparing the numbers
might identify if any are outliers and worthy of closer investigation.
Otherwise, they're not enough to act on.</para>

<para>
The line-by-line source code annotations are much more useful. In our
experience, the best place to start is by looking at the
<computeroutput>Ir</computeroutput> numbers. They simply measure how many
instructions were executed for each line, and don't include any cache
information, but they can still be very useful for identifying
bottlenecks.</para>

<para>
After that, we have found that L2 misses are typically a much bigger source
of slow-downs than L1 misses. So it's worth looking for any snippets of
code that cause a high proportion of the L2 misses. If you find any, it's
still not always easy to work out how to improve things. You need to have a
reasonable understanding of how caches work, the principles of locality, and
your program's data access patterns. Improving things may require
redesigning a data structure, for example.</para>

<para>
In short, Cachegrind can tell you where some of the bottlenecks in your code
are, but it can't tell you how to fix them. You have to work that out for
yourself. But at least you have the information!
</para>

</sect1>

<sect1 id="cg-manual.impl-details"
       xreflabel="Implementation details">
<title>Implementation details</title>
<para>
This section talks about details you don't need to know about in order to
use Cachegrind, but may be of interest to some people.
</para>

<sect2 id="cg-manual.impl-details.how-cg-works"
       xreflabel="How Cachegrind works">
<title>How Cachegrind works</title>
<para>The best reference for understanding how Cachegrind works is chapter 3 of
"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
is available on the <ulink url="&vg-pubs;">Valgrind publications
page</ulink>.</para>
</sect2>

<sect2 id="cg-manual.impl-details.file-format"
       xreflabel="Cachegrind output file format">
<title>Cachegrind output file format</title>
<para>The file format is fairly straightforward, basically giving the
cost centre for every line, grouped by files and
functions. Total counts (e.g. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra adds can make a
sizeable difference.</para>

<para>The file format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
  <listitem>
    <para><computeroutput>non_nl_string</computeroutput> is any
    string not containing a newline.</para>
  </listitem>
  <listitem>
    <para><computeroutput>cmd</computeroutput> is a string holding the
    command line of the profiled program.</para>
  </listitem>
  <listitem>
    <para><computeroutput>event</computeroutput> is a string containing
    no whitespace.</para>
  </listitem>
  <listitem>
    <para><computeroutput>filename</computeroutput> and
    <computeroutput>fn_name</computeroutput> are strings.</para>
  </listitem>
  <listitem>
    <para><computeroutput>num</computeroutput> and
    <computeroutput>line_num</computeroutput> are decimal
    numbers.</para>
  </listitem>
  <listitem>
    <para><computeroutput>ws</computeroutput> is whitespace.</para>
  </listitem>
</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing simulation
specific information, e.g. for giving the cache configuration for
cache simulation.</para>

<para>More than one line of info can be presented for each file/fn/line number.
In such cases, the counts for the named events will be accumulated.</para>

<para>Counts can be "." to represent zero. This makes the files easier for
humans to read.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If the number in
a <computeroutput>count_line</computeroutput> is less, cg_annotate
treats those missing as though they were a "." entry. This saves space.
</para>

<para>A <computeroutput>file_line</computeroutput> changes the
current file name. A <computeroutput>fn_line</computeroutput>
changes the current function name. A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. A
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> will normally be
immediately followed by a <computeroutput>fn_line</computeroutput>. But it
doesn't have to be.</para>
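
<para>The rules above are enough to write a small reader for the format.
The Python sketch below is one possible reading of the grammar, not the
parser used by cg_annotate; the function name
<computeroutput>parse_cachegrind_out</computeroutput>, the shape of the
returned dict and the sample file contents are assumptions made for
illustration.</para>

<programlisting><![CDATA[
def parse_cachegrind_out(text):
    """Parse the text of a Cachegrind output file into a dict."""
    result = {"desc": [], "cmd": None, "events": [], "costs": {}, "summary": []}
    current_file = current_fn = None

    def to_counts(fields):
        counts = [0 if f == "." else int(f) for f in fields]
        # Missing trailing counts are treated as ".", i.e. zero.
        return counts + [0] * (len(result["events"]) - len(counts))

    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("desc:"):
            result["desc"].append(line[len("desc:"):].strip())
        elif line.startswith("cmd:"):
            result["cmd"] = line[len("cmd:"):].strip()
        elif line.startswith("events:"):
            result["events"] = line[len("events:"):].split()
        elif line.startswith("summary:"):
            result["summary"] = to_counts(line[len("summary:"):].split())
        elif line.startswith("fl="):
            current_file = line[len("fl="):]
        elif line.startswith("fn="):
            current_fn = line[len("fn="):]
        else:
            # A count_line: a line number followed by counts.
            fields = line.split()
            key = (current_file, current_fn, int(fields[0]))
            counts = to_counts(fields[1:])
            old = result["costs"].get(key, [0] * len(counts))
            # Repeated entries for the same file/fn/line are accumulated.
            result["costs"][key] = [a + b for a, b in zip(old, counts)]
    return result


# Invented sample input, not output from a real run.
SAMPLE = """\
desc: I1 cache: 65536 B, 64 B, 2-way associative
cmd: ./concord
events: Ir I1mr I2mr
fl=concord.c
fn=main
10 5 1 .
10 2
fn=insert
20 7 . .
summary: 14 1 0
"""
profile = parse_cachegrind_out(SAMPLE)
print(profile["costs"][("concord.c", "main", 10)])
]]></programlisting>

<para>In the sample, line 10 of <computeroutput>main</computeroutput>
appears twice, once with a short count list, so its counts are padded with
zeros and then accumulated.</para>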


</sect2>

</sect1>
</chapter>