<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
4
5<chapter id="cg-manual" xreflabel="Cachegrind: a cache-miss profiler">
6<title>Cachegrind: a cache profiler</title>
7
8<para>Detailed technical documentation on how Cachegrind works is
9available in <xref linkend="cg-tech-docs"/>. If you only want to know
10how to <command>use</command> it, this is the page you need to
11read.</para>
12
13
14<sect1 id="cg-manual.cache" xreflabel="Cache profiling">
15<title>Cache profiling</title>
16
17<para>To use this tool, you must specify
18<computeroutput>--tool=cachegrind</computeroutput> on the
19Valgrind command line.</para>
20
21<para>Cachegrind is a tool for doing cache simulations and
22annotating your source line-by-line with the number of cache
23misses. In particular, it records:</para>
24<itemizedlist>
25 <listitem>
26 <para>L1 instruction cache reads and misses;</para>
27 </listitem>
28 <listitem>
29 <para>L1 data cache reads and read misses, writes and write
30 misses;</para>
31 </listitem>
32 <listitem>
  <para>L2 unified cache reads and read misses, writes and
  write misses.</para>
35 </listitem>
36</itemizedlist>
37
<para>On a modern machine, an L1 miss will typically cost
around 10 cycles, and an L2 miss can cost as much as 200
cycles. Detailed cache profiling can be very useful for improving
the performance of your program.</para>
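
<para>To give a feel for what those figures imply, the following
minimal sketch turns miss counts into a rough stall estimate. The
function name, the fixed per-miss penalties and the simple additive
model are assumptions made for illustration; Cachegrind itself
reports counts, not cycles.</para>

<programlisting><![CDATA[
```python
# Approximate penalties quoted in the text; real costs vary by machine.
L1_MISS_CYCLES = 10
L2_MISS_CYCLES = 200

def estimated_miss_cycles(l1_misses, l2_misses):
    """Rough stall estimate: every miss pays a fixed penalty (assumed model)."""
    return l1_misses * L1_MISS_CYCLES + l2_misses * L2_MISS_CYCLES

if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only.
    print(estimated_miss_cycles(l1_misses=40_000, l2_misses=20_000))
```
]]></programlisting>

<para>Even with far fewer L2 misses than L1 misses, the L2 term
tends to dominate such an estimate, which is why the L2 columns
usually deserve attention first.</para>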
42
43<para>Also, since one instruction cache read is performed per
44instruction executed, you can find out how many instructions are
45executed per line, which can be useful for traditional profiling
46and test coverage.</para>
47
48<para>Any feedback, bug-fixes, suggestions, etc, welcome.</para>
49
50
51
52<sect2 id="cg-manual.overview" xreflabel="Overview">
53<title>Overview</title>
54
55<para>First off, as for normal Valgrind use, you probably want to
56compile with debugging info (the
57<computeroutput>-g</computeroutput> flag). But by contrast with
58normal Valgrind use, you probably <command>do</command> want to turn
59optimisation on, since you should profile your program as it will
60be normally run.</para>
61
62<para>The two steps are:</para>
63<orderedlist>
64 <listitem>
65 <para>Run your program with <computeroutput>valgrind
66 --tool=cachegrind</computeroutput> in front of the normal
67 command line invocation. When the program finishes,
68 Cachegrind will print summary cache statistics. It also
69 collects line-by-line information in a file
70 <computeroutput>cachegrind.out.pid</computeroutput>, where
71 <computeroutput>pid</computeroutput> is the program's process
72 id.</para>
73
74 <para>This step should be done every time you want to collect
75 information about a new program, a changed program, or about
76 the same program with different input.</para>
77 </listitem>
78
79 <listitem>
  <para>Generate a function-by-function summary, and possibly
  annotate source files, using the supplied
  <computeroutput>cg_annotate</computeroutput> program. Source
  files to annotate can be specified manually on the command
  line, or "interesting" source files can be annotated
  automatically with the
  <computeroutput>--auto=yes</computeroutput> option. You can
  annotate C/C++ files or assembly language files equally
  easily.</para>

  <para>This step can be performed as many times as you like
  for each run of Step 1. You may want to do multiple annotations
  showing different information each time.</para>
93 </listitem>
94
95</orderedlist>
96
97<para>The steps are described in detail in the following
98sections.</para>
99
100</sect2>
101
102
<sect2 id="cache-sim" xreflabel="Cache simulation specifics">
<title>Cache simulation specifics</title>
105
106<para>Cachegrind uses a simulation for a machine with a split L1
107cache and a unified L2 cache. This configuration is used for all
108(modern) x86-based machines we are aware of. Old Cyrix CPUs had
109a unified I and D L1 cache, but they are ancient history
110now.</para>
111
112<para>The more specific characteristics of the simulation are as
113follows.</para>
114
115<itemizedlist>
116
117 <listitem>
118 <para>Write-allocate: when a write miss occurs, the block
119 written to is brought into the D1 cache. Most modern caches
120 have this property.</para>
121 </listitem>
122
123 <listitem>
124 <para>Bit-selection hash function: the line(s) in the cache
125 to which a memory block maps is chosen by the middle bits
126 M--(M+N-1) of the byte address, where:</para>
127 <itemizedlist>
128 <listitem>
129 <para>line size = 2^M bytes</para>
130 </listitem>
131 <listitem>
    <para>(cache size / line size / associativity) = 2^N sets</para>
133 </listitem>
134 </itemizedlist>
135 </listitem>
136
137 <listitem>
138 <para>Inclusive L2 cache: the L2 cache replicates all the
139 entries of the L1 cache. This is standard on Pentium chips,
140 but AMD Athlons use an exclusive L2 cache that only holds
141 blocks evicted from L1. Ditto AMD Durons and most modern
142 VIAs.</para>
143 </listitem>
144
145</itemizedlist>
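
<para>The bit-selection rule can be sketched as follows. This is
an illustration rather than Cachegrind's own code: the function
name is hypothetical, and it assumes the number of sets is the
cache size divided by both the line size and the associativity
(for a direct-mapped cache this is simply cache size / line
size).</para>

<programlisting><![CDATA[
```python
def set_index(addr, cache_size, line_size, assoc):
    """Return the set selected by bits M..(M+N-1) of a byte address."""
    m = line_size.bit_length() - 1               # line size = 2^M bytes
    num_sets = cache_size // line_size // assoc  # assumed to equal 2^N
    return (addr >> m) & (num_sets - 1)

# Hypothetical 64 KB, 2-way cache with 64-byte lines: M = 6, N = 9.
for addr in (0x0, 0x40, 0x8000, 0x8048f2b):
    print(hex(addr), "->", set_index(addr, 65536, 64, 2))
```
]]></programlisting>

<para>Addresses that differ only in the low M bits share a line,
and addresses that differ by a multiple of (number of sets * line
size) compete for the same set.</para>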
146
147<para>The cache configuration simulated (cache size,
148associativity and line size) is determined automagically using
149the CPUID instruction. If you have an old machine that (a)
150doesn't support the CPUID instruction, or (b) supports it in an
151early incarnation that doesn't give any cache information, then
152Cachegrind will fall back to using a default configuration (that
153of a model 3/4 Athlon). Cachegrind will tell you if this
154happens. You can manually specify one, two or all three levels
155(I1/D1/L2) of the cache from the command line using the
156<computeroutput>--I1</computeroutput>,
157<computeroutput>--D1</computeroutput> and
158<computeroutput>--L2</computeroutput> options.</para>
159
160
161<para>Other noteworthy behaviour:</para>
162
163<itemizedlist>
164 <listitem>
165 <para>References that straddle two cache lines are treated as
166 follows:</para>
167 <itemizedlist>
168 <listitem>
169 <para>If both blocks hit --&gt; counted as one hit</para>
170 </listitem>
171 <listitem>
172 <para>If one block hits, the other misses --&gt; counted
173 as one miss.</para>
174 </listitem>
175 <listitem>
176 <para>If both blocks miss --&gt; counted as one miss (not
177 two)</para>
178 </listitem>
179 </itemizedlist>
180 </listitem>
181
182 <listitem>
183 <para>Instructions that modify a memory location
184 (eg. <computeroutput>inc</computeroutput> and
185 <computeroutput>dec</computeroutput>) are counted as doing
186 just a read, ie. a single data reference. This may seem
187 strange, but since the write can never cause a miss (the read
188 guarantees the block is in the cache) it's not very
189 interesting.</para>
190
191 <para>Thus it measures not the number of times the data cache
192 is accessed, but the number of times a data cache miss could
193 occur.</para>
194 </listitem>
195
196</itemizedlist>
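
<para>The counting rule for references that straddle two cache
lines can be sketched like this. The function and the
<computeroutput>is_cached</computeroutput> callback are
hypothetical names used only for illustration; a real simulator
would also update the cache contents on each access.</para>

<programlisting><![CDATA[
```python
def classify_reference(addr, size, line_size, is_cached):
    """Return "hit" or "miss" for one data reference.

    is_cached is a hypothetical callback: line number -> bool.
    """
    first_line = addr // line_size
    last_line = (addr + size - 1) // line_size
    # A straddling reference touches two lines; it counts as a single
    # hit only if both are present, otherwise as a single miss.
    if is_cached(first_line) and is_cached(last_line):
        return "hit"
    return "miss"

cached_lines = {1}  # hypothetical cache contents: only line 1 is present
# 8 bytes at address 60 straddle lines 0 and 1 (64-byte lines).
print(classify_reference(60, 8, 64, cached_lines.__contains__))
# 8 bytes at address 64 lie entirely within line 1.
print(classify_reference(64, 8, 64, cached_lines.__contains__))
```
]]></programlisting>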
197
198<para>If you are interested in simulating a cache with different
199properties, it is not particularly hard to write your own cache
200simulator, or to modify the existing ones in
201<computeroutput>vg_cachesim_I1.c</computeroutput>,
202<computeroutput>vg_cachesim_D1.c</computeroutput>,
203<computeroutput>vg_cachesim_L2.c</computeroutput> and
204<computeroutput>vg_cachesim_gen.c</computeroutput>. We'd be
205interested to hear from anyone who does.</para>
206
207</sect2>
208
209</sect1>
210
211
212
213<sect1 id="cg-manual.profile" xreflabel="Profiling programs">
214<title>Profiling programs</title>
215
216<para>To gather cache profiling information about the program
217<computeroutput>ls -l</computeroutput>, invoke Cachegrind like
218this:</para>
219
<programlisting><![CDATA[
valgrind --tool=cachegrind ls -l]]></programlisting>
222
223<para>The program will execute (slowly). Upon completion,
224summary statistics that look like this will be printed:</para>
225
<programlisting><![CDATA[
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751==
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)]]></programlisting>
241
242<para>Cache accesses for instruction fetches are summarised
243first, giving the number of fetches made (this is the number of
244instructions executed, which can be useful to know in its own
245right), the number of I1 misses, and the number of L2 instruction
246(<computeroutput>L2i</computeroutput>) misses.</para>
247
248<para>Cache accesses for data follow. The information is similar
249to that of the instruction fetches, except that the values are
250also shown split between reads and writes (note each row's
251<computeroutput>rd</computeroutput> and
252<computeroutput>wr</computeroutput> values add up to the row's
253total).</para>
254
255<para>Combined instruction and data figures for the L2 cache
256follow that.</para>
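
<para>The miss rates in the summary can be reproduced from the
raw counts. The sketch below assumes each rate is simply misses
divided by the corresponding number of references, which is
consistent with the figures shown; the function name is
hypothetical.</para>

<programlisting><![CDATA[
```python
def miss_rate(misses, refs):
    """Miss rate as a percentage of references (0 if there were none)."""
    return 100.0 * misses / refs if refs else 0.0

# Counts copied from the "ls -l" summary above.
i_refs, d_refs = 27_742_716, 15_430_290
d1_rate = miss_rate(41_185, d_refs)
l2_rate = miss_rate(23_360, i_refs + d_refs)

print(f"D1 miss rate: {d1_rate:.2f}%")
print(f"L2 miss rate: {l2_rate:.2f}%")
```
]]></programlisting>

<para>The summary prints only one decimal place, so small rates
such as these appear there as 0.2% and 0.0%.</para>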
257
258
259
260<sect2 id="cg-manual.outputfile" xreflabel="Output file">
261<title>Output file</title>
262
263<para>As well as printing summary information, Cachegrind also
264writes line-by-line cache profiling information to a file named
265<computeroutput>cachegrind.out.pid</computeroutput>. This file
266is human-readable, but is best interpreted by the accompanying
267program <computeroutput>cg_annotate</computeroutput>, described
268in the next section.</para>
269
270<para>Things to note about the
271<computeroutput>cachegrind.out.pid</computeroutput>
272file:</para>
273
274<itemizedlist>
275 <listitem>
276 <para>It is written every time Cachegrind is run, and will
277 overwrite any existing
278 <computeroutput>cachegrind.out.pid</computeroutput>
279 in the current directory (but that won't happen very often
280 because it takes some time for process ids to be
281 recycled).</para>
282 </listitem>
283 <listitem>
284 <para>It can be huge: <computeroutput>ls -l</computeroutput>
285 generates a file of about 350KB. Browsing a few files and
286 web pages with a Konqueror built with full debugging
287 information generates a file of around 15 MB.</para>
288 </listitem>
289</itemizedlist>
290
<para>The <computeroutput>.pid</computeroutput> suffix
on the output file name serves
two purposes. Firstly, it means you don't have to rename old log
files that you don't want to overwrite. Secondly, and more
importantly, it allows correct profiling with the
<computeroutput>--trace-children=yes</computeroutput> option of
programs that spawn child processes.</para>
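
<para>When several runs (or several child processes) have left
output files behind, a small helper can list them by process id.
This is only a sketch: the function names are hypothetical, and it
relies on nothing more than the
<computeroutput>cachegrind.out.pid</computeroutput> naming scheme
described above.</para>

<programlisting><![CDATA[
```python
import glob
import os

PREFIX = "cachegrind.out."

def pid_of(path):
    """Extract the process id from a cachegrind.out.<pid> file name."""
    name = os.path.basename(path)
    if not name.startswith(PREFIX):
        raise ValueError(f"not a Cachegrind output file: {name}")
    return int(name[len(PREFIX):])

def output_files(directory="."):
    """Map pid -> path for every cachegrind.out.<pid> file in a directory."""
    result = {}
    for path in glob.glob(os.path.join(directory, PREFIX + "*")):
        suffix = os.path.basename(path)[len(PREFIX):]
        if suffix.isdigit():          # skip renamed copies, backups, etc.
            result[int(suffix)] = path
    return result

if __name__ == "__main__":
    for pid, path in sorted(output_files().items()):
        print(pid, path)
```
]]></programlisting>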
299
300</sect2>
301
302
303
304<sect2 id="cg-manual.cgopts" xreflabel="Cachegrind options">
305<title>Cachegrind options</title>
306
307<para>Cache-simulation specific options are:</para>
308
<screen><![CDATA[
--I1=<size>,<associativity>,<line_size>
--D1=<size>,<associativity>,<line_size>
--L2=<size>,<associativity>,<line_size>

[default: uses CPUID for automagic cache configuration]]]></screen>
315
316<para>Manually specifies the I1/D1/L2 cache configuration, where
317<computeroutput>size</computeroutput> and
318<computeroutput>line_size</computeroutput> are measured in bytes.
319The three items must be comma-separated, but with no spaces,
320eg:</para>
321
<programlisting><![CDATA[
valgrind --tool=cachegrind --I1=65536,2,64]]></programlisting>
324
325<para>You can specify one, two or three of the I1/D1/L2 caches.
326Any level not manually specified will be simulated using the
327configuration found in the normal way (via the CPUID instruction,
328or failing that, via defaults).</para>
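
<para>The shape of these option values can be illustrated with a
small parser. This is a sketch, not Valgrind's own option
handling: the function name is hypothetical, and the divisibility
check is an assumed sanity test rather than a documented
rule.</para>

<programlisting><![CDATA[
```python
def parse_cache_option(value):
    """Parse "<size>,<associativity>,<line_size>" into a tuple of ints."""
    if " " in value:
        raise ValueError("no spaces are allowed between the three items")
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError("expected exactly three comma-separated items")
    size, assoc, line_size = (int(p) for p in parts)
    # Assumed sanity check, not necessarily what Cachegrind enforces:
    # the cache must divide evenly into sets of `assoc` lines.
    if min(size, assoc, line_size) <= 0 or size % (assoc * line_size) != 0:
        raise ValueError("size must be a positive multiple of "
                         "associativity * line_size")
    return size, assoc, line_size

if __name__ == "__main__":
    print(parse_cache_option("65536,2,64"))
```
]]></programlisting>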
329
330</sect2>
331
332
333
334<sect2 id="cg-manual.annotate" xreflabel="Annotating C/C++ programs">
335<title>Annotating C/C++ programs</title>
336
337<para>Before using <computeroutput>cg_annotate</computeroutput>,
338it is worth widening your window to be at least 120-characters
339wide if possible, as the output lines can be quite long.</para>
340
341<para>To get a function-by-function summary, run
342<computeroutput>cg_annotate --pid</computeroutput> in a directory
343containing a <computeroutput>cachegrind.out.pid</computeroutput>
344file. The <emphasis>--pid</emphasis> is required so that
345<computeroutput>cg_annotate</computeroutput> knows which log file
346to use when several are present.</para>
347
348<para>The output looks like this:</para>
349
<programlisting><![CDATA[
--------------------------------------------------------------------------------
I1 cache:              65536 B, 64 B, 2-way associative
D1 cache:              65536 B, 64 B, 2-way associative
L2 cache:              262144 B, 64 B, 8-way associative
Command:               concord vg_to_ucode.c
Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold:             99%
Chosen for annotation:
Auto-annotation:       on

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw    file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0  getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1  concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      .  vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0  concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0  ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0  concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1  ???:???
   598,068    1    1    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0  vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957  concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53  concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1  concord.c:create
   149,518    0    0    149,516      0     0         1      0      0  ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0  ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150  concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue]]></programlisting>
389
390
391<para>First up is a summary of the annotation options:</para>
392
393<itemizedlist>
394
395 <listitem>
396 <para>I1 cache, D1 cache, L2 cache: cache configuration. So
397 you know the configuration with which these results were
398 obtained.</para>
399 </listitem>
400
401 <listitem>
402 <para>Command: the command line invocation of the program
403 under examination.</para>
404 </listitem>
405
406 <listitem>
407 <para>Events recorded: event abbreviations are:</para>
408 <itemizedlist>
409 <listitem>
410 <para><computeroutput>Ir </computeroutput>: I cache reads
411 (ie. instructions executed)</para>
412 </listitem>
413 <listitem>
414 <para><computeroutput>I1mr</computeroutput>: I1 cache read
415 misses</para>
416 </listitem>
417 <listitem>
418 <para><computeroutput>I2mr</computeroutput>: L2 cache
419 instruction read misses</para>
420 </listitem>
421 <listitem>
422 <para><computeroutput>Dr </computeroutput>: D cache reads
423 (ie. memory reads)</para>
424 </listitem>
425 <listitem>
426 <para><computeroutput>D1mr</computeroutput>: D1 cache read
427 misses</para>
428 </listitem>
429 <listitem>
430 <para><computeroutput>D2mr</computeroutput>: L2 cache data
431 read misses</para>
432 </listitem>
433 <listitem>
434 <para><computeroutput>Dw </computeroutput>: D cache writes
435 (ie. memory writes)</para>
436 </listitem>
437 <listitem>
438 <para><computeroutput>D1mw</computeroutput>: D1 cache write
439 misses</para>
440 </listitem>
441 <listitem>
442 <para><computeroutput>D2mw</computeroutput>: L2 cache data
443 write misses</para>
444 </listitem>
445 </itemizedlist>
446
   <para>Note that D1 total misses are given by
   <computeroutput>D1mr</computeroutput> +
   <computeroutput>D1mw</computeroutput>, and that L2 total
   misses are given by <computeroutput>I2mr</computeroutput> +
   <computeroutput>D2mr</computeroutput> +
   <computeroutput>D2mw</computeroutput>.</para>
453 </listitem>
454
455 <listitem>
456 <para>Events shown: the events shown (a subset of events
457 gathered). This can be adjusted with the
458 <computeroutput>--show</computeroutput> option.</para>
459 </listitem>
460
461 <listitem>
462 <para>Event sort order: the sort order in which functions are
463 shown. For example, in this case the functions are sorted
464 from highest <computeroutput>Ir</computeroutput> counts to
465 lowest. If two functions have identical
466 <computeroutput>Ir</computeroutput> counts, they will then be
467 sorted by <computeroutput>I1mr</computeroutput> counts, and
468 so on. This order can be adjusted with the
469 <computeroutput>--sort</computeroutput> option.</para>
470
471 <para>Note that this dictates the order the functions appear.
472 It is <command>not</command> the order in which the columns
473 appear; that is dictated by the "events shown" line (and can
474 be changed with the <computeroutput>--show</computeroutput>
475 option).</para>
476 </listitem>
477
478 <listitem>
  <para>Threshold: <computeroutput>cg_annotate</computeroutput>
  by default omits functions that account for very low event
  counts, to avoid drowning you in information. In this case,
  cg_annotate shows summaries of the functions that account for
  99% of the <computeroutput>Ir</computeroutput> counts;
  <computeroutput>Ir</computeroutput> is chosen as the
  threshold event since it is the primary sort event. The
  threshold can be adjusted with the
  <computeroutput>--threshold</computeroutput>
  option.</para>
489 </listitem>
490
491 <listitem>
492 <para>Chosen for annotation: names of files specified
493 manually for annotation; in this case none.</para>
494 </listitem>
495
496 <listitem>
497 <para>Auto-annotation: whether auto-annotation was requested
498 via the <computeroutput>--auto=yes</computeroutput>
499 option. In this case no.</para>
500 </listitem>
501
502</itemizedlist>
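
<para>The interaction between the sort order and the threshold
can be sketched as follows. This is one plausible reading of the
behaviour described above, not
<computeroutput>cg_annotate</computeroutput>'s actual code: the
function name and the sample counts are hypothetical, and the
exact cut-off rule is an assumption.</para>

<programlisting><![CDATA[
```python
def select_functions(counts, sort_events, threshold=99.0):
    """Return function names in sorted order, stopping once the functions
    already listed cover `threshold` percent of the primary sort event.

    counts maps "file:function" -> {event name: count}.
    """
    primary = sort_events[0]
    # Sort by the first event, breaking ties with the following ones.
    ordered = sorted(
        counts,
        key=lambda name: tuple(counts[name][e] for e in sort_events),
        reverse=True,
    )
    total = sum(c[primary] for c in counts.values())
    shown, covered = [], 0
    for name in ordered:
        if total and 100.0 * covered / total >= threshold:
            break
        shown.append(name)
        covered += counts[name][primary]
    return shown

# Hypothetical counts for illustration.
counts = {
    "a.c:f": {"Ir": 900, "I1mr": 1},
    "a.c:g": {"Ir": 90,  "I1mr": 5},
    "b.c:h": {"Ir": 90,  "I1mr": 7},
    "b.c:k": {"Ir": 1,   "I1mr": 0},
}
print(select_functions(counts, ["Ir", "I1mr"], threshold=99.0))
```
]]></programlisting>

<para>Here <computeroutput>b.c:h</computeroutput> sorts ahead of
<computeroutput>a.c:g</computeroutput> because their
<computeroutput>Ir</computeroutput> counts tie and the second sort
event decides, while <computeroutput>b.c:k</computeroutput> is
omitted because the first three functions already cover more than
99% of <computeroutput>Ir</computeroutput>.</para>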
503
504<para>Then follows summary statistics for the whole
505program. These are similar to the summary provided when running
506<computeroutput>valgrind
507--tool=cachegrind</computeroutput>.</para>
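
<para>As a cross-check, the totals printed by Cachegrind can be
recomputed from the PROGRAM TOTALS row. The sketch below simply
adds the relevant event columns; the dictionary is a hand-copied
transcription of the example row, not something read from a
file.</para>

<programlisting><![CDATA[
```python
# PROGRAM TOTALS row from the example output above.
totals = {
    "Ir": 27_742_716, "I1mr": 276, "I2mr": 275,
    "Dr": 10_955_517, "D1mr": 21_905, "D2mr": 3_987,
    "Dw": 4_474_773, "D1mw": 19_280, "D2mw": 19_098,
}

d_refs = totals["Dr"] + totals["Dw"]                             # D refs
d1_misses = totals["D1mr"] + totals["D1mw"]                      # D1 misses
l2_misses = totals["I2mr"] + totals["D2mr"] + totals["D2mw"]     # L2 misses

print(d_refs, d1_misses, l2_misses)
```
]]></programlisting>

<para>The three values match the <computeroutput>D
refs</computeroutput>, <computeroutput>D1
misses</computeroutput> and combined <computeroutput>L2
misses</computeroutput> lines of the run-time summary.</para>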
508
<para>Then follow function-by-function statistics. Each function
is identified by a
<computeroutput>file_name:function_name</computeroutput> pair. If
a column contains only a dot it means the function never performs
that event (eg. the third row shows that
<computeroutput>strcmp()</computeroutput> contains no
instructions that write to memory). The name
<computeroutput>???</computeroutput> is used if the file name
and/or function name could not be determined from debugging
information. If most of the entries have the form
<computeroutput>???:???</computeroutput> the program probably
wasn't compiled with <computeroutput>-g</computeroutput>. If any
code was invalidated (either due to self-modifying code or
unloading of shared objects) its counts are aggregated into a
single cost centre written as
<computeroutput>(discarded):(discarded)</computeroutput>.</para>
525
526<para>It is worth noting that functions will come from three
527types of source files:</para>
528
529<orderedlist>
530 <listitem>
531 <para>From the profiled program
532 (<filename>concord.c</filename> in this example).</para>
533 </listitem>
534 <listitem>
535 <para>From libraries (eg. <filename>getc.c</filename>)</para>
536 </listitem>
537 <listitem>
538 <para>From Valgrind's implementation of some libc functions
539 (eg. <computeroutput>vg_clientmalloc.c:malloc</computeroutput>).
540 These are recognisable because the filename begins with
541 <computeroutput>vg_</computeroutput>, and is probably one of
542 <filename>vg_main.c</filename>,
543 <filename>vg_clientmalloc.c</filename> or
544 <filename>vg_mylibc.c</filename>.</para>
545 </listitem>
546
547</orderedlist>
548
549<para>There are two ways to annotate source files -- by choosing
550them manually, or with the
551<computeroutput>--auto=yes</computeroutput> option. To do it
552manually, just specify the filenames as arguments to
553<computeroutput>cg_annotate</computeroutput>. For example, the
554output from running <filename>cg_annotate concord.c</filename>
555for our example produces the same output as above followed by an
556annotated version of <filename>concord.c</filename>, a section of
557which looks like:</para>
558
<programlisting><![CDATA[
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir      I1mr I2mr Dr     D1mr D2mr Dw     D1mw D2mw

[snip]

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i < TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data->word, data->line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }]]></programlisting>
591
592<para>(Although column widths are automatically minimised, a wide
593terminal is clearly useful.)</para>
594
595<para>Each source file is clearly marked
596(<computeroutput>User-annotated source</computeroutput>) as
597having been chosen manually for annotation. If the file was
598found in one of the directories specified with the
599<computeroutput>-I / --include</computeroutput> option, the directory
600and file are both given.</para>
601
602<para>Each line is annotated with its event counts. Events not
603applicable for a line are represented by a `.'; this is useful
604for distinguishing between an event which cannot happen, and one
605which can but did not.</para>
606
<para>Sometimes only a small section of a source file is
executed. To minimise uninteresting output, Cachegrind only shows
annotated lines and lines within a small distance of annotated
lines. Gaps are marked with the line numbers so you know which
part of a file the shown code comes from, eg:</para>

<programlisting><![CDATA[
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)]]></programlisting>
618
619<para>The amount of context to show around annotated lines is
620controlled by the <computeroutput>--context</computeroutput>
621option.</para>
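
<para>One way to picture the effect of
<computeroutput>--context</computeroutput> is as a merge of line
ranges around each annotated line. The sketch below is an
illustration under that assumption, with a hypothetical function
name; it is not taken from
<computeroutput>cg_annotate</computeroutput>.</para>

<programlisting><![CDATA[
```python
def lines_to_show(annotated, context, last_line):
    """Return sorted (start, end) line ranges covering every annotated
    line plus `context` lines on either side, merging ranges that touch."""
    ranges = []
    for n in sorted(annotated):
        start = max(1, n - context)
        end = min(last_line, n + context)
        if ranges and start <= ranges[-1][1] + 1:
            # Overlaps or abuts the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges

# Hypothetical file of 1000 lines with three annotated lines.
print(lines_to_show({700, 704, 878}, context=8, last_line=1000))
```
]]></programlisting>

<para>Each gap between two consecutive ranges corresponds to one
pair of <computeroutput>-- line N --</computeroutput> markers in
the annotated output.</para>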
622
623<para>To get automatic annotation, run
624<computeroutput>cg_annotate --auto=yes</computeroutput>.
625cg_annotate will automatically annotate every source file it can
626find that is mentioned in the function-by-function summary.
627Therefore, the files chosen for auto-annotation are affected by
628the <computeroutput>--sort</computeroutput> and
629<computeroutput>--threshold</computeroutput> options. Each
630source file is clearly marked (<computeroutput>Auto-annotated
631source</computeroutput>) as being chosen automatically. Any
632files that could not be found are mentioned at the end of the
633output, eg:</para>
634
<programlisting><![CDATA[
------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
------------------------------------------------------------------
  getc.c
  ctype.c
  ../sysdeps/generic/lockfile.c]]></programlisting>
642
643<para>This is quite common for library files, since libraries are
644usually compiled with debugging information, but the source files
645are often not present on a system. If a file is chosen for
646annotation <command>both</command> manually and automatically, it
647is marked as <computeroutput>User-annotated
648source</computeroutput>. Use the <computeroutput>-I /
649--include</computeroutput> option to tell Valgrind where to look
650for source files if the filenames found from the debugging
651information aren't specific enough.</para>
652
653<para>Beware that cg_annotate can take some time to digest large
654<computeroutput>cachegrind.out.pid</computeroutput> files,
655e.g. 30 seconds or more. Also beware that auto-annotation can
656produce a lot of output if your program is large!</para>
657
658</sect2>
659
660
661<sect2 id="cg-manual.assembler" xreflabel="Annotating assembler programs">
662<title>Annotating assembler programs</title>
663
664<para>Valgrind can annotate assembler programs too, or annotate
665the assembler generated for your C program. Sometimes this is
666useful for understanding what is really happening when an
667interesting line of C code is translated into multiple
668instructions.</para>
669
670<para>To do this, you just need to assemble your
671<computeroutput>.s</computeroutput> files with assembler-level
672debug information. gcc doesn't do this, but you can use the GNU
673assembler with the <computeroutput>--gstabs</computeroutput>
674option to generate object files with this information, eg:</para>
675
<programlisting><![CDATA[
as --gstabs foo.s]]></programlisting>
678
679<para>You can then profile and annotate source files in the same
680way as for C/C++ programs.</para>
681
682</sect2>
683
684</sect1>
685
686
687<sect1 id="cg-manual.annopts" xreflabel="cg_annotate options">
688<title><computeroutput>cg_annotate</computeroutput> options</title>
689
690<itemizedlist>
691
 <listitem id="pid">
  <para><computeroutput>--pid</computeroutput></para>
  <para>Indicates which
  <computeroutput>cachegrind.out.pid</computeroutput> file to
  read. Not actually an option -- it is required.</para>
 </listitem>

 <listitem>
  <para><computeroutput>-h, --help</computeroutput></para>
  <para><computeroutput>-v, --version</computeroutput></para>
  <para>Help and version, as usual.</para>
 </listitem>

 <listitem id="sort">
  <para><computeroutput>--sort=A,B,C</computeroutput> [default:
  order in
  <computeroutput>cachegrind.out.pid</computeroutput>]</para>
  <para>Specifies the events upon which the sorting of the
  function-by-function entries will be based. Useful if you
  want to concentrate on eg. I cache misses
  (<computeroutput>--sort=I1mr,I2mr</computeroutput>), or D
  cache misses
  (<computeroutput>--sort=D1mr,D2mr</computeroutput>), or L2
  misses
  (<computeroutput>--sort=D2mr,I2mr</computeroutput>).</para>
 </listitem>

 <listitem id="show">
  <para><computeroutput>--show=A,B,C</computeroutput> [default:
  all, using order in
  <computeroutput>cachegrind.out.pid</computeroutput>]</para>
  <para>Specifies which events to show (and the column
  order). Default is to use all present in the
  <computeroutput>cachegrind.out.pid</computeroutput> file (and
  use the order in the file).</para>
727 </listitem>
728
 <listitem id="threshold">
  <para><computeroutput>--threshold=X</computeroutput>
  [default: 99%]</para>
  <para>Sets the threshold for the function-by-function
  summary. Functions are shown that account for more than X%
  of the primary sort event. If auto-annotating, also affects
  which files are annotated.</para>

  <para>Note: thresholds can be set for more than one of the
  events by following any event named in the
  <computeroutput>--sort</computeroutput> option with a colon
  and a number (no spaces, though). E.g. if you want to see
  the functions that cover 99% of L2 read misses and 99% of L2
  write misses, use this option:</para>
  <para><computeroutput>--sort=D2mr:99,D2mw:99</computeroutput></para>
744 </listitem>
745
 <listitem id="auto">
  <para><computeroutput>--auto=no</computeroutput> [default]</para>
  <para><computeroutput>--auto=yes</computeroutput></para>
  <para>When enabled, automatically annotates every file that
  is mentioned in the function-by-function summary that can be
  found. Also gives a list of those that couldn't be found.</para>
 </listitem>

 <listitem id="context">
  <para><computeroutput>--context=N</computeroutput> [default:
  8]</para>
  <para>Print N lines of context before and after each
  annotated line. Avoids printing large sections of source
  files that were not executed. Use a large number
  (eg. 10,000) to show all source lines.</para>
 </listitem>

 <listitem id="include">
  <para><computeroutput>-I&lt;dir&gt;,
  --include=&lt;dir&gt;</computeroutput> [default: empty
  string]</para>
  <para>Adds a directory to the list in which to search for
  files. Multiple -I/--include options can be given to add
  multiple directories.</para>
770 </listitem>
771
772</itemizedlist>
773
774
775
<sect2>
<title>Warnings</title>

<para>There are two situations in which
<computeroutput>cg_annotate</computeroutput> issues
warnings:</para>

<itemizedlist>
  <listitem>
    <para>If a source file is more recent than the
    <computeroutput>cachegrind.out.pid</computeroutput> file.
    The information in
    <computeroutput>cachegrind.out.pid</computeroutput> is
    recorded only by line number, so if the line numbers change at
    all in the source (e.g. lines are added, deleted or swapped), any
    annotations will be incorrect.</para>
  </listitem>
  <listitem>
    <para>If information is recorded about line numbers past the
    end of a file.  This can be caused by the above problem,
    i.e. shortening the source file while using an old
    <computeroutput>cachegrind.out.pid</computeroutput> file.  If
    this happens, the figures for the bogus lines are printed
    anyway (clearly marked as bogus) in case they are
    important.</para>
  </listitem>
</itemizedlist>
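
<para>As a rough illustration, the two conditions behind these
warnings can be sketched in Python.  This is not how
<computeroutput>cg_annotate</computeroutput> is implemented; the
function names are hypothetical and exist only for this
example.</para>
<programlisting><![CDATA[
import os


def source_newer_than_profile(source_path, profile_path):
    """Return True if the source file was modified after the profile.

    Both arguments are ordinary file paths; profile_path stands in for
    a cachegrind.out.pid file (assumption for this sketch).
    """
    return os.path.getmtime(source_path) > os.path.getmtime(profile_path)


def lines_past_end(recorded_lines, source_line_count):
    """Return the recorded line numbers that lie beyond the end of the file."""
    return sorted(n for n in recorded_lines if n > source_line_count)
]]></programlisting>

<para>If the first check is true, the recorded line numbers may no
longer match the source.  Any line numbers returned by the second
check correspond to the figures that are reported as bogus.</para>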

</sect2>



<sect2>
<title>Things to watch out for</title>

<para>Some odd things that can occur during annotation:</para>

<itemizedlist>
  <listitem>
    <para>If annotating at the assembler level, you might see
    something like this:</para>
<programlisting><![CDATA[
      1  0  0  .  .  .  .  .  .         leal -12(%ebp),%eax
      1  0  0  .  .  .  1  0  0         movl %eax,84(%ebx)
      2  0  0  0  0  0  1  0  0         movl $1,-20(%ebp)
      .  .  .  .  .  .  .  .  .         .align 4,0x90
      1  0  0  .  .  .  .  .  .         movl $.LnrB,%eax
      1  0  0  .  .  .  1  0  0         movl %eax,-16(%ebp)]]></programlisting>

    <para>How can the third instruction be executed twice when
    the others are executed only once?  As it turns out, it
    isn't.  Here's a dump of the executable, using
    <computeroutput>objdump -d</computeroutput>:</para>
<programlisting><![CDATA[
      8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
      8048f28:       89 43 54                mov    %eax,0x54(%ebx)
      8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
      8048f32:       89 f6                   mov    %esi,%esi
      8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
      8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)]]></programlisting>

    <para>Notice the extra <computeroutput>mov
    %esi,%esi</computeroutput> instruction.  Where did this come
    from?  The GNU assembler inserted it to serve as the two
    bytes of padding needed to align the <computeroutput>movl
    $.LnrB,%eax</computeroutput> instruction on a four-byte
    boundary, but pretended it didn't exist when adding debug
    information.  Thus when Valgrind reads the debug info it
    thinks that the <computeroutput>movl
    $0x1,0xffffffec(%ebp)</computeroutput> instruction covers the
    address range 0x8048f2b--0x8048f33 by itself, and attributes
    the counts for the <computeroutput>mov
    %esi,%esi</computeroutput> to it.</para>
  </listitem>

  <listitem>
    <para>Inlined functions can cause strange results in the
    function-by-function summary.  If a function
    <computeroutput>inline_me()</computeroutput> is defined in
    <filename>foo.h</filename> and inlined in the functions
    <computeroutput>f1()</computeroutput>,
    <computeroutput>f2()</computeroutput> and
    <computeroutput>f3()</computeroutput> in
    <filename>bar.c</filename>, there will not be a
    <computeroutput>foo.h:inline_me()</computeroutput> function
    entry.  Instead, there will be separate function entries for
    each inlining site, i.e.
    <computeroutput>foo.h:f1()</computeroutput>,
    <computeroutput>foo.h:f2()</computeroutput> and
    <computeroutput>foo.h:f3()</computeroutput>.  To find the
    total counts for
    <computeroutput>foo.h:inline_me()</computeroutput>, add up
    the counts from each entry.</para>

    <para>The reason for this is that although the debug info
    output by gcc indicates the switch from
    <filename>bar.c</filename> to <filename>foo.h</filename>, it
    doesn't indicate the name of the function in
    <filename>foo.h</filename>, so Valgrind keeps using the old
    one.</para>
  </listitem>

  <listitem>
    <para>Sometimes the same file is referred to by a relative
    name and by an absolute name in different parts
    of the debug info, e.g.
    <filename>/home/user/proj/proj.h</filename> and
    <filename>../proj.h</filename>.  In this case, if you use
    auto-annotation, the file will be annotated twice with the
    counts split between the two.</para>
  </listitem>

  <listitem>
    <para>Files with more than 65,535 lines cause difficulties
    for the stabs debug info reader.  This is because the line
    number in the <computeroutput>struct nlist</computeroutput>
    defined in <filename>a.out.h</filename> under Linux is only a
    16-bit value.  Valgrind can handle some files with more than
    65,535 lines correctly by making some guesses to identify
    line number overflows.  Some cases are beyond it, however, and
    then you'll get a warning message explaining that
    annotations for the file might be incorrect.</para>
  </listitem>

  <listitem>
    <para>If you compile some files with
    <computeroutput>-g</computeroutput> and some without, some
    events that take place in a file without debug info could be
    attributed to the last line of a file with debug info
    (whichever one gets placed before the non-debug-info file in
    the executable).</para>
  </listitem>

</itemizedlist>

<para>This list looks long, but these cases should be fairly
rare.</para>
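
<para>The padding arithmetic in the assembler-level example above
can be checked with a few lines of Python.  The addresses and the
seven-byte instruction length are taken from the
<computeroutput>objdump</computeroutput> listing; the helper name
<computeroutput>padding_for_alignment</computeroutput> is made up
for this sketch.</para>
<programlisting><![CDATA[
def padding_for_alignment(address, alignment):
    """Bytes of padding needed to round address up to a multiple of alignment."""
    return (-address) % alignment


movl_start = 0x8048f2b    # address of: movl $0x1,0xffffffec(%ebp)
movl_length = 7           # its encoding: c7 45 ec 01 00 00 00

after_movl = movl_start + movl_length            # first byte after the movl
padding = padding_for_alignment(after_movl, 4)   # filled by: mov %esi,%esi
next_instr = after_movl + padding                # the four-byte-aligned mov

print(hex(after_movl), padding, hex(next_instr))  # 0x8048f32 2 0x8048f34
]]></programlisting>

<para>The two padding bytes at 0x8048f32 and 0x8048f33 are the ones
whose counts end up attributed to the preceding
<computeroutput>movl</computeroutput>.</para>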

<formalpara>
  <title>Note:</title>
  <para><computeroutput>stabs</computeroutput> is not an easy
  format to read.  If you come across bizarre annotations that
  look like they might be caused by a bug in the stabs reader, please
  let us know.</para>
</formalpara>
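
<para>To see why line numbers above 65,535 are a problem for a
16-bit field, the sketch below recovers wrapped line numbers by
assuming that they never decrease.  It only illustrates the kind of
guess involved; it is not the heuristic Valgrind's stabs reader
uses, and the function name is hypothetical.</para>
<programlisting><![CDATA[
def unwrap_line_numbers(recorded):
    """Undo 16-bit wraparound, assuming the true numbers never decrease."""
    result = []
    wraps = 0          # how many times the 16-bit counter has overflowed
    previous = None
    for n in recorded:
        if previous is not None and n < previous:
            wraps += 1     # a drop is taken to mean an overflow occurred
        result.append(n + wraps * 65536)
        previous = n
    return result


true_lines = [65530, 65534, 65537, 65540]
stored = [n % 65536 for n in true_lines]   # what fits in 16 bits
print(stored)                              # [65530, 65534, 1, 4]
print(unwrap_line_numbers(stored))         # [65530, 65534, 65537, 65540]
]]></programlisting>

<para>A genuine backward jump in line numbers, such as the sequence
10, 20, 5, would be misread by this sketch as an overflow, which
shows why guesses of this sort cannot handle every case.</para>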

</sect2>



<sect2>
<title>Accuracy</title>

<para>Valgrind's cache profiling has a number of
shortcomings:</para>

<itemizedlist>
  <listitem>
    <para>It doesn't account for kernel activity -- the effect of
    system calls on the cache contents is ignored.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for other process activity (although
    this is probably desirable when considering a single
    program).</para>
  </listitem>

  <listitem>
    <para>It doesn't account for virtual-to-physical address
    mappings; hence the simulation is not a true
    representation of what's happening in the
    cache.</para>
  </listitem>

  <listitem>
    <para>It doesn't account for cache misses not visible at the
    instruction level, e.g. those arising from TLB misses or
    speculative execution.</para>
  </listitem>

  <listitem>
    <para>Valgrind schedules threads differently from how they
    would be scheduled when running natively.
    This could warp the results for threaded programs.</para>
  </listitem>

  <listitem>
    <para>The x86/amd64 instructions <computeroutput>bts</computeroutput>,
    <computeroutput>btr</computeroutput> and
    <computeroutput>btc</computeroutput> will incorrectly be
    counted as doing a data read if both the arguments are
    registers, e.g.:</para>
<programlisting><![CDATA[
    btsl %eax, %edx]]></programlisting>

    <para>This should only happen rarely.</para>
  </listitem>

  <listitem>
    <para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
    (e.g. <computeroutput>fsave</computeroutput>) are treated as
    though they only access 16 bytes.  These instructions seem to
    be rare so hopefully this won't affect accuracy much.</para>
  </listitem>

</itemizedlist>

<para>Another thing worth noting is that results are very
sensitive to small changes.  Changing the size of the
<filename>valgrind.so</filename> file, the size of the program
being profiled, or even the length of its name can perturb the
results.  Variations will be small, but don't expect perfectly
repeatable results if your program changes at all.</para>

<para>While these factors mean you shouldn't trust the results to
be super-accurate, they should be close enough to be
useful.</para>

</sect2>


<sect2>
<title>Todo</title>

<itemizedlist>
  <listitem>
    <para>Program start-up/shut-down calls a lot of functions
    that aren't interesting and just complicate the output.
    It would be nice to exclude these somehow.</para>
  </listitem>
</itemizedlist>

</sect2>

</sect1>
</chapter>