sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<style type="text/css">
				4	body { background-color: #ffffff;
				5	color: #000000;
				6	font-family: Times, Helvetica, Arial;
				7	font-size: 14pt}
				8	h4 { margin-bottom: 0.3em}
				9	code { color: #000000;
				10	font-family: Courier;
				11	font-size: 13pt }
				12	pre { color: #000000;
				13	font-family: Courier;
				14	font-size: 13pt }
				15	a:link { color: #0000C0;
				16	text-decoration: none; }
				17	a:visited { color: #0000C0;
				18	text-decoration: none; }
				19	a:active { color: #0000C0;
				20	text-decoration: none; }
				21	</style>
				22	<title>Cachegrind</title>
				23	</head>
				24
				25	<body bgcolor="#ffffff">
				26
				27	<a name="title"> </a>
				28	<h1 align=center>Cachegrind, version 1.0.0</h1>
				29	<center>This manual was last updated on 20020726</center>
				30	<p>
				31
				32	<center>
				33	<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
				34	Copyright © 2000-2002 Julian Seward
				35	<p>
				36	Cachegrind is licensed under the GNU General Public License,
				37	version 2<br>
				38	An open-source tool for cache profiling of
				39	Linux-x86 executables.
				40	</center>
				41
				42	<p>
				43
				44	<hr width="100%">
				45	<a name="contents"></a>
				46	<h2>Contents of this manual</h2>
				47
				48	<h4>1  <a href="#cache">How to use Cachegrind</a></h4>
				49
				50	<h4>2  <a href="techdocs.html">How Cachegrind works</a></h4>
				51
				52	<hr width="100%">
				53
				54
				55	<a name="cache"></a>
				56	<h2>1  Cache profiling</h2>
				57	Cachegrind is a tool for doing cache simulations and annotating your source
				58	line-by-line with the number of cache misses. In particular, it records:
				59	<ul>
				60	<li>L1 instruction cache reads and misses;
				61	<li>L1 data cache reads and read misses, writes and write misses;
				62	<li>L2 unified cache reads and read misses, writes and write misses.
				63	</ul>
				64	On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
				65	and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
				66	very useful for improving the performance of your program.<p>
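As a rough illustration of why these figures matter, the sketch below turns miss counts into an estimated cycle penalty. It assumes the ballpark costs quoted above (10 cycles per L1 miss, 200 per L2 miss); the function name and the sample counts are hypothetical, and real penalties vary from machine to machine.

```python
# Ballpark per-miss costs taken from the text above; they are assumptions,
# not measurements of any particular machine.
L1_MISS_CYCLES = 10
L2_MISS_CYCLES = 200


def estimated_miss_penalty(l1_misses, l2_misses):
    """Return a crude estimate of the cycles lost to cache misses."""
    return l1_misses * L1_MISS_CYCLES + l2_misses * L2_MISS_CYCLES


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    print(estimated_miss_penalty(40_000, 20_000))
```

Even with far fewer L2 misses than L1 misses, the L2 term tends to dominate such an estimate, which is why L2 miss counts usually deserve attention first.<p>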
				67
				68	Also, since one instruction cache read is performed per instruction executed,
				69	you can find out how many instructions are executed per line, which can be
				70	useful for traditional profiling and test coverage.<p>
				71
				72	Any feedback, bug-fixes, suggestions, etc, welcome.
				73
				74
				75	<h3>1.1  Overview</h3>
				76	First off, as for normal Valgrind use, you probably want to compile with
				77	debugging info (the <code>-g</code> flag). But by contrast with normal
				78	Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
				79	should profile your program as it will be normally run.
				80
				81	The two steps are:
				82	<ol>
				83	<li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
				84	the normal command line invocation. When the program finishes,
				85	Valgrind will print summary cache statistics. It also collects
				86	line-by-line information in a file
				87	<code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
				88	is the program's process id.
				89	<p>
				90	This step should be done every time you want to collect
				91	information about a new program, a changed program, or about the
				92	same program with different input.
				93	</li>
				94	<p>
				95	<li>Generate a function-by-function summary, and possibly annotate
				96	source files with 'cg_annotate'. Source files to annotate can be
				97	specified manually on the command line, or
				98	"interesting" source files can be annotated automatically with
				99	the <code>--auto=yes</code> option. You can annotate C/C++
				100	files or assembly language files equally easily.
				101	<p>
				102	This step can be performed as many times as you like for each
				103	run of step 1. You may want to do multiple annotations showing
				104	different information each time.<p>
				105	</li>
				106	</ol>
				107
				108	The steps are described in detail in the following sections.<p>
				109
				110
				111	<h3>1.2  Cache simulation specifics</h3>
				112
				113	Cachegrind uses a simulation for a machine with a split L1 cache and a unified
				114	L2 cache. This configuration is used for all (modern) x86-based machines we
				115	are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
				116	ancient history now.<p>
				117
				118	The more specific characteristics of the simulation are as follows.
				119
				120	<ul>
				121	<li>Write-allocate: when a write miss occurs, the block written to
				122	is brought into the D1 cache. Most modern caches have this
				123	property.</li><p>
				124
				125	<li>Bit-selection hash function: the line(s) in the cache to which a
				126	memory block maps is chosen by the middle bits M--(M+N-1) of the
				127	byte address, where:
				128	<ul>
				129	<li> line size = 2^M bytes </li>
				130	<li>(cache size / line size) = 2^N lines</li>
				131	</ul> </li><p>
				132
				133	<li>Inclusive L2 cache: the L2 cache replicates all the entries of
				134	the L1 cache. This is standard on Pentium chips, but AMD
				135	Athlons use an exclusive L2 cache that only holds blocks evicted
				136	from L1. Ditto AMD Durons and most modern VIAs.</li><p>
				137	</ul>
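The bit-selection rule described above can be sketched in a few lines. This is only an illustration of the formula as stated in this section (it ignores associativity), not Cachegrind's actual code; the function names are hypothetical.

```python
def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def line_index(address, line_size, num_lines):
    """Select bits M..(M+N-1) of a byte address.

    line_size = 2**M bytes and num_lines = 2**N (cache size / line size).
    """
    if not (is_power_of_two(line_size) and is_power_of_two(num_lines)):
        raise ValueError("line_size and num_lines must be powers of two")
    m = line_size.bit_length() - 1          # M = log2(line_size)
    return (address >> m) & (num_lines - 1)  # keep the next N bits
```

For a 65536-byte cache with 64-byte lines, M is 6 and N is 10, so two addresses exactly 65536 bytes apart select the same index.<p>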
				138
				139	The cache configuration simulated (cache size, associativity and line size) is
				140	determined automagically using the CPUID instruction. If you have an old
				141	machine that (a) doesn't support the CPUID instruction, or (b) supports it in
				142	an early incarnation that doesn't give any cache information, then Cachegrind
				143	will fall back to using a default configuration (that of a model 3/4 Athlon).
				144	Cachegrind will tell you if this happens. You can manually specify one, two or
				145	all three levels (I1/D1/L2) of the cache from the command line using the
				146	<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.<p>
				147
				148	Other noteworthy behaviour:
				149
				150	<ul>
				151	<li>References that straddle two cache lines are treated as follows:
				152	<ul>
				153	<li>If both blocks hit --> counted as one hit</li>
				154	<li>If one block hits, the other misses --> counted as one miss</li>
				155	<li>If both blocks miss --> counted as one miss (not two)</li>
				156	</ul><p></li>
				157
				158	<li>Instructions that modify a memory location (eg. <code>inc</code> and
				159	<code>dec</code>) are counted as doing just a read, ie. a single data
				160	reference. This may seem strange, but since the write can never cause a
				161	miss (the read guarantees the block is in the cache) it's not very
				162	interesting.<p>
				163
				164	Thus it measures not the number of times the data cache is accessed, but
				165	the number of times a data cache miss could occur.<p>
				166	</li>
				167	</ul>
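The counting rule for references that straddle two cache lines can be sketched as follows. The "cache" here is just a set of resident line numbers with no eviction, which is an assumption made to keep the example short; it is not how the real simulator stores its state, and the function name is hypothetical.

```python
def count_reference(address, size, line_size, resident_lines):
    """Return 1 if the reference counts as a miss, otherwise 0.

    resident_lines is a set of line numbers currently held in the cache
    (a stand-in for a real cache model; nothing is ever evicted here).
    """
    first = address // line_size
    last = (address + size - 1) // line_size
    touched = range(first, last + 1)
    # One miss at most, whether one block or both blocks are absent.
    missed = any(line not in resident_lines for line in touched)
    resident_lines.update(touched)  # every touched block is now cached
    return 1 if missed else 0
```

A 4-byte access at address 62 with 64-byte lines touches lines 0 and 1, yet it contributes at most one miss to the totals.<p>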
				168
				169	If you are interested in simulating a cache with different properties, it is
				170	not particularly hard to write your own cache simulator, or to modify the
				171	existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
				172	<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
				173	interested to hear from anyone who does.
				174
				175	<a name="profile"></a>
				176	<h3>1.3  Profiling programs</h3>
				177
				178	Cache profiling is enabled by using the <code>--skin=cachegrind</code>
				179	option to the <code>valgrind</code> shell script. To gather cache profiling
				180	information about the program <code>ls -l</code>, type:
				181
				182	<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>
				183
				184	The program will execute (slowly). Upon completion, summary statistics
				185	that look like this will be printed:
				186
				187	<pre>
				188	==31751== I refs: 27,742,716
				189	==31751== I1 misses: 276
				190	==31751== L2 misses: 275
				191	==31751== I1 miss rate: 0.0%
				192	==31751== L2i miss rate: 0.0%
				193	==31751==
				194	==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
				195	==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
				196	==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
				197	==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
				198	==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
				199	==31751==
				200	==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
				201	==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
				202	</pre>
				203
				204	Cache accesses for instruction fetches are summarised first, giving the
				205	number of fetches made (this is the number of instructions executed, which
				206	can be useful to know in its own right), the number of I1 misses, and the
				207	number of L2 instruction (<code>L2i</code>) misses.<p>
				208
				209	Cache accesses for data follow. The information is similar to that of the
				210	instruction fetches, except that the values are also shown split between reads
				211	and writes (note each row's <code>rd</code> and <code>wr</code> values add up
				212	to the row's total).<p>
				213
				214	Combined instruction and data figures for the L2 cache follow that.<p>
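The relationships between the rows can be checked with a little arithmetic. The sketch below re-uses the figures from the sample output above; the dictionary keys and the helper name are made up for this example (the data-side L2 row is labelled "L2d misses" here only to tell it apart from the combined row).

```python
# (reads, writes) pairs copied from the sample summary above.
summary = {
    "D refs": (10_955_517, 4_474_773),
    "D1 misses": (21_905, 19_280),
    "L2d misses": (3_987, 19_098),
}
l2_instruction_misses = 275

# Each row's total is its rd value plus its wr value.
totals = {name: rd + wr for name, (rd, wr) in summary.items()}

# The combined L2 figure adds instruction-side and data-side misses.
l2_total = l2_instruction_misses + totals["L2d misses"]


def miss_rate_percent(misses, refs):
    """Misses as a percentage of references (0.0 if there were none)."""
    return 100.0 * misses / refs if refs else 0.0
```

The computed totals match the `D refs`, `D1 misses` and combined `L2 misses` rows of the sample output.<p>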
				215
				216
				217	<h3>1.4  Output file</h3>
				218
				219	As well as printing summary information, Cachegrind also writes
				220	line-by-line cache profiling information to a file named
				221	<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
				222	best interpreted by the accompanying program <code>cg_annotate</code>,
				223	described in the next section.
				224	<p>
				225	Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
				226	<ul>
				227	<li>It is written every time <code>valgrind --skin=cachegrind</code>
				228	is run, and will overwrite any existing
				229	<code>cachegrind.out.<i>pid</i></code> in the current directory (but
				230	that won't happen very often because it takes some time for process ids
				231	to be recycled).</li>
				232	<p>
				233	<li>It can be huge: <code>ls -l</code> generates a file of about
				234	350KB. Browsing a few files and web pages with a Konqueror
				235	built with full debugging information generates a file
				236	of around 15 MB.</li>
				237	</ul>
				238
				239	Note that older versions of Cachegrind used a log file named
				240	<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
				241	The suffix serves two purposes. Firstly, it means you don't have to rename old
				242	log files that you don't want to overwrite. Secondly, and more importantly,
				243	it allows correct profiling with the <code>--trace-children=yes</code> option
				244	of programs that spawn child processes.
				245
				246	<a name="profileflags"></a>
				247	<h3>1.5  Cachegrind options</h3>
				248	Cachegrind accepts all the options that Valgrind does, although some of them
				249	(ones related to memory checking) don't do anything when cache profiling.<p>
				250
				251	The interesting cache-simulation specific options are:
				252
				253	<ul>
				254	<li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
				255	<code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
				256	<code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
				257	[default: uses CPUID for automagic cache configuration]<p>
				258
				259	Manually specifies the I1/D1/L2 cache configuration, where
				260	<code>size</code> and <code>line_size</code> are measured in bytes. The
				261	three items must be comma-separated, but with no spaces, eg:
				262
				263	<blockquote>
				264	<code>valgrind --skin=cachegrind --I1=65536,2,64</code>
				265	</blockquote>
				266
				267	You can specify one, two or three of the I1/D1/L2 caches. Any level not
				268	manually specified will be simulated using the configuration found in the
				269	normal way (via the CPUID instruction, or failing that, via defaults).
				270	</ul>
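To make the expected shape of the option value concrete, here is a small sketch that splits a `size,associativity,line_size` string into three integers and rejects values containing spaces. It is a hypothetical helper written for this manual, not the parser Valgrind itself uses.

```python
def parse_cache_option(value):
    """Parse 'size,associativity,line_size' into a tuple of three ints."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated items")
    for part in parts:
        # str.isdigit() rejects spaces, signs and empty strings.
        if not part.isdigit():
            raise ValueError(f"not a plain decimal number: {part!r}")
    size, associativity, line_size = (int(part) for part in parts)
    return size, associativity, line_size
```

For example, `parse_cache_option("65536,2,64")` yields the size, associativity and line size of a 64 KB, 2-way cache with 64-byte lines.<p>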
				271
				272
				273	<a name="annotate"></a>
				274	<h3>1.6  Annotating C/C++ programs</h3>
				275
				276	Before using <code>cg_annotate</code>, it is worth widening your
				277	window to be at least 120 characters wide if possible, as the output
				278	lines can be quite long.
				279	<p>
				280	To get a function-by-function summary, run <code>cg_annotate
				281	--<i>pid</i></code> in a directory containing a
				282	<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
				283	is required so that <code>cg_annotate</code> knows which log file to use when
				284	several are present.
				285	<p>
				286	The output looks like this:
				287
				288	<pre>
				289	--------------------------------------------------------------------------------
				290	I1 cache: 65536 B, 64 B, 2-way associative
				291	D1 cache: 65536 B, 64 B, 2-way associative
				292	L2 cache: 262144 B, 64 B, 8-way associative
				293	Command: concord vg_to_ucode.c
				294	Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
				295	Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
				296	Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
				297	Threshold: 99%
				298	Chosen for annotation:
				299	Auto-annotation: on
				300
				301	--------------------------------------------------------------------------------
				302	Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
				303	--------------------------------------------------------------------------------
				304	27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
				305
				306	--------------------------------------------------------------------------------
				307	Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
				308	--------------------------------------------------------------------------------
				309	8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
				310	5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
				311	2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
				312	2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
				313	2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
				314	1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
				315	897,991 51 51 897,831 95 30 62 1 1 ???:???
				316	598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
				317	598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
				318	598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
				319	446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
				320	341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
				321	320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
				322	298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
				323	149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
				324	149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
				325	95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
				326	85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
				327	</pre>
				328
				329	First up is a summary of the annotation options:
				330
				331	<ul>
				332	<li>I1 cache, D1 cache, L2 cache: the cache configuration with which
				333	these results were obtained.</li><p>
				334
				335	<li>Command: the command line invocation of the program under
				336	examination.</li><p>
				337
				338	<li>Events recorded: event abbreviations are:<p>
				339	<ul>
				340	<li><code>Ir </code>: I cache reads (ie. instructions executed)</li>
				341	<li><code>I1mr</code>: I1 cache read misses</li>
				342	<li><code>I2mr</code>: L2 cache instruction read misses</li>
				343	<li><code>Dr </code>: D cache reads (ie. memory reads)</li>
				344	<li><code>D1mr</code>: D1 cache read misses</li>
				345	<li><code>D2mr</code>: L2 cache data read misses</li>
				346	<li><code>Dw </code>: D cache writes (ie. memory writes)</li>
				347	<li><code>D1mw</code>: D1 cache write misses</li>
				348	<li><code>D2mw</code>: L2 cache data write misses</li>
				349	</ul><p>
				350	Note that D1 total misses is given by <code>D1mr</code> +
				351	<code>D1mw</code>, and that L2 total misses is given by
				352	<code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>
				353
				354	<li>Events shown: the events shown (a subset of events gathered). This can
				355	be adjusted with the <code>--show</code> option.</li><p>
				356
				357	<li>Event sort order: the sort order in which functions are shown. For
				358	example, in this case the functions are sorted from highest
				359	<code>Ir</code> counts to lowest. If two functions have identical
				360	<code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
				361	counts, and so on. This order can be adjusted with the
				362	<code>--sort</code> option.<p>
				363
				364	Note that this dictates the order the functions appear. It is <b>not</b>
				365	the order in which the columns appear; that is dictated by the "events
				366	shown" line (and can be changed with the <code>--show</code> option).
				367	</li><p>
				368
				369	<li>Threshold: <code>cg_annotate</code> by default omits functions
				370	that cause very low numbers of misses to avoid drowning you in
				371	information. In this case, cg_annotate shows summaries for the
				372	functions that account for 99% of the <code>Ir</code> counts;
				373	<code>Ir</code> is chosen as the threshold event since it is the
				374	primary sort event. The threshold can be adjusted with the
				375	<code>--threshold</code> option.</li><p>
				376
				377	<li>Chosen for annotation: names of files specified manually for annotation;
				378	in this case none.</li><p>
				379
				380	<li>Auto-annotation: whether auto-annotation was requested via the
				381	<code>--auto=yes</code> option. In this case no.</li><p>
				382	</ul>
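One plausible reading of the threshold rule is sketched below: functions are taken in descending order of the primary sort event until the chosen percentage of the total is covered. This is an assumption about the behaviour for illustration only, not a description of cg_annotate's source; the function name and the sample counts are hypothetical.

```python
def functions_within_threshold(counts, threshold_percent):
    """Return function names, largest first, covering threshold_percent.

    counts maps a function name to its count for the primary sort event.
    """
    total = sum(counts.values())
    shown, running = [], 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in ordered:
        # Stop once the functions already shown cover the threshold.
        if running * 100 >= threshold_percent * total:
            break
        shown.append(name)
        running += count
    return shown
```

With counts of 90, 9 and 1 for three functions and a threshold of 99, only the first two are listed; the third contributes too little to be shown.<p>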
				383
				384	Then follow summary statistics for the whole program. These are similar
				385	to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>
				386
				387	Then follow function-by-function statistics. Each function is
				388	identified by a <code>file_name:function_name</code> pair. If a column
				389	contains only a dot it means the function never performs
				390	that event (eg. the third row shows that <code>strcmp()</code>
				391	contains no instructions that write to memory). The name
				392	<code>???</code> is used if the file name and/or function name
				393	could not be determined from debugging information. If most of the
				394	entries have the form <code>???:???</code> the program probably wasn't
				395	compiled with <code>-g</code>. If any code was invalidated (either due to
				396	self-modifying code or unloading of shared objects) its counts are aggregated
				397	into a single cost centre written as <code>(discarded):(discarded)</code>.<p>
				398
				399	It is worth noting that functions will come from three types of source files:
				400	<ol>
				401	<li> From the profiled program (<code>concord.c</code> in this example).</li>
				402	<li>From libraries (eg. <code>getc.c</code>)</li>
				403	<li>From Valgrind's implementation of some libc functions (eg.
				404	<code>vg_clientmalloc.c:malloc</code>). These are recognisable because
				405	the filename begins with <code>vg_</code>, and is probably one of
				406	<code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
				407	<code>vg_mylibc.c</code>.
				408	</li>
				409	</ol>
				410
				411	There are two ways to annotate source files -- by choosing them
				412	manually, or with the <code>--auto=yes</code> option. To do it
				413	manually, just specify the filenames as arguments to
				414	<code>cg_annotate</code>. For example, running
				415	<code>cg_annotate concord.c</code> for our example produces the same
				416	output as above followed by an annotated version of
				417	<code>concord.c</code>, a section of which looks like:
				418
				419	<pre>
				420	--------------------------------------------------------------------------------
				421	-- User-annotated source: concord.c
				422	--------------------------------------------------------------------------------
				423	Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
				424
				425	[snip]
				426
				427	. . . . . . . . . void init_hash_table(char file_name, Word_Node table[])
				428	3 1 1 . . . 1 0 0 {
				429	. . . . . . . . . FILE *file_ptr;
				430	. . . . . . . . . Word_Info *data;
				431	1 0 0 . . . 1 1 1 int line = 1, i;
				432	. . . . . . . . .
				433	5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
				434	. . . . . . . . .
				435	4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
				436	3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
				437	. . . . . . . . .
				438	. . . . . . . . . /* Open file, check it. */
				439	6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
				440	2 0 0 1 0 0 . . . if (!(file_ptr)) {
				441	. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
				442	1 1 1 . . . . . . exit(EXIT_FAILURE);
				443	. . . . . . . . . }
				444	. . . . . . . . .
				445	165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
				446	146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
				447	. . . . . . . . .
				448	4 0 0 1 0 0 2 0 0 free(data);
				449	4 0 0 1 0 0 2 0 0 fclose(file_ptr);
				450	3 0 0 2 0 0 . . . }
				451	</pre>
				452
				453	(Although column widths are automatically minimised, a wide terminal is clearly
				454	useful.)<p>
				455
				456	Each source file is clearly marked (<code>User-annotated source</code>) as
				457	having been chosen manually for annotation. If the file was found in one of
				458	the directories specified with the <code>-I</code>/<code>--include</code>
				459	option, the directory and file are both given.<p>
				460
				461	Each line is annotated with its event counts. Events not applicable for a line
				462	are represented by a `.'; this is useful for distinguishing between an event
				463	which cannot happen, and one which can but did not.<p>
				464
				465	Sometimes only a small section of a source file is executed. To minimise
				466	uninteresting output, Valgrind only shows annotated lines and lines within a
				467	small distance of annotated lines. Gaps are marked with the line numbers so
				468	you know which part of a file the shown code comes from, eg:
				469
				470	<pre>
				471	(figures and code for line 704)
				472	-- line 704 ----------------------------------------
				473	-- line 878 ----------------------------------------
				474	(figures and code for line 878)
				475	</pre>
				476
				477	The amount of context to show around annotated lines is controlled by the
				478	<code>--context</code> option.<p>
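The effect of the context setting can be pictured with a short sketch: given the annotated line numbers and a context size, it computes which source lines would be displayed. This illustrates the idea only; the function name and its arguments are hypothetical and do not come from cg_annotate.

```python
def lines_to_show(annotated, context, last_line):
    """Return the sorted line numbers within `context` of an annotated line.

    Lines are numbered from 1 to last_line inclusive.
    """
    shown = set()
    for line in annotated:
        start = max(1, line - context)
        end = min(last_line, line + context)
        shown.update(range(start, end + 1))
    return sorted(shown)
```

Any jump in the returned sequence corresponds to a gap that would be marked with line numbers in the annotated output.<p>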
				479
				480	To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
				481	cg_annotate will automatically annotate every source file it can find that is
				482	mentioned in the function-by-function summary. Therefore, the files chosen for
				483	auto-annotation are affected by the <code>--sort</code> and
				484	<code>--threshold</code> options. Each source file is clearly marked
				485	(<code>Auto-annotated source</code>) as being chosen automatically. Any files
				486	that could not be found are mentioned at the end of the output, eg:
				487
				488	<pre>
				489	--------------------------------------------------------------------------------
				490	The following files chosen for auto-annotation could not be found:
				491	--------------------------------------------------------------------------------
				492	getc.c
				493	ctype.c
				494	../sysdeps/generic/lockfile.c
				495	</pre>
				496
				497	This is quite common for library files, since libraries are usually compiled
				498	with debugging information, but the source files are often not present on a
				499	system. If a file is chosen for annotation <b>both</b> manually and
				500	automatically, it is marked as <code>User-annotated source</code>.
				501
				502	Use the <code>-I/--include</code> option to tell Valgrind where to look for
				503	source files if the filenames found from the debugging information aren't
				504	specific enough.
				505
				506	Beware that cg_annotate can take some time to digest large
				507	<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
				508	beware that auto-annotation can produce a lot of output if your program is
				509	large!
				510
				511
				512	<h3>1.7  Annotating assembler programs</h3>
				513
				514	Valgrind can annotate assembler programs too, or annotate the
				515	assembler generated for your C program. Sometimes this is useful for
				516	understanding what is really happening when an interesting line of C
				517	code is translated into multiple instructions.<p>
				518
				519	To do this, you just need to assemble your <code>.s</code> files with
				520	assembler-level debug information. gcc doesn't do this, but you can
				521	use the GNU assembler with the <code>--gstabs</code> option to
				522	generate object files with this information, eg:
				523
				524	<blockquote><code>as --gstabs foo.s</code></blockquote>
				525
				526	You can then profile and annotate source files in the same way as for C/C++
				527	programs.
				528
				529
				530	<h3>1.8  <code>cg_annotate</code> options</h3>
				531	<ul>
				532	<li><code>--<i>pid</i></code><p>
				533	
				534	Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
				535	Not actually an option -- it is required.</li><p>
				536
				537	<li><code>-h, --help</code></li><p>
				538	<li><code>-v, --version</code><p>
				539
				540	Help and version, as usual.</li>
				541
				542	<li><code>--sort=A,B,C</code> [default: order in
				543	<code>cachegrind.out.<i>pid</i></code>]<p>
				544	Specifies the events upon which the sorting of the function-by-function
				545	entries will be based. Useful if you want to concentrate on eg. I cache
				546	misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
				547	(<code>--sort=D1mr,D2mr</code>), or L2 misses
				548	(<code>--sort=D2mr,I2mr</code>).</li><p>
				549
				550	<li><code>--show=A,B,C</code> [default: all, using order in
				551	<code>cachegrind.out.<i>pid</i></code>]<p>
				552	Specifies which events to show (and the column order). Default is to use
				553	all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
				554	the order in the file).</li><p>
				555
				556	<li><code>--threshold=X</code> [default: 99%] <p>
				557	Sets the threshold for the function-by-function summary. Functions are
				558	shown that account for more than X% of the primary sort event. If
				559	auto-annotating, also affects which files are annotated.
				560
				561	Note: thresholds can be set for more than one of the events by following
				562	any event in the <code>--sort</code> option with a colon and a number
				563	(no spaces, though). E.g. if you want to see the functions that cover
				564	99% of L2 read misses and 99% of L2 write misses, use this option:
				565
				566	<blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
				567	</li><p>
				568
				569	<li><code>--auto=no</code> [default]<br>
				570	<code>--auto=yes</code> <p>
				571	When enabled, automatically annotates every file that is mentioned in the
				572	function-by-function summary that can be found. Also gives a list of
				573	those that couldn't be found.
				574
				575	<li><code>--context=N</code> [default: 8]<p>
				576	Print N lines of context before and after each annotated line. Avoids
				577	printing large sections of source files that were not executed. Use a
				578	large number (eg. 10,000) to show all source lines.
				579	</li><p>
				580
				581	<li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
				582	[default: empty string]<p>
				583	Adds a directory to the list in which to search for files. Multiple
				584	-I/--include options can be given to add multiple directories.
				585	</ul>
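The sort specification, including the optional per-event thresholds, has a simple shape that the sketch below takes apart, followed by a multi-key sort of the kind described for the function-by-function summary. Both helpers are hypothetical illustrations and are not taken from cg_annotate.

```python
def parse_sort_spec(spec):
    """Turn 'D2mr:99,D2mw:99' into [('D2mr', 99), ('D2mw', 99)].

    Events given without ':N' get a threshold of None.
    """
    keys = []
    for item in spec.split(","):
        event, _, threshold = item.partition(":")
        keys.append((event, int(threshold) if threshold else None))
    return keys


def sort_functions(rows, events):
    """Order function names by the given events, highest counts first.

    rows maps a function name to a dict of event name -> count; ties on
    the first event are broken by the second, and so on.
    """
    def sort_key(name):
        return tuple(rows[name].get(event, 0) for event in events)

    return sorted(rows, key=sort_key, reverse=True)
```

Sorting on `Ir` and then `I1mr` places functions with equal `Ir` counts in order of their `I1mr` counts, as described for the default sort order.<p>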
				586
				587
				588	<h3>1.9  Warnings</h3>
				589	There are a couple of situations in which cg_annotate issues warnings.
				590
				591	<ul>
				592	<li>If a source file is more recent than the
				593	<code>cachegrind.out.<i>pid</i></code> file. This is because the
				594	information in <code>cachegrind.out.<i>pid</i></code> is only recorded
				595	with line numbers, so if the line numbers change at all in the source
				596	(eg. lines added, deleted, swapped), any annotations will be
				597	incorrect.<p>
				598
				599	<li>If information is recorded about line numbers past the end of a file.
				600	This can be caused by the above problem, ie. shortening the source file
				601	while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
				602	happens, the figures for the bogus lines are printed anyway (clearly
				603	marked as bogus) in case they are important.</li><p>
				604	</ul>
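The first warning amounts to a timestamp comparison, which you can also make yourself before annotating. The sketch below is a minimal version using modification times; the function name is hypothetical, and whether cg_annotate compares exactly these timestamps is an assumption.

```python
import os


def source_is_newer(source_path, profile_path):
    """True if the source file was modified after the profile was written."""
    return os.path.getmtime(source_path) > os.path.getmtime(profile_path)
```

If this returns true for a file you are about to annotate, re-running the profiling step is the safest way to get line numbers that match the source.<p>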
				605
				606
				607	<h3>1.10  Things to watch out for</h3>
				608	Some odd things that can occur during annotation:
				609
				610	<ul>
				611	<li>If annotating at the assembler level, you might see something like this:
				612
				613	<pre>
				614	1 0 0 . . . . . . leal -12(%ebp),%eax
				615	1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
				616	2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
				617	. . . . . . . . . .align 4,0x90
				618	1 0 0 . . . . . . movl $.LnrB,%eax
				619	1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
				620	</pre>
				621
				622	How can the third instruction be executed twice when the others are
				623	executed only once? As it turns out, it isn't. Here's a dump of the
				624	executable, using <code>objdump -d</code>:
				625
				626	<pre>
				627	8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
				628	8048f28: 89 43 54 mov %eax,0x54(%ebx)
				629	8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
				630	8048f32: 89 f6 mov %esi,%esi
				631	8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
				632	8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
				633	</pre>
				634
				635	Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
				636	come from? The GNU assembler inserted it to serve as the two bytes of
				637	padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
				638	a four-byte boundary, but pretended it didn't exist when adding debug
				639	information. Thus when Valgrind reads the debug info it thinks that the
				640	<code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
				641	range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
				642	<code>mov %esi,%esi</code> to it.<p>
				643	</li>
				644
				645	<li>Inlined functions can cause strange results in the function-by-function
				646	summary. If a function <code>inline_me()</code> is defined in
				647	<code>foo.h</code> and inlined in the functions <code>f1()</code>,
				648	<code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
				649	not be a <code>foo.h:inline_me()</code> function entry. Instead, there
				650	will be separate function entries for each inlining site, ie.
				651	<code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
				652	<code>foo.h:f3()</code>. To find the total counts for
				653	<code>foo.h:inline_me()</code>, add up the counts from each entry.<p>
				654
				655	The reason for this is that although the debug info output by gcc
				656	indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
				657	doesn't indicate the name of the function in <code>foo.h</code>, so
				658	Valgrind keeps using the old one.<p>
				659
				660	<li>Sometimes, the same filename might be represented with a relative name
				661	and with an absolute name in different parts of the debug info, e.g.
				662	<code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
				663	case, if you use auto-annotation, the file will be annotated twice with
				664	the counts split between the two.<p>
				665	</li>
				666
				667	<li>Files with more than 65,535 lines cause difficulties for the stabs debug
				668	info reader. This is because the line number in the <code>struct
				669	nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
				670	value. Valgrind can handle some files with more than 65,535 lines
				671	correctly by making some guesses to identify line number overflows. But
				672	some cases are beyond it, in which case you'll get a warning message
				673	explaining that annotations for the file might be incorrect.<p>
				674	</li>
				675
				676	<li>If you compile some files with <code>-g</code> and some without, some
				677	events that take place in a file without debug info could be attributed
				678	to the last line of a file with debug info (whichever one gets placed
				679	before the non-debug-info file in the executable).<p>
				680	</li>
				681	</ul>
				682
				683	This list looks long, but these cases should be fairly rare.<p>
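As a rough illustration of the alignment-padding case above, the sketch below
models debug info as a sorted table of start addresses, where each entry is
assumed to cover every address up to the start of the next entry. This is a
simplified model, not Valgrind's actual debug info reader: the names
<code>line_table</code> and <code>entry_for</code> are hypothetical, and each
entry is labelled with the instruction text from the listing rather than a real
line number. Because no entry exists for the padding at 0x8048f32, a lookup of
that address falls into the preceding entry.<p>

```python
import bisect

# Hypothetical line table: (start address, instruction the debug info describes).
# There is deliberately no entry for 0x8048f32, the padding `mov %esi,%esi`.
line_table = [
    (0x8048f25, "lea 0xfffffff4(%ebp),%eax"),
    (0x8048f28, "mov %eax,0x54(%ebx)"),
    (0x8048f2b, "movl $0x1,0xffffffec(%ebp)"),
    (0x8048f34, "mov $0x8078b08,%eax"),
    (0x8048f39, "mov %eax,0xfffffff0(%ebp)"),
]
starts = [start for start, _ in line_table]


def entry_for(addr):
    """Return the entry whose assumed range (up to the next start) contains addr."""
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        raise ValueError("address precedes the first entry")
    return line_table[index]


# The padding instruction's address resolves to the preceding movl entry.
print(entry_for(0x8048f32)[1])
```

Under this model, any counts recorded for the instruction at 0x8048f32 are
added to the <code>movl $0x1,0xffffffec(%ebp)</code> entry, which matches the
behaviour described above.<p>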
				684
				685	Note: stabs is not an easy format to read. If you come across bizarre
				686	annotations that look like they might be caused by a bug in the stabs reader,
				687	please let us know.<p>
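The 16-bit line number limit mentioned above can be illustrated with a small
sketch. It assumes that line numbers within a file mostly increase, and treats
a large backwards jump as an overflow of the 16-bit field. This is only one
plausible heuristic, not necessarily the guesses Valgrind itself makes; the
function name <code>unwrap_line_numbers</code> and the threshold are
hypothetical.<p>

```python
WRAP = 1 << 16  # a 16-bit field holds the values 0..65535


def unwrap_line_numbers(raw, threshold=WRAP // 2):
    """Guess true line numbers from 16-bit values that may have wrapped.

    Assumption: a drop of more than `threshold` between consecutive
    entries means the field overflowed rather than jumped backwards.
    """
    result = []
    wraps = 0
    previous = None
    for value in raw:
        if previous is not None and previous - value > threshold:
            wraps += 1
        result.append(value + wraps * WRAP)
        previous = value
    return result


# Line 65,536 is stored as 0 and line 65,537 as 1.
print(unwrap_line_numbers([65534, 65535, 0, 1]))
# prints [65534, 65535, 65536, 65537]
```

A heuristic of this kind cannot tell a genuine large backwards jump from an
overflow, which is the sort of ambiguity that leads to the warning message
described above.<p>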
				688
				689
				690	<h3>1.11  Accuracy</h3>
				691	Valgrind's cache profiling has a number of shortcomings:
				692
				693	<ul>
				694	<li>It doesn't account for kernel activity -- the effect of system calls on
				695	the cache contents is ignored.</li><p>
				696
				697	<li>It doesn't account for other process activity (although this is probably
				698	desirable when considering a single program).</li><p>
				699
				700	<li>It doesn't account for virtual-to-physical address mappings; hence the
				701	entire simulation is not a true representation of what's happening in the
				702	cache.</li><p>
				703
				704	<li>It doesn't account for cache misses not visible at the instruction level,
				705	e.g. those arising from TLB misses or speculative execution.</li><p>
				706
				707	<li>Valgrind's custom <code>malloc()</code> will allocate memory in different
				708	ways to the standard <code>malloc()</code>, which could warp the results.
				709	</li><p>
				710
				711	<li>Valgrind's custom threads implementation will schedule threads
				712	differently to the standard one. This too could warp the results for
				713	threaded programs.
				714	</li><p>
				715
				716	<li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
				717	will incorrectly be counted as doing a data read if both the arguments
				718	are registers, e.g.:
				719
				720	<blockquote><code>btsl %eax, %edx</code></blockquote>
				721
				722	This should only happen rarely.
				723	</li><p>
				724
				725	<li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
				726	<code>fsave</code>) are treated as though they only access 16 bytes.
				727	These instructions seem to be rare so hopefully this won't affect
				728	accuracy much.
				729	</li><p>
				730	</ul>
				731
				732	Another thing worth noting is that results are very sensitive to small changes. Changing the
				733	size of the <code>valgrind.so</code> file, the size of the program being
				734	profiled, or even the length of its name can perturb the results. Variations
				735	will be small, but don't expect perfectly repeatable results if your program
				736	changes at all.<p>
				737
				738	While these factors mean you shouldn't treat the results as exact, they
				739	should be close enough to be useful.<p>
				740
				741
				742	<h3>1.12  Todo</h3>
				743	<ul>
				744	<li>Program start-up/shut-down calls a lot of functions that aren't
				745	interesting and just complicate the output. It would be nice to exclude
				746	these somehow.</li>
				747	<p>
				748	</ul>
				749	<hr width="100%">
				750	</body>
				751	</html>
				752