<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="cg-manual" xreflabel="Cachegrind: a cache and branch-prediction profiler">
<title>Cachegrind: a cache and branch-prediction profiler</title>

<para>To use this tool, you must specify
<option>--tool=cachegrind</option> on the
Valgrind command line.</para>

<sect1 id="cg-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>Cachegrind simulates how your program interacts with a machine's cache
hierarchy and (optionally) branch predictor. It gathers the following
statistics:</para>
<itemizedlist>
<listitem>
<para>L1 instruction cache reads and read misses;</para>
</listitem>
<listitem>
<para>L1 data cache reads and read misses, writes and write
misses;</para>
</listitem>
<listitem>
<para>L2 unified cache reads and read misses, writes and
write misses;</para>
</listitem>
<listitem>
<para>Conditional branches and mispredicted conditional branches;</para>
</listitem>
<listitem>
<para>Indirect branches and mispredicted indirect branches. An
indirect branch is a jump or call to a destination only known at
run time.</para>
</listitem>
</itemizedlist>

<para>These statistics are presented for the entire program and for each
function in the program. You can also annotate each line of source code in
the program with the counts that were caused directly by it.</para>

<para>On a modern machine, an L1 miss will typically cost
around 10 cycles, an L2 miss can cost as much as 200
cycles, and a mispredicted branch costs in the region of 10
to 30 cycles. Detailed cache and branch profiling can be very useful
for understanding how your program interacts with the machine and thus how
to make it faster.</para>
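
<para>As a rough illustration of how these costs add up, the sketch below
turns event counts into an estimated number of stall cycles. It uses the
figures quoted above: 10 cycles per L1 miss, 200 cycles per L2 miss, and
20 cycles (the middle of the 10 to 30 range) per mispredicted branch. The
function name and constants are hypothetical. Charging an L2 miss on top
of the L1 miss that caused it is a modelling assumption. The estimate also
ignores the overlap between misses that real processors exploit, so treat
the result as an order-of-magnitude guide only.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Illustrative per-event penalties, taken from the rough figures quoted
   in the text; real costs vary from processor to processor. */
enum {
    L1_MISS_CYCLES    = 10,
    L2_MISS_CYCLES    = 200,
    MISPREDICT_CYCLES = 20   /* midpoint of the 10-30 cycle range */
};

/* Hypothetical helper: estimate the cycles lost to cache misses and
   branch mispredictions. An access that misses in L2 has already missed
   in L1, so it is charged both penalties. Overlap is ignored. */
uint64_t estimated_stall_cycles(uint64_t l1_misses, uint64_t l2_misses,
                                uint64_t mispredicts)
{
    assert(l2_misses <= l1_misses);  /* every L2 miss is also an L1 miss */
    return l1_misses * L1_MISS_CYCLES
         + l2_misses * L2_MISS_CYCLES
         + mispredicts * MISPREDICT_CYCLES;
}]]></programlisting>

<para>For example, 1,000 L1 misses (100 of which also miss in L2) plus 50
mispredicted branches give an estimate of 10,000 + 20,000 + 1,000 =
31,000 cycles.</para>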

<para>Also, since one instruction cache read is performed per
instruction executed, you can find out how many instructions are
executed per line, which can be useful for traditional profiling.</para>

<para>Branch profiling is not enabled by default. To use it, you must
additionally specify <option>--branch-sim=yes</option>
on the command line.</para>


<sect2 id="cg-manual.basics" xreflabel="Basics">
<title>Basics</title>

<para>First off, as for normal Valgrind use, you probably want to
compile with debugging info (the
<option>-g</option> flag). But unlike normal Valgrind use, you
probably <command>do</command> want to turn
optimisation on, since you should profile your program as it will
normally be run.</para>

<para>The two steps are:</para>
<orderedlist>
<listitem>
<para>Run your program with <computeroutput>valgrind
--tool=cachegrind</computeroutput> in front of the normal
command line invocation. When the program finishes,
Cachegrind will print summary cache statistics. It also
collects line-by-line information in a file
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>, where
<computeroutput>&lt;pid&gt;</computeroutput> is the program's process
ID.</para>

<para>Branch prediction statistics are not collected by default.
To do so, add the flag
<option>--branch-sim=yes</option>.
</para>

<para>This step should be done every time you want to collect
information about a new program, a changed program, or about
the same program with different input.</para>
</listitem>

<listitem>
<para>Generate a function-by-function summary, and possibly
annotate source files, using the supplied
cg_annotate program. Source
files to annotate can be specified manually on
the command line, or "interesting" source files can be
annotated automatically with the
<option>--auto=yes</option> option. You can
annotate C/C++ files or assembly language files equally
easily.</para>

<para>This step can be performed as many times as you like
for each Step 1. You may want to do multiple annotations
showing different information each time.</para>
</listitem>

</orderedlist>

<para>As an optional intermediate step, you can use the supplied
cg_merge program to sum together the
outputs of multiple Cachegrind runs into a single file, which you then
use as the input for cg_annotate.</para>
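
<para>Conceptually, combining runs means adding together the counts
that each run recorded for the same event at the same source location.
The sketch below shows that per-location sum for the nine cache events.
The <computeroutput>EventCounts</computeroutput> type and the
<computeroutput>merge_counts</computeroutput> function are hypothetical.
They say nothing about how cg_merge is implemented or about the output
file format.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-location counts for the nine cache events
   (Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw), in that order. */
#define N_EVENTS 9

typedef struct {
    uint64_t counts[N_EVENTS];
} EventCounts;

/* Add one run's counts for a location into an accumulator,
   event by event. */
void merge_counts(EventCounts *acc, const EventCounts *run)
{
    assert(acc != NULL && run != NULL);
    for (int i = 0; i < N_EVENTS; i++)
        acc->counts[i] += run->counts[i];
}]]></programlisting>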

<para>These steps are described in detail in the following
sections.</para>

</sect2>


<sect2 id="cache-sim" xreflabel="Cache simulation specifics">
<title>Cache simulation specifics</title>

<para>Cachegrind simulates a machine with independent
first-level instruction and data caches (I1 and D1), backed by a
unified second-level cache (L2). This configuration is used by almost
all modern machines. Some old Cyrix CPUs had a unified I and D L1
cache, but they are ancient history now.</para>

<para>Specific characteristics of the simulation are as
follows:</para>

<itemizedlist>

<listitem>
<para>Write-allocate: when a write miss occurs, the block
written to is brought into the D1 cache. Most modern caches
have this property.</para>
</listitem>

<listitem>
<para>Bit-selection hash function: the set of line(s) in the cache
to which a memory block maps is chosen by the middle bits
M to (M+N-1) of the byte address, where:</para>
<itemizedlist>
<listitem>
<para>line size = 2^M bytes</para>
</listitem>
<listitem>
<para>number of sets = (cache size / line size / associativity) = 2^N</para>
</listitem>
</itemizedlist>
</listitem>

<listitem>
<para>Inclusive L2 cache: the L2 cache typically replicates all
the entries of the L1 caches, because fetching into L1 involves
fetching into L2 first (this does not guarantee strict inclusiveness,
as lines evicted from L2 could still reside in L1). This is
standard on Pentium chips, but AMD Opterons, Athlons and Durons
use an exclusive L2 cache that only holds
blocks evicted from L1. The same is true of most modern VIA CPUs.</para>
</listitem>

</itemizedlist>
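
<para>The bit-selection mapping can be written out directly. The sketch
below assumes that both the line size and the number of sets are powers
of two, as the simulation requires. For a 65536 B, 2-way associative
cache with 64 B lines there are 65536 / 64 / 2 = 512 sets. That gives
M = 6 and N = 9, so the set index is taken from bits 6 to 14 of the
address. The function names are illustrative and are not taken from
<computeroutput>cg_sim.c</computeroutput>.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* log2 of a power of two, i.e. the position of its single set bit. */
static unsigned log2_exact(uint64_t x)
{
    assert(x != 0 && (x & (x - 1)) == 0);  /* must be a power of two */
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Set index chosen by bits M .. (M+N-1) of the byte address, where
   line size = 2^M bytes and number of sets = 2^N. */
uint64_t set_index(uint64_t addr, uint64_t cache_size,
                   uint64_t assoc, uint64_t line_size)
{
    uint64_t n_sets = cache_size / line_size / assoc;
    unsigned m = log2_exact(line_size);
    unsigned n = log2_exact(n_sets);
    return (addr >> m) & ((UINT64_C(1) << n) - 1);
}]]></programlisting>

<para>With that configuration,
<computeroutput>set_index(0x12345, 65536, 2, 64)</computeroutput> returns
141. Shifting out the 6 offset bits gives 1165, and keeping the low 9 bits
of 1165 gives 141.</para>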

<para>The cache configuration simulated (cache size,
associativity and line size) is determined automagically using
the x86 CPUID instruction. If you have a machine that (a)
doesn't support the CPUID instruction, or (b) supports it in an
early incarnation that doesn't give any cache information, then
Cachegrind will fall back to using a default configuration (that
of a model 3/4 Athlon). Cachegrind will tell you if this
happens. You can manually specify one, two or all three levels
(I1/D1/L2) of the cache from the command line using the
<option>--I1</option>,
<option>--D1</option> and
<option>--L2</option> options.
For cache parameters to be valid for simulation, the number
of sets (with associativity being the number of cache lines in
each set) has to be a power of two.</para>
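
<para>These options take a size, an associativity and a line size as three
comma-separated numbers, and the power-of-two rule can be checked in a few
lines. The sketch below accepts a
<computeroutput>&lt;size&gt;,&lt;associativity&gt;,&lt;line size&gt;</computeroutput>
string such as <computeroutput>65536,2,64</computeroutput> (512 sets). It
rejects one such as <computeroutput>65535,2,64</computeroutput>, whose set
count is not even a whole number.
<computeroutput>valid_cache_spec</computeroutput> is a hypothetical name,
and the sketch checks only the rules stated here. Cachegrind's own
validation may apply further checks.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool is_power_of_two(unsigned long x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical check of a "<size>,<associativity>,<line size>" value:
   exactly three comma-separated decimal numbers with no spaces, where the
   number of sets, size / (associativity * line size), is a whole number
   and a power of two. */
bool valid_cache_spec(const char *spec)
{
    unsigned long size, assoc, line;
    int consumed = 0;

    assert(spec != NULL);
    /* Reject anything other than digits and commas (spaces, signs, ...). */
    if (spec[strspn(spec, "0123456789,")] != '\0')
        return false;
    /* Exactly three numbers, with nothing left over afterwards. */
    if (sscanf(spec, "%lu,%lu,%lu%n", &size, &assoc, &line, &consumed) != 3
        || spec[consumed] != '\0')
        return false;
    if (assoc == 0 || line == 0 || size % (assoc * line) != 0)
        return false;
    return is_power_of_two(size / (assoc * line));
}]]></programlisting>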

<para>On PowerPC platforms
Cachegrind cannot automatically
determine the cache configuration, so you will
need to specify it with the
<option>--I1</option>,
<option>--D1</option> and
<option>--L2</option> options.</para>


<para>Other noteworthy behaviour:</para>

<itemizedlist>
<listitem>
<para>References that straddle two cache lines are treated as
follows:</para>
<itemizedlist>
<listitem>
<para>If both blocks hit --> counted as one hit.</para>
</listitem>
<listitem>
<para>If one block hits and the other misses --> counted
as one miss.</para>
</listitem>
<listitem>
<para>If both blocks miss --> counted as one miss (not
two).</para>
</listitem>
</itemizedlist>
</listitem>

<listitem>
<para>Instructions that modify a memory location
(e.g. <computeroutput>inc</computeroutput> and
<computeroutput>dec</computeroutput>) are counted as doing
just a read, i.e. a single data reference. This may seem
strange, but since the write can never cause a miss (the read
guarantees the block is in the cache) it's not very
interesting.</para>

<para>Thus it measures not the number of times the data cache
is accessed, but the number of times a data cache miss could
occur.</para>
</listitem>

</itemizedlist>
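
<para>The straddling rule amounts to this: a reference is counted once,
and it counts as a miss if either of the two lines it touches misses. The
sketch below expresses that rule. It also includes a test for whether a
reference crosses a line boundary at all. The names are hypothetical, and
the code illustrates the rule rather than reproducing Cachegrind's
implementation.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes starting at `addr` touches two
   lines. Assumes 1 <= size <= line_size, so at most two lines are hit. */
bool straddles(uint64_t addr, uint64_t size, uint64_t line_size)
{
    assert(size >= 1 && size <= line_size);
    return (addr % line_size) + size > line_size;
}

typedef enum { ACCESS_HIT, ACCESS_MISS } AccessResult;

/* Combine the outcomes for the two lines of a straddling reference:
   it is counted once, as a miss if either (or both) lines missed. */
AccessResult combine_straddle(bool first_hit, bool second_hit)
{
    return (first_hit && second_hit) ? ACCESS_HIT : ACCESS_MISS;
}]]></programlisting>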

<para>If you are interested in simulating a cache with different
properties, it is not particularly hard to write your own cache
simulator, or to modify the existing ones in
<computeroutput>cg_sim.c</computeroutput>. We'd be
interested to hear from anyone who does.</para>

</sect2>


<sect2 id="branch-sim" xreflabel="Branch simulation specifics">
<title>Branch simulation specifics</title>

<para>Cachegrind simulates branch predictors intended to be
typical of mainstream desktop/server processors of around 2004.</para>

<para>Conditional branches are predicted using an array of 16384 2-bit
saturating counters. The array index used for a branch instruction is
computed partly from the low-order bits of the branch instruction's
address and partly using the taken/not-taken behaviour of the last few
conditional branches. As a result the predictions for any specific
branch depend both on its own history and the behaviour of previous
branches. This is a standard technique for improving prediction
accuracy.</para>
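
<para>A predictor of this kind can be sketched as follows. The table of
16384 two-bit counters comes from the description above. Several details
are assumptions made for illustration: the history length (8 bits), the
way address bits and history are combined (an XOR, as in the well-known
gshare scheme), and the initial counter values. Cachegrind's own indexing
may differ.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_COUNTERS   16384u  /* 2^14 two-bit saturating counters */
#define HISTORY_BITS 8u      /* assumed length of the global history */

typedef struct {
    uint8_t  counters[N_COUNTERS]; /* 0,1: predict not taken; 2,3: taken */
    uint32_t history;              /* recent outcomes, newest in bit 0 */
} CondPredictor;

void cond_init(CondPredictor *p)
{
    for (uint32_t i = 0; i < N_COUNTERS; i++)
        p->counters[i] = 1;        /* assumed start state: weakly not taken */
    p->history = 0;
}

/* gshare-style index: low-order address bits XORed with global history. */
static uint32_t cond_index(const CondPredictor *p, uint64_t branch_addr)
{
    uint32_t hist = p->history & ((1u << HISTORY_BITS) - 1);
    uint32_t index = ((uint32_t)branch_addr ^ hist) & (N_COUNTERS - 1);
    assert(index < N_COUNTERS);
    return index;
}

/* Predict one branch, then train its counter and the global history on
   the actual outcome. Returns true if the prediction was wrong. */
bool cond_predict_and_update(CondPredictor *p, uint64_t branch_addr, bool taken)
{
    uint32_t i = cond_index(p, branch_addr);
    bool predicted_taken = p->counters[i] >= 2;

    if (taken && p->counters[i] < 3)
        p->counters[i]++;
    else if (!taken && p->counters[i] > 0)
        p->counters[i]--;

    p->history = (p->history << 1) | (taken ? 1u : 0u);
    return predicted_taken != taken;
}]]></programlisting>

<para>For a branch that is always taken, the first nine executions in this
sketch mispredict. While the 8-bit history fills with ones, each new history
value selects a fresh counter. Once the history is all ones, the same counter
is reused, and after one more miss it predicts correctly. Mixing history
into the index therefore costs some warm-up before a branch's behaviour is
learned.</para>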

<para>For indirect branches (that is, jumps to unknown destinations)
Cachegrind uses a simple branch target address predictor. Targets are
predicted using an array of 512 entries indexed by the low-order 9
bits of the branch instruction's address. Each branch is predicted to
jump to the same address it did last time. Any other behaviour causes
a mispredict.</para>
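
<para>The indirect-branch predictor described above is small enough to
write out in full. It is a 512-entry table, indexed by the low-order 9 bits
of the branch address, that remembers the last target seen. The sketch below
follows that description. The names and the all-zero initial table are
assumptions.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define N_TARGETS 512u  /* indexed by the low-order 9 bits of the address */

typedef struct {
    uint64_t last_target[N_TARGETS];
} IndirectPredictor;

void indirect_init(IndirectPredictor *p)
{
    assert(p != NULL);
    memset(p->last_target, 0, sizeof p->last_target);
}

/* Predict that the branch jumps where it jumped last time, then record
   the actual target. Returns true on a mispredict. */
bool indirect_predict_and_update(IndirectPredictor *p, uint64_t branch_addr,
                                 uint64_t actual_target)
{
    uint32_t i = (uint32_t)(branch_addr & (N_TARGETS - 1));
    bool mispredict = p->last_target[i] != actual_target;
    p->last_target[i] = actual_target;
    return mispredict;
}]]></programlisting>

<para>Because only 9 address bits are used, two branches whose addresses
agree in those bits share an entry, and they can disturb each other's
predictions. For example, 0x1000 and 0x1200 both map to entry 0.</para>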

<para>More recent processors have better branch predictors, in
particular better indirect branch predictors. Cachegrind's predictor
design is deliberately conservative so as to be representative of the
large installed base of processors which pre-date widespread
deployment of more sophisticated indirect branch predictors. In
particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
2 have more sophisticated indirect branch predictors than modelled by
Cachegrind.</para>

<para>Cachegrind does not simulate a return stack predictor. It
assumes that processors perfectly predict function return addresses,
an assumption which is probably close to being true.</para>

<para>See Hennessy and Patterson's classic text "Computer
Architecture: A Quantitative Approach", 4th edition (2007), Section
2.3 (pages 80-89) for background on modern branch predictors.</para>

</sect2>


</sect1>



<sect1 id="cg-manual.profile" xreflabel="Profiling programs">
<title>Profiling programs</title>

<para>To gather cache profiling information about the program
<computeroutput>ls -l</computeroutput>, invoke Cachegrind like
this:</para>

<programlisting><![CDATA[
valgrind --tool=cachegrind ls -l]]></programlisting>

<para>The program will execute (slowly). Upon completion,
summary statistics that look like this will be printed:</para>

<programlisting><![CDATA[
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751==
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)]]></programlisting>

<para>Cache accesses for instruction fetches are summarised
first, giving the number of fetches made (this is the number of
instructions executed, which can be useful to know in its own
right), the number of I1 misses, and the number of L2 instruction
(<computeroutput>L2i</computeroutput>) misses.</para>

<para>Cache accesses for data follow. The information is similar
to that of the instruction fetches, except that the values are
also shown split between reads and writes (note each row's
<computeroutput>rd</computeroutput> and
<computeroutput>wr</computeroutput> values add up to the row's
total).</para>

<para>Combined instruction and data figures for the L2 cache
follow that.</para>
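
<para>The combined L2 line is the sum of the instruction and data lines.
Instruction fetches count as reads, so 275 + 3,987 = 4,262 read misses.
Each miss rate is misses divided by references. The sketch below recomputes
some of the figures from the example output, and the names are
hypothetical. It truncates rates to one decimal place because that
reproduces every rate shown above. Rounding would instead give 0.3% rather
than 0.2% for the overall D1 miss rate. This is an observation about this
output, not a documented rule.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t rd;
    uint64_t wr;
} RdWr;

/* Combined L2 misses: instruction-fetch misses are reads, so they are
   added to the data read misses; write misses come only from data. */
RdWr combined_l2_misses(uint64_t l2i_misses, RdWr l2d_misses)
{
    RdWr total = { l2i_misses + l2d_misses.rd, l2d_misses.wr };
    return total;
}

/* Miss rate in tenths of a percent, truncated: 2 means "0.2%". */
uint64_t miss_rate_tenths(uint64_t misses, uint64_t refs)
{
    assert(refs > 0);
    return misses * 1000 / refs;
}]]></programlisting>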



<sect2 id="cg-manual.outputfile" xreflabel="Output file">
<title>Output file</title>

<para>As well as printing summary information, Cachegrind also
writes line-by-line cache profiling information to a user-specified
file. By default this file is named
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>. This file
is human-readable, but is intended to be interpreted by the accompanying
program cg_annotate, described in the next section.</para>

<para>Things to note about the
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>
file:</para>

<itemizedlist>
<listitem>
<para>It is written every time Cachegrind is run, and will
overwrite any existing
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>
in the current directory (but that won't happen very often
because it takes some time for process IDs to be
recycled).</para>
</listitem>
<listitem>
<para>To use an output file name other than the default
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>,
use the <option>--cachegrind-out-file</option>
option.</para>
</listitem>
<listitem>
<para>It can be big: <computeroutput>ls -l</computeroutput>
generates a file of about 350 KB. Browsing a few files and
web pages with a Konqueror built with full debugging
information generates a file of around 15 MB.</para>
</listitem>
</itemizedlist>

<para>The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix
on the output file name serves two purposes. Firstly, it means you
don't have to rename old output files that you don't want to overwrite.
Secondly, and more importantly, it allows correct profiling with the
<option>--trace-children=yes</option> option of
programs that spawn child processes.</para>

</sect2>



<sect2 id="cg-manual.cgopts" xreflabel="Cachegrind options">
<title>Cachegrind options</title>

<!-- start of xi:include in the manpage -->
<para id="cg.opts.para">Using command line options, you can
manually specify the I1/D1/L2 cache
configuration to simulate. For each cache, you can specify the
size, associativity and line size. The size and line size
are measured in bytes. The three items
must be comma-separated, but with no spaces, e.g.:
<literallayout> valgrind --tool=cachegrind --I1=65536,2,64</literallayout>

You can specify one, two or three of the I1/D1/L2 caches. Any level not
manually specified will be simulated using the configuration found in
the normal way (via the CPUID instruction for automagic cache
configuration, or failing that, via defaults).</para>

<para>Cache-simulation specific options are:</para>

<variablelist id="cg.opts.list">

<varlistentry id="opt.I1" xreflabel="--I1">
<term>
<option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
</term>
<listitem>
<para>Specify the size, associativity and line size of the level 1
instruction cache.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.D1" xreflabel="--D1">
<term>
<option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
</term>
<listitem>
<para>Specify the size, associativity and line size of the level 1
data cache.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.L2" xreflabel="--L2">
<term>
<option><![CDATA[--L2=<size>,<associativity>,<line size> ]]></option>
</term>
<listitem>
<para>Specify the size, associativity and line size of the level 2
cache.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
<term>
<option><![CDATA[--cachegrind-out-file=<file> ]]></option>
</term>
<listitem>
<para>Write the profile data to
<computeroutput>file</computeroutput> rather than to the default
output file,
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>. The
<option>%p</option> and <option>%q</option> format specifiers
can be used to embed the process ID and/or the contents of an
environment variable in the name, as is the case for the core
option <option>--log-file</option>. See <link
linkend="manual-core.basicopts">here</link> for details.
</para>
</listitem>
</varlistentry>

<varlistentry id="opt.cache-sim" xreflabel="--cache-sim">
<term>
<option><![CDATA[--cache-sim=no|yes [yes] ]]></option>
</term>
<listitem>
<para>Enables or disables collection of cache access and miss
counts.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.branch-sim" xreflabel="--branch-sim">
<term>
<option><![CDATA[--branch-sim=no|yes [no] ]]></option>
</term>
<listitem>
<para>Enables or disables collection of branch instruction and
misprediction counts. By default this is disabled as it
slows Cachegrind down by approximately 25%. Note that you
cannot specify <option>--cache-sim=no</option>
and <option>--branch-sim=no</option>
together, as that would leave Cachegrind with no
information to collect.</para>
</listitem>
</varlistentry>

</variablelist>
<!-- end of xi:include in the manpage -->

</sect2>



<sect2 id="cg-manual.annotate" xreflabel="Annotating C/C++ programs">
<title>Annotating C/C++ programs</title>

<para>Before using cg_annotate,
it is worth widening your window to be at least 120 characters
wide if possible, as the output lines can be quite long.</para>

<para>To get a function-by-function summary, run <computeroutput>cg_annotate
&lt;filename&gt;</computeroutput> on a Cachegrind output file.</para>

<para>The output looks like this:</para>

<programlisting><![CDATA[
--------------------------------------------------------------------------------
I1 cache:              65536 B, 64 B, 2-way associative
D1 cache:              65536 B, 64 B, 2-way associative
L2 cache:              262144 B, 64 B, 8-way associative
Command:               concord vg_to_ucode.c
Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold:             99%
Chosen for annotation:
Auto-annotation:       on

--------------------------------------------------------------------------------
        Ir I1mr I2mr         Dr   D1mr  D2mr        Dw   D1mw   D2mw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS

--------------------------------------------------------------------------------
        Ir I1mr I2mr         Dr   D1mr  D2mr        Dw   D1mw   D2mw file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0 getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1 concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      . vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0 concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0 ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0 concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1 ???:???
   598,068    1    1    299,034      0     0   149,517      0      0 ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0 ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0 vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957 concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0 vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53 concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1 concord.c:create
   149,518    0    0    149,516      0     0         1      0      0 ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0 ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150 concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0 vg_clientmalloc.c:vg_bogus_epilogue]]></programlisting>
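
<para>The rows above are ordered by <computeroutput>Ir</computeroutput>,
highest first, with ties broken by the following events in turn. That is
why <computeroutput>__flockfile</computeroutput> (one
<computeroutput>I1mr</computeroutput>) is listed before
<computeroutput>__funlockfile</computeroutput> (none) despite their equal
<computeroutput>Ir</computeroutput> counts. A lexicographic comparator
for <computeroutput>qsort</computeroutput> captures this ordering. The
sketch below also computes a function's share of the program total. For
<computeroutput>getc.c:_IO_getc</computeroutput> that share is
8,821,482 / 27,742,716, or about 31.8% of all instructions executed. The
names are hypothetical and only three events are kept for brevity. This is
not cg_annotate's code.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N_SORT_EVENTS 3  /* e.g. Ir, I1mr, I2mr, in sort-priority order */

typedef struct {
    const char *name;                /* "file:function" */
    uint64_t counts[N_SORT_EVENTS];  /* counts in sort-priority order */
} FuncRow;

/* qsort comparator: descending by the first event, ties broken by the
   next event, and so on. */
int compare_rows(const void *a, const void *b)
{
    const FuncRow *x = a, *y = b;
    for (int i = 0; i < N_SORT_EVENTS; i++) {
        if (x->counts[i] != y->counts[i])
            return x->counts[i] > y->counts[i] ? -1 : 1;
    }
    return 0;
}

/* A function's share of the program total for one event, in percent. */
double percent_of_total(uint64_t count, uint64_t total)
{
    assert(total > 0);
    return 100.0 * (double)count / (double)total;
}]]></programlisting>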
				533
				534
				535	<para>First up is a summary of the annotation options:</para>
				536
				537	<itemizedlist>
				538
				539	<listitem>
				540	<para>I1 cache, D1 cache, L2 cache: cache configuration. So
				541	you know the configuration with which these results were
				542	obtained.</para>
				543	</listitem>
				544
				545	<listitem>
				546	<para>Command: the command line invocation of the program
				547	under examination.</para>
				548	</listitem>
				549
				550	<listitem>
				551	<para>Events recorded: event abbreviations are:</para>
				552	<itemizedlist>
				553	<listitem>
<para><computeroutput>Ir</computeroutput>: I cache reads
(ie. instructions executed)</para>
</listitem>
<listitem>
<para><computeroutput>I1mr</computeroutput>: I1 cache read
misses</para>
</listitem>
<listitem>
<para><computeroutput>I2mr</computeroutput>: L2 cache
instruction read misses</para>
</listitem>
<listitem>
<para><computeroutput>Dr</computeroutput>: D cache reads
(ie. memory reads)</para>
</listitem>
<listitem>
<para><computeroutput>D1mr</computeroutput>: D1 cache read
misses</para>
</listitem>
<listitem>
<para><computeroutput>D2mr</computeroutput>: L2 cache data
read misses</para>
</listitem>
<listitem>
<para><computeroutput>Dw</computeroutput>: D cache writes
(ie. memory writes)</para>
</listitem>
<listitem>
<para><computeroutput>D1mw</computeroutput>: D1 cache write
misses</para>
</listitem>
<listitem>
<para><computeroutput>D2mw</computeroutput>: L2 cache data
write misses</para>
</listitem>
<listitem>
<para><computeroutput>Bc</computeroutput>: Conditional branches
executed</para>
</listitem>
<listitem>
<para><computeroutput>Bcm</computeroutput>: Conditional branches
mispredicted</para>
</listitem>
<listitem>
<para><computeroutput>Bi</computeroutput>: Indirect branches
executed</para>
</listitem>
<listitem>
<para><computeroutput>Bim</computeroutput>: Indirect branches
mispredicted</para>
</listitem>
</itemizedlist>
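
<para>Note that the total number of D1 misses is given by
<computeroutput>D1mr</computeroutput> +
<computeroutput>D1mw</computeroutput>, and that the total number of
L2 misses is given by <computeroutput>I2mr</computeroutput> +
<computeroutput>D2mr</computeroutput> +
<computeroutput>D2mw</computeroutput>. Since every L1 miss results
in an L2 access, the total number of L2 accesses is given by
<computeroutput>I1mr</computeroutput> +
<computeroutput>D1mr</computeroutput> +
<computeroutput>D1mw</computeroutput>.</para>
</listitem>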

<listitem>
<para>Events shown: the events displayed, which are a subset of the
events gathered. This can be adjusted with the
<option>--show</option> option.</para>
</listitem>

<listitem>
<para>Event sort order: the sort order in which functions are
shown. For example, in this case the functions are sorted
from highest <computeroutput>Ir</computeroutput> counts to
lowest. If two functions have identical
<computeroutput>Ir</computeroutput> counts, they will then be
sorted by <computeroutput>I1mr</computeroutput> counts, and
so on. This order can be adjusted with the
<option>--sort</option> option.</para>

<para>Note that this dictates the order in which the functions appear.
It is <command>not</command> the order in which the columns
appear; that is dictated by the "events shown" line (and can
be changed with the <option>--show</option>
option).</para>
</listitem>

<listitem>
<para>Threshold: by default, cg_annotate
omits functions that have very low counts
to avoid drowning you in information. In this case,
cg_annotate summarises the functions that account for
99% of the <computeroutput>Ir</computeroutput> counts;
<computeroutput>Ir</computeroutput> is chosen as the
threshold event since it is the primary sort event. The
threshold can be adjusted with the
<option>--threshold</option>
option.</para>
</listitem>

<listitem>
<para>Chosen for annotation: names of files specified
manually for annotation; in this case none.</para>
</listitem>

<listitem>
<para>Auto-annotation: whether auto-annotation was requested
via the <option>--auto=yes</option>
option. In this case no.</para>
</listitem>

</itemizedlist>
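
<para>A little arithmetic shows how these events relate to one another. The following Python sketch is illustrative only. The helper <computeroutput>derived_totals</computeroutput> and its dictionary input are assumptions made for this example and are not part of cg_annotate. The counts are made up. The sketch relies on the fact that, in Cachegrind's simulated hierarchy, every L1 miss (instruction or data) is looked up in L2.</para>

<programlisting><![CDATA[
# Illustrative sketch only: derived_totals() is a hypothetical helper,
# not part of Cachegrind or cg_annotate.

def derived_totals(counts):
    """Derive aggregate cache figures from per-event counts.

    counts maps event names (Ir, I1mr, I2mr, Dr, D1mr, D2mr, Dw,
    D1mw, D2mw) to integer counts.
    """
    d1_accesses = counts["Dr"] + counts["Dw"]
    d1_misses = counts["D1mr"] + counts["D1mw"]
    # Every L1 miss, instruction or data, leads to an L2 access.
    l2_accesses = counts["I1mr"] + counts["D1mr"] + counts["D1mw"]
    l2_misses = counts["I2mr"] + counts["D2mr"] + counts["D2mw"]
    return {
        "D1 accesses": d1_accesses,
        "D1 misses": d1_misses,
        "D1 miss rate": d1_misses / d1_accesses if d1_accesses else 0.0,
        "L2 accesses": l2_accesses,
        "L2 misses": l2_misses,
    }


# Made-up counts, chosen only to keep the arithmetic easy to follow.
example = {"Ir": 1000, "I1mr": 10, "I2mr": 5,
           "Dr": 400, "D1mr": 20, "D2mr": 8,
           "Dw": 100, "D1mw": 5, "D2mw": 2}
totals = derived_totals(example)
print(totals)
]]></programlisting>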

<para>Next come summary statistics for the whole
program. These are similar to the summary provided when running
<computeroutput>valgrind --tool=cachegrind</computeroutput>.</para>

<para>Then come the function-by-function statistics. Each function
is identified by a
<computeroutput>file_name:function_name</computeroutput> pair. If
a column contains only a dot it means the function never performs
that event (eg. the third row shows that
<computeroutput>strcmp()</computeroutput> contains no
instructions that write to memory). The name
<computeroutput>???</computeroutput> is used if the file name
and/or function name could not be determined from debugging
information. If most of the entries have the form
<computeroutput>???:???</computeroutput> the program probably
wasn't compiled with <option>-g</option>. If any
code was invalidated (either due to self-modifying code or
unloading of shared objects) its counts are aggregated into a
single cost centre written as
<computeroutput>(discarded):(discarded)</computeroutput>.</para>

<para>It is worth noting that functions will come both from
the profiled program (eg. <filename>concord.c</filename>)
and from libraries (eg. <filename>getc.c</filename>).</para>

<para>There are two ways to annotate source files: by choosing
them manually, or with the
<option>--auto=yes</option> option. To do it
manually, just specify the filenames as additional arguments to
cg_annotate. For example, running
<computeroutput>cg_annotate &lt;filename&gt; concord.c</computeroutput>
on our example produces the same output as above,
followed by an annotated version of <filename>concord.c</filename>, a
section of which looks like:</para>

<programlisting><![CDATA[
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
     Ir I1mr I2mr     Dr D1mr D2mr     Dw D1mw D2mw

[snip]

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i < TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data->word, data->line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }]]></programlisting>

<para>(Although column widths are automatically minimised, a wide
terminal is clearly useful.)</para>

<para>Each source file is clearly marked
(<computeroutput>User-annotated source</computeroutput>) as
having been chosen manually for annotation. If the file was
found in one of the directories specified with the
<option>-I</option>/<option>--include</option> option, the directory
and file are both given.</para>

<para>Each line is annotated with its event counts. Events not
applicable for a line are represented by a dot. This is useful
for distinguishing between an event which cannot happen, and one
which can but did not.</para>
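
<para>A tiny formatter can illustrate the difference between a dot and a zero. This is a hypothetical Python example: <computeroutput>format_count</computeroutput> and <computeroutput>format_row</computeroutput> are not cg_annotate's code. <computeroutput>None</computeroutput> stands for an event that cannot occur on a line and prints as a dot. <computeroutput>0</computeroutput> stands for an event that could have occurred but did not. Counts get thousands separators and are right-aligned to a per-column width, mimicking the listing above.</para>

<programlisting><![CDATA[
# Hypothetical sketch of cg_annotate-style row formatting; not the tool's code.

def format_count(value):
    """None means 'not applicable' (shown as a dot); 0 is a real zero."""
    if value is None:
        return "."
    return f"{value:,}"


def format_row(values, widths, source_line):
    """Right-align each count to its column width, then append the source."""
    cells = [format_count(v).rjust(w) for v, w in zip(values, widths)]
    return " ".join(cells) + "  " + source_line


# Made-up widths for three columns (Ir, Dr, Dw) in this example.
widths = [7, 6, 6]
print(format_row([4991, 1995, 998], widths, "for (i = 0; i < TABLE_SIZE; i++)"))
print(format_row([2, 1, None], widths, "if (!(file_ptr)) {"))
]]></programlisting>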

<para>Sometimes only a small section of a source file is
executed. To minimise uninteresting output, cg_annotate only shows
annotated lines and lines within a small distance of annotated
lines. Gaps are marked with the line numbers so you know which
part of a file the shown code comes from, eg:</para>

<programlisting><![CDATA[
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)]]></programlisting>

<para>The amount of context to show around annotated lines is
controlled by the <option>--context</option>
option.</para>
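
<para>One way to picture the effect of <option>--context</option> is as a merge of line ranges. The Python sketch below rests on our own assumptions. The helper <computeroutput>shown_ranges</computeroutput> is hypothetical, and cg_annotate's exact rules for joining nearby ranges may differ. Each annotated line is widened by N lines on either side, and overlapping or adjacent ranges are merged. Whatever lies between the resulting ranges is a gap, of the kind marked by a <computeroutput>-- line NNN --</computeroutput> separator.</para>

<programlisting><![CDATA[
# Hypothetical illustration of context-window merging; not cg_annotate's code.

def shown_ranges(annotated_lines, context, file_length):
    """Return the inclusive (first, last) line ranges that would be printed.

    annotated_lines: 1-based line numbers that have event counts.
    context:         lines of context shown before and after each one.
    file_length:     number of lines in the source file.
    """
    ranges = []
    for line in sorted(annotated_lines):
        first = max(1, line - context)
        last = min(file_length, line + context)
        if ranges and first <= ranges[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], last))
        else:
            ranges.append((first, last))
    return ranges


# Lines 10 and 12 share one window; line 40 starts a new one after a gap.
print(shown_ranges([10, 12, 40], context=2, file_length=100))
]]></programlisting>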

<para>To get automatic annotation, run
<computeroutput>cg_annotate &lt;filename&gt; --auto=yes</computeroutput>.
cg_annotate will automatically annotate every source file it can
find that is mentioned in the function-by-function summary.
Therefore, the files chosen for auto-annotation are affected by
the <option>--sort</option> and
<option>--threshold</option> options. Each
source file is clearly marked (<computeroutput>Auto-annotated
source</computeroutput>) as being chosen automatically. Any
files that could not be found are mentioned at the end of the
output, eg:</para>

<programlisting><![CDATA[
------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
------------------------------------------------------------------
getc.c
ctype.c
../sysdeps/generic/lockfile.c]]></programlisting>

<para>This is quite common for library files, since libraries are
usually compiled with debugging information, but the source files
are often not present on a system. If a file is chosen for
annotation <command>both</command> manually and automatically, it
is marked as <computeroutput>User-annotated
source</computeroutput>. Use the
<option>-I</option>/<option>--include</option> option to tell cg_annotate where
to look for source files if the filenames found from the debugging
information aren't specific enough.</para>

<para>Beware that cg_annotate can take some time to digest large
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> files,
e.g. 30 seconds or more. Also beware that auto-annotation can
produce a lot of output if your program is large!</para>

</sect2>


<sect2 id="cg-manual.assembler" xreflabel="Annotating assembler programs">
<title>Annotating assembly code programs</title>

<para>cg_annotate can annotate assembly code programs too, or annotate
the assembly code generated for your C program. Sometimes this is
useful for understanding what is really happening when an
interesting line of C code is translated into multiple
instructions.</para>

<para>To do this, you just need to assemble your
<computeroutput>.s</computeroutput> files with assembly-level debug
information. You can use <computeroutput>gcc
-S</computeroutput> to compile C/C++ programs to assembly code, and then
<computeroutput>gcc -g</computeroutput> on the assembly code files to
achieve this. You can then profile and annotate the assembly code source
files in the same way as C/C++ source files.</para>

</sect2>

<sect2 id="ms-manual.forkingprograms" xreflabel="Forking Programs">
<title>Forking Programs</title>
<para>If your program forks, the child will inherit all the profiling data that
has been gathered for the parent.</para>

<para>If the output file format string (controlled by
<option>--cachegrind-out-file</option>) does not contain <option>%p</option>,
then the outputs from the parent and child will be intermingled in a single
output file, which will almost certainly make it unreadable by
cg_annotate.</para>
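
<para>A simplified model shows why <option>%p</option> matters. In the hypothetical Python helper below, <computeroutput>expand_out_file</computeroutput> substitutes only <option>%p</option> (the process ID) and ignores any other format directives Valgrind may support. The PIDs are made up. A template containing <option>%p</option> yields distinct names for parent and child, while a fixed name makes the two collide.</para>

<programlisting><![CDATA[
# Simplified, hypothetical model of output-file-name expansion: only %p
# (the process ID) is handled; this is not Valgrind's implementation.

def expand_out_file(template, pid):
    return template.replace("%p", str(pid))


parent_pid, child_pid = 1234, 1235  # made-up PIDs for illustration

# A template containing %p gives each process its own file...
print(expand_out_file("cachegrind.out.%p", parent_pid))
print(expand_out_file("cachegrind.out.%p", child_pid))

# ...whereas a fixed name makes parent and child write to the same file.
print(expand_out_file("cachegrind.out", parent_pid))
print(expand_out_file("cachegrind.out", child_pid))
]]></programlisting>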
</sect2>


</sect1>


<sect1 id="cg-manual.annopts" xreflabel="cg_annotate options">
<title>cg_annotate options</title>

<itemizedlist>

<listitem>
<para><option>-h --help</option></para>
<para><option>-v --version</option></para>
<para>Help and version, as usual.</para>
</listitem>

<listitem id="sort">
<para><option>--sort=A,B,C</option> [default:
order in
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>]</para>
<para>Specifies the events upon which the sorting of the
function-by-function entries will be based. Useful if you
want to concentrate on eg. I cache misses
(<option>--sort=I1mr,I2mr</option>), or D cache misses
(<option>--sort=D1mr,D2mr</option>), or L2 misses
(<option>--sort=D2mr,I2mr</option>).</para>
</listitem>

<listitem id="show">
<para><option>--show=A,B,C</option> [default:
all, using order in
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput>]</para>
<para>Specifies which events to show (and the column
order). Default is to use all present in the
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file (and
use the order in the file).</para>
</listitem>

<listitem id="threshold">
<para><option>--threshold=X</option>
[default: 99%]</para>
<para>Sets the threshold for the function-by-function
summary. Functions are shown in sort order until those shown
together account for X% of the primary sort event. If
auto-annotating, this also affects which files are annotated.</para>

<para>Note: thresholds can be set for more than one of the
events by appending a colon and a number to any of the events
given to the <option>--sort</option> option
(no spaces, though). E.g. if you want to see
the functions that cover 99% of L2 read misses and 99% of L2
write misses, use this option:</para>
<para><option>--sort=D2mr:99,D2mw:99</option></para>
</listitem>

<listitem id="auto">
<para><option>--auto=no</option> [default]</para>
<para><option>--auto=yes</option></para>
<para>When enabled, automatically annotates every file that
is mentioned in the function-by-function summary that can be
found. Also gives a list of those that couldn't be found.</para>
</listitem>

<listitem id="context">
<para><option>--context=N</option> [default: 8]</para>
<para>Print N lines of context before and after each
annotated line. Avoids printing large sections of source
files that were not executed. Use a large number
(eg. 10000) to show all source lines.</para>
</listitem>

<listitem id="include">
<para><option>-I&lt;dir&gt;, --include=&lt;dir&gt;</option>
[default: empty string]</para>
<para>Adds a directory to the list in which to search for
files. Multiple <option>-I</option>/<option>--include</option>
options can be given to add multiple directories.</para>
</listitem>

</itemizedlist>
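
<para>The interplay of <option>--sort</option> and <option>--threshold</option> can be modelled in Python. This is an illustrative sketch, not cg_annotate's implementation. The helpers <computeroutput>parse_sort_spec</computeroutput> and <computeroutput>select_functions</computeroutput> are hypothetical and the costs are made up. Functions are ordered by their counts for the sort events, compared in turn. They are then listed until every event's cumulative percentage has reached its threshold. How events without an explicit <computeroutput>:N</computeroutput> are treated is an assumption of this sketch, noted in the code.</para>

<programlisting><![CDATA[
# Illustrative model of --sort and --threshold; NOT cg_annotate's code.
# parse_sort_spec() and select_functions() are hypothetical helpers.

def parse_sort_spec(spec, default_threshold=99.0):
    """Split a spec such as "D2mr:99,D2mw:99" into events and thresholds.

    Assumptions for this sketch: an event written without ":N" gets a
    threshold of 0 (it never holds up the cut-off), and if no event has
    an explicit threshold, the primary event uses default_threshold.
    """
    items = spec.split(",")
    events, thresholds = [], []
    for item in items:
        name, _, pct = item.partition(":")
        events.append(name)
        thresholds.append(float(pct) if pct else 0.0)
    if not any(":" in item for item in items):
        thresholds[0] = default_threshold
    return events, thresholds


def select_functions(costs, events, thresholds):
    """Return function names, in sort order, until every threshold is met.

    costs maps a "file:function" name to a dict of event counts.
    """
    # Highest counts first; ties on the first event are broken by the
    # second, and so on, because tuples compare element by element.
    order = sorted(costs, key=lambda fn: tuple(costs[fn][e] for e in events),
                   reverse=True)
    totals = [sum(c[e] for c in costs.values()) for e in events]
    running = [0] * len(events)
    shown = []
    for fn in order:
        if all(t == 0 or 100.0 * r / t >= th
               for r, t, th in zip(running, totals, thresholds)):
            break  # every cumulative percentage has been reached
        shown.append(fn)
        for i, e in enumerate(events):
            running[i] += costs[fn][e]
    return shown


# Made-up costs for three functions.
costs = {
    "a.c:main":   {"Ir": 900, "I1mr": 1},
    "b.c:helper": {"Ir": 95,  "I1mr": 3},
    "c.c:rare":   {"Ir": 5,   "I1mr": 0},
}
events, thresholds = parse_sort_spec("Ir,I1mr", default_threshold=95.0)
print(select_functions(costs, events, thresholds))  # main and helper only
]]></programlisting>

<para>In this model, raising the threshold to 100 (for example with a spec of <computeroutput>Ir:100</computeroutput>) would also list <computeroutput>c.c:rare</computeroutput>.</para>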



<sect2 id="cg-manual.annopts.warnings" xreflabel="Warnings">
<title>Warnings</title>

<para>There are a couple of situations in which
cg_annotate issues warnings.</para>

<itemizedlist>
<listitem>
<para>If a source file is more recent than the
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file.
This is because the information in
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> is only
recorded with line numbers, so if the line numbers change at
all in the source (eg. lines added, deleted, swapped), any
annotations will be incorrect.</para>
</listitem>
<listitem>
<para>If information is recorded about line numbers past the
end of a file. This can be caused by the above problem,
ie. shortening the source file while using an old
<computeroutput>cachegrind.out.&lt;pid&gt;</computeroutput> file. If
this happens, the figures for the bogus lines are printed
anyway (clearly marked as bogus) in case they are
important.</para>
</listitem>
</itemizedlist>
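
<para>The first warning amounts to a timestamp comparison. The Python sketch below assumes that file modification times are what gets compared. The helper <computeroutput>is_stale</computeroutput> is hypothetical rather than cg_annotate's code, and the file names are made up.</para>

<programlisting><![CDATA[
import os
import tempfile
import time

# Hypothetical helper modelling the "source newer than profile" warning.

def is_stale(source_path, profile_path):
    """True if the source file was modified after the profile was written."""
    return os.path.getmtime(source_path) > os.path.getmtime(profile_path)


# Demonstration with two temporary files and explicit timestamps.
with tempfile.TemporaryDirectory() as d:
    profile = os.path.join(d, "cachegrind.out.1234")
    source = os.path.join(d, "concord.c")
    for path in (profile, source):
        open(path, "w").close()
    now = time.time()
    os.utime(profile, (now - 60, now - 60))  # profile written a minute ago
    os.utime(source, (now, now))             # source edited just now
    stale = is_stale(source, profile)
print(stale)
]]></programlisting>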

</sect2>



<sect2 id="cg-manual.annopts.things-to-watch-out-for"
xreflabel="Things to watch out for">
<title>Things to watch out for</title>

<para>Some odd things that can occur during annotation:</para>

<itemizedlist>
<listitem>
<para>If annotating at the assembler level, you might see
something like this:</para>
<programlisting><![CDATA[
1    0    0  .    .    .  .    .    .    leal -12(%ebp),%eax
1    0    0  .    .    .  1    0    0    movl %eax,84(%ebx)
2    0    0  0    0    0  1    0    0    movl $1,-20(%ebp)
.    .    .  .    .    .  .    .    .    .align 4,0x90
1    0    0  .    .    .  .    .    .    movl $.LnrB,%eax
1    0    0  .    .    .  1    0    0    movl %eax,-16(%ebp)]]></programlisting>

<para>How can the third instruction be executed twice when
the others are executed only once? As it turns out, it
isn't. Here's a dump of the executable, using
<computeroutput>objdump -d</computeroutput>:</para>
<programlisting><![CDATA[
 8048f25:   8d 45 f4              lea    0xfffffff4(%ebp),%eax
 8048f28:   89 43 54              mov    %eax,0x54(%ebx)
 8048f2b:   c7 45 ec 01 00 00 00  movl   $0x1,0xffffffec(%ebp)
 8048f32:   89 f6                 mov    %esi,%esi
 8048f34:   b8 08 8b 07 08        mov    $0x8078b08,%eax
 8048f39:   89 45 f0              mov    %eax,0xfffffff0(%ebp)]]></programlisting>

<para>Notice the extra <computeroutput>mov
%esi,%esi</computeroutput> instruction. Where did this come
from? The GNU assembler inserted it to serve as the two
bytes of padding needed to align the <computeroutput>movl
$.LnrB,%eax</computeroutput> instruction on a four-byte
boundary, but pretended it didn't exist when adding debug
information. Thus when Valgrind reads the debug info it
thinks that the <computeroutput>movl
$0x1,0xffffffec(%ebp)</computeroutput> instruction covers the
address range 0x8048f2b--0x8048f33 by itself, and attributes
the counts for the <computeroutput>mov
%esi,%esi</computeroutput> to it.</para>
</listitem>

<listitem>
<para>Inlined functions can cause strange results in the
function-by-function summary. If a function
<computeroutput>inline_me()</computeroutput> is defined in
<filename>foo.h</filename> and inlined in the functions
<computeroutput>f1()</computeroutput>,
<computeroutput>f2()</computeroutput> and
<computeroutput>f3()</computeroutput> in
<filename>bar.c</filename>, there will not be a
<computeroutput>foo.h:inline_me()</computeroutput> function
entry. Instead, there will be separate function entries for
each inlining site, ie.
<computeroutput>foo.h:f1()</computeroutput>,
<computeroutput>foo.h:f2()</computeroutput> and
<computeroutput>foo.h:f3()</computeroutput>. To find the
total counts for
<computeroutput>foo.h:inline_me()</computeroutput>, add up
the counts from each entry.</para>

<para>The reason for this is that although the debug info
output by gcc indicates the switch from
<filename>bar.c</filename> to <filename>foo.h</filename>, it
doesn't indicate the name of the function in
<filename>foo.h</filename>, so Valgrind keeps using the old
one.</para>
</listitem>

<listitem>
<para>Sometimes, the same filename might be represented with
a relative name and with an absolute name in different parts
of the debug info, eg:
<filename>/home/user/proj/proj.h</filename> and
<filename>../proj.h</filename>. In this case, if you use
auto-annotation, the file will be annotated twice with the
counts split between the two.</para>
</listitem>

<listitem>
<para>Files with more than 65,535 lines cause difficulties
for the Stabs-format debug info reader. This is because the line
number in the <computeroutput>struct nlist</computeroutput>
defined in <filename>a.out.h</filename> under Linux is only a
16-bit value. Valgrind can handle some files with more than
65,535 lines correctly by making some guesses to identify
line number overflows. But some cases are beyond it, in
which case you'll get a warning message explaining that
annotations for the file might be incorrect.</para>

<para>If you are using gcc 3.1 or later, this is most likely
irrelevant, since gcc switched to using the more modern DWARF2
format by default at version 3.1. DWARF2 does not have any such
limitations on line numbers.</para>
</listitem>

<listitem>
<para>If you compile some files with
<option>-g</option> and some without, some
events that take place in a file without debug info could be
attributed to the last line of a file with debug info
(whichever one gets placed before the non-debug-info file in
the executable).</para>
</listitem>

</itemizedlist>
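
<para>Adding up the per-site entries for an inlined function is a simple aggregation. The Python sketch below uses the <filename>foo.h</filename>/<filename>bar.c</filename> names from the inlining example above, with made-up counts. The function <computeroutput>total_for_file</computeroutput> is a hypothetical illustration, not part of cg_annotate. It sums every entry whose file part is <filename>foo.h</filename>, which assumes that <computeroutput>inline_me()</computeroutput> is the only code from that header that was inlined.</para>

<programlisting><![CDATA[
# Illustrative only: made-up counts for the entries described above.

def total_for_file(entries, file_name):
    """Sum the counts of all 'file:function' entries from one file.

    Assumes inline_me() is the only code from that file that was
    inlined, so these entries together represent its total cost.
    """
    total = {}
    for name, counts in entries.items():
        file_part, _, _ = name.partition(":")
        if file_part == file_name:
            for event, n in counts.items():
                total[event] = total.get(event, 0) + n
    return total


entries = {
    "bar.c:f1()": {"Ir": 500, "D1mr": 4},
    "foo.h:f1()": {"Ir": 120, "D1mr": 1},
    "foo.h:f2()": {"Ir": 80,  "D1mr": 0},
    "foo.h:f3()": {"Ir": 40,  "D1mr": 2},
}
print(total_for_file(entries, "foo.h"))
]]></programlisting>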

<para>This list looks long, but these cases should be fairly
rare.</para>

</sect2>



<sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
<title>Accuracy</title>

<para>Valgrind's cache profiling has a number of
shortcomings:</para>

<itemizedlist>
<listitem>
<para>It doesn't account for kernel activity -- the effect of
system calls on the cache contents is ignored.</para>
</listitem>

<listitem>
<para>It doesn't account for other process activity.
This is probably desirable when considering a single
program.</para>
</listitem>

<listitem>
<para>It doesn't account for virtual-to-physical address
mappings. Hence the simulation is not a true
representation of what's happening in the
cache. Most caches are physically indexed, but Cachegrind
simulates caches using virtual addresses.</para>
</listitem>

<listitem>
<para>It doesn't account for cache misses not visible at the
instruction level, eg. those arising from TLB misses, or
speculative execution.</para>
</listitem>

<listitem>
<para>Valgrind schedules threads differently from how they
would be scheduled when running natively.
This could warp the results for threaded programs.</para>
</listitem>

<listitem>
<para>The x86/amd64 instructions <computeroutput>bts</computeroutput>,
<computeroutput>btr</computeroutput> and
<computeroutput>btc</computeroutput> will incorrectly be
counted as doing a data read if both the arguments are
registers, eg:</para>
<programlisting><![CDATA[
    btsl %eax, %edx]]></programlisting>

<para>This should only happen rarely.</para>
</listitem>

<listitem>
<para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
(e.g. <computeroutput>fsave</computeroutput>) are treated as
though they only access 16 bytes. These instructions seem to
be rare so hopefully this won't affect accuracy much.</para>
</listitem>

</itemizedlist>

<para>Another thing worth noting is that results are very sensitive.
Changing the size of the executable being profiled, or the sizes
of any of the shared libraries it uses, or even the length of their
file names, can perturb the results. Variations will be small, but
don't expect perfectly repeatable results if your program changes at
all.</para>

<para>More recent GNU/Linux distributions do address space
randomisation, in which identical runs of the same program have their
shared libraries loaded at different locations, as a security measure.
This also perturbs the results.</para>

<para>While these factors mean you shouldn't trust the results to
be super-accurate, hopefully they should be close enough to be
useful.</para>

</sect2>

</sect1>



				1142	<sect1 id="cg-manual.cg_merge" xreflabel="cg_merge">
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	1143	<title>Merging profiles with cg_merge</title>
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	1144
				1145	<para>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	1146	cg_merge is a simple program which
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	1147	reads multiple profile files, as created by cachegrind, merges them
				1148	together, and writes the results into another file in the same format.
				1149	You can then examine the merged results using
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	1150	<computeroutput>cg_annotate <filename></computeroutput>, as
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	1151	described above. The merging functionality might be useful if you
				1152	want to aggregate costs over multiple runs of the same program, or
				1153	from a single parallel run with multiple instances of the same
				1154	program.</para>
				1155
				1156	<para>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	1157	cg_merge is invoked as follows:
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	1158	</para>
				1159
				1160	<programlisting><![CDATA[
				1161	cg_merge -o outputfile file1 file2 file3 ...]]></programlisting>
				1162
				1163	<para>
				1164	It reads and checks <computeroutput>file1</computeroutput>, then read
				1165	and checks <computeroutput>file2</computeroutput> and merges it into
				1166	the running totals, then the same with
				1167	<computeroutput>file3</computeroutput>, etc. The final results are
				1168	written to <computeroutput>outputfile</computeroutput>, or to standard
				1169	out if no output file is specified.</para>

<para>
Costs are summed on a per-function, per-line and per-instruction
basis. Because of this, the order in which the input files are
specified does not matter, although you should take care to mention
each file only once, since any file mentioned twice will be added in
twice.</para>
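
<para>
The following Python sketch illustrates this kind of summation. It is
not cg_merge's actual implementation (cg_merge is a C program). It
assumes each input has already been parsed into a dictionary mapping a
hypothetical <computeroutput>(filename, fn_name, line_num)</computeroutput>
key to a list of counts, one per event. Because addition is commutative,
the input order makes no difference, while a profile passed twice
contributes its costs twice.
</para>

<programlisting><![CDATA[
def merge_profiles(profiles):
    """Sum per-line cost centres across several parsed profiles.

    Each profile is a dict (an assumed, illustrative representation)
    mapping (filename, fn_name, line_num) -> [count, ...], where all
    count lists have one entry per event.
    """
    merged = {}
    for profile in profiles:
        for key, counts in profile.items():
            # Start a fresh zeroed total the first time a key is seen,
            # so the input dicts are never modified.
            totals = merged.setdefault(key, [0] * len(counts))
            for i, count in enumerate(counts):
                totals[i] += count
    return merged


a = {("foo.c", "main", 10): [5, 1], ("foo.c", "main", 11): [2, 0]}
b = {("foo.c", "main", 10): [3, 0]}

assert merge_profiles([a, b]) == merge_profiles([b, a])  # order-independent
print(merge_profiles([a, a])[("foo.c", "main", 10)])     # [10, 2]: added twice
]]></programlisting>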

<para>
cg_merge does not attempt to check
that the input files come from runs of the same executable. It will
happily merge together profile files from completely unrelated
programs. It does, however, check that the
<computeroutput>events:</computeroutput> lines of all the inputs are
identical, so as to ensure that the addition of costs makes sense.
For example, it would be nonsensical for it to add a number indicating
D1 read references to a number from a different file indicating L2
write misses.</para>
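
<para>
A check of this kind might look like the Python sketch below. It is an
illustration only: the <computeroutput>MergeError</computeroutput> class
and the function names are hypothetical, and unlike a plain text
comparison this version compares the parsed event names, so it ignores
differences in whitespace.
</para>

<programlisting><![CDATA[
class MergeError(Exception):
    """Raised when input profiles cannot be meaningfully combined."""


def parse_events(line):
    """Return the event names from an 'events:' line."""
    if not line.startswith("events:"):
        raise MergeError(f"not an events line: {line!r}")
    return line[len("events:"):].split()


def check_events_match(inputs):
    """inputs: list of (input_name, events_line) pairs.

    Returns the shared list of event names, or raises MergeError naming
    the first input whose events differ from those of the first input.
    """
    if not inputs:
        raise MergeError("no input files")
    first_name, first_line = inputs[0]
    expected = parse_events(first_line)
    for name, line in inputs[1:]:
        if parse_events(line) != expected:
            raise MergeError(f"{name}: events differ from {first_name}")
    return expected
]]></programlisting>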

<para>
A number of other syntax and sanity checks are done whilst reading the
inputs. cg_merge will stop and
attempt to print a helpful error message if any of the input files
fail these checks.</para>

</sect1>

<sect1 id="cg-manual.acting-on"
xreflabel="Acting on Cachegrind's information">
<title>Acting on Cachegrind's information</title>
<para>
So, you've managed to profile your program with Cachegrind. Now what?
What's the best way to actually act on the information it provides to speed
up your program? Here are some rules of thumb that we have found to be
useful.</para>

<para>
First of all, the global hit/miss rate numbers are not that useful. If you
have multiple programs or multiple runs of a program, comparing the numbers
might identify outliers worthy of closer investigation.
Otherwise, they're not enough to act on.</para>

<para>
The line-by-line source code annotations are much more useful. In our
experience, the best place to start is the
<computeroutput>Ir</computeroutput> numbers. They simply measure how many
instructions were executed for each line, and don't include any cache
information, but they can still be very useful for identifying
bottlenecks.</para>

<para>
After that, we have found that L2 misses are typically a much bigger source
of slow-downs than L1 misses. So it's worth looking for any snippets of
code that cause a high proportion of the L2 misses. If you find any, it's
still not always easy to work out how to improve things. You need to have a
reasonable understanding of how caches work, the principles of locality, and
your program's data access patterns. Improving things may require
redesigning a data structure, for example.</para>
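
<para>
As one way of finding such snippets, the Python sketch below ranks source
lines by their share of all L2 misses. It assumes the per-line costs have
been parsed into a dictionary keyed by a hypothetical
<computeroutput>(filename, fn_name, line_num)</computeroutput> tuple, and
it assumes the event names <computeroutput>I2mr</computeroutput>,
<computeroutput>D2mr</computeroutput> and
<computeroutput>D2mw</computeroutput> for L2 instruction-read, data-read
and data-write misses. Check the <computeroutput>events:</computeroutput>
line of your own output file, since the names and their order may differ.
</para>

<programlisting><![CDATA[
# Assumed event names and order; read them from your file's
# "events:" line rather than relying on this list.
EVENTS = ["Ir", "I1mr", "I2mr", "Dr", "D1mr", "D2mr", "Dw", "D1mw", "D2mw"]
L2_MISS_EVENTS = ["I2mr", "D2mr", "D2mw"]


def top_l2_miss_lines(costs, events=EVENTS, limit=5):
    """Return up to `limit` (key, share) pairs, biggest L2-miss share first.

    costs maps a (filename, fn_name, line_num) key to a list of counts
    in the same order as `events`; missing trailing counts count as zero.
    Lines with no L2 misses are left out of the result.
    """
    indices = [events.index(name) for name in L2_MISS_EVENTS]
    misses = {
        key: sum(counts[i] for i in indices if i < len(counts))
        for key, counts in costs.items()
    }
    total = sum(misses.values())
    if total == 0:
        return []
    ranked = sorted(misses.items(), key=lambda kv: kv[1], reverse=True)
    return [(key, m / total) for key, m in ranked[:limit] if m > 0]
]]></programlisting>

<para>
Ranking by share of the total, rather than by raw count, makes it easier
to see whether a handful of lines dominate the misses.
</para>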

<para>
In short, Cachegrind can tell you where some of the bottlenecks in your code
are, but it can't tell you how to fix them. You have to work that out for
yourself. But at least you have the information!
</para>

</sect1>

<sect1 id="cg-manual.impl-details"
xreflabel="Implementation details">
<title>Implementation details</title>
<para>
This section covers details you don't need to know in order to
use Cachegrind, but which may be of interest to some people.
</para>

<sect2 id="cg-manual.impl-details.how-cg-works"
xreflabel="How Cachegrind works">
<title>How Cachegrind works</title>
<para>The best reference for understanding how Cachegrind works is chapter 3 of
"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
is available on the <ulink url="&vg-pubs;">Valgrind publications
page</ulink>.</para>
</sect2>

<sect2 id="cg-manual.impl-details.file-format"
xreflabel="Cachegrind output file format">
<title>Cachegrind output file format</title>
<para>The file format is fairly straightforward, basically giving the
cost centre for every line, grouped by files and
functions. Total counts (e.g. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra additions can make a
sizeable difference.</para>
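
<para>
The idea of deriving totals afterwards can be sketched in Python as
follows. The nested-dictionary layout (file, then function, then line
number, mapping to a list of per-event counts) is an assumption made for
illustration. The point is that the per-event totals come from a single
pass over the collected cost centres, so the simulation itself never has
to update them.
</para>

<programlisting><![CDATA[
def compute_totals(cost_centres, num_events):
    """Sum every per-line cost centre into one total per event.

    cost_centres is an illustrative nested dict:
        {filename: {fn_name: {line_num: [count, ...]}}}
    Count lists may be shorter than num_events; missing entries are zero.
    """
    totals = [0] * num_events
    for functions in cost_centres.values():
        for lines in functions.values():
            for counts in lines.values():
                for i, count in enumerate(counts):
                    totals[i] += count
    return totals


cc = {"a.c": {"main": {3: [10, 1], 4: [5]}, "f": {9: [2, 2]}}}
print(compute_totals(cc, 2))  # [17, 3]
]]></programlisting>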

<para>The file format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
<listitem>
<para><computeroutput>non_nl_string</computeroutput> is any
string not containing a newline.</para>
</listitem>
<listitem>
<para><computeroutput>cmd</computeroutput> is a string holding the
command line of the profiled program.</para>
</listitem>
<listitem>
<para><computeroutput>event</computeroutput> is a string containing
no whitespace.</para>
</listitem>
<listitem>
<para><computeroutput>filename</computeroutput> and
<computeroutput>fn_name</computeroutput> are strings.</para>
</listitem>
<listitem>
<para><computeroutput>num</computeroutput> and
<computeroutput>line_num</computeroutput> are decimal
numbers.</para>
</listitem>
<listitem>
<para><computeroutput>ws</computeroutput> is whitespace.</para>
</listitem>
</itemizedlist>
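
<para>
To make the grammar concrete, here is a Python sketch that classifies a
single line of a profile file according to the rules above. The kind
names and the regular expressions are this sketch's own choices, not part
of Cachegrind. Because <computeroutput>ws?</computeroutput> is optional
after <computeroutput>line_num</computeroutput>, the sketch reads the
longest run of leading digits as the line number.
</para>

<programlisting><![CDATA[
import re

# One pattern per line kind in the grammar. A count_line is the only
# kind that begins with a digit rather than a keyword.
LINE_KINDS = [
    ("desc", re.compile(r"desc:\s*(.*)")),
    ("cmd", re.compile(r"cmd:\s*(.*)")),
    ("events", re.compile(r"events:\s*(.*)")),
    ("summary", re.compile(r"summary:\s*(.*)")),
    ("file", re.compile(r"fl=(.*)")),
    ("fn", re.compile(r"fn=(.*)")),
    ("count", re.compile(r"(\d+)\s*(.*)")),
]


def classify(line):
    """Return (kind, captured_groups) for one line of a profile file."""
    line = line.rstrip("\n")
    for kind, pattern in LINE_KINDS:
        match = pattern.fullmatch(line)
        if match:
            return kind, match.groups()
    raise ValueError(f"unrecognised line: {line!r}")


print(classify("fl=foo.c"))    # ('file', ('foo.c',))
print(classify("12 3 . 4\n"))  # ('count', ('12', '3 . 4'))
]]></programlisting>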

<para>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing simulation-specific
information, e.g. the cache configuration used for a cache
simulation.</para>

<para>More than one line of information can be presented for each
file/fn/line number. In such cases, the counts for the named events
will be accumulated.</para>

<para>Counts can be "." to represent zero. This makes the files easier for
humans to read.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If a
<computeroutput>count_line</computeroutput> has fewer counts, cg_annotate
treats the missing ones as though they were "." entries. This saves space.
</para>
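
<para>
The handling of "." and of short lines can be expressed as a small Python
helper. This is a sketch of the rules just described, not code taken from
cg_annotate, and the function names are hypothetical.
</para>

<programlisting><![CDATA[
def parse_count(token):
    """Convert one count token to an int; "." means zero."""
    if token == ".":
        return 0
    if not (token.isascii() and token.isdigit()):
        raise ValueError(f"bad count: {token!r}")
    return int(token)


def parse_counts(tokens, num_events):
    """Return exactly num_events ints, padding missing counts with zero.

    Raises ValueError if there are more counts than events.
    """
    if len(tokens) > num_events:
        raise ValueError(f"{len(tokens)} counts but only {num_events} events")
    counts = [parse_count(t) for t in tokens]
    return counts + [0] * (num_events - len(counts))


print(parse_counts("5 . 3".split(), 4))  # [5, 0, 3, 0]
]]></programlisting>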

<para>A <computeroutput>file_line</computeroutput> changes the
current file name. A <computeroutput>fn_line</computeroutput>
changes the current function name. A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. A
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>
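
<para>
The following Python sketch shows one way to track this context while
reading a file. It keeps the current file and function names, rejects a
<computeroutput>count_line</computeroutput> that arrives before both are
known, and accumulates repeated entries for the same file/fn/line number
as described earlier. The function name and the dictionary
representation are assumptions made for illustration; non-data lines
such as <computeroutput>events:</computeroutput> and
<computeroutput>summary:</computeroutput> are simply skipped.
</para>

<programlisting><![CDATA[
def read_cost_centres(lines):
    """Accumulate count_lines into {(filename, fn_name, line_num): [counts]}."""
    current_file = current_fn = None
    costs = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("fl="):
            current_file = line[len("fl="):]
        elif line.startswith("fn="):
            current_fn = line[len("fn="):]
        elif line[:1].isdigit():
            if current_file is None or current_fn is None:
                raise ValueError(f"count_line before fl=/fn= context: {line!r}")
            line_num, *tokens = line.split()
            counts = [0 if t == "." else int(t) for t in tokens]
            key = (current_file, current_fn, int(line_num))
            totals = costs.setdefault(key, [])
            # Widen the stored list if this line carries more counts.
            totals.extend([0] * (len(counts) - len(totals)))
            for i, count in enumerate(counts):
                totals[i] += count
    return costs


example = [
    "events: Ir Dr",
    "fl=a.c",
    "fn=main",
    "3 10 .",
    "3 5 2",      # same file/fn/line: accumulated with the previous line
    "fn=helper",  # fl= stays a.c; only the function changes
    "7 1",
    "summary: 16 2",
]
print(read_cost_centres(example))
# {('a.c', 'main', 3): [15, 2], ('a.c', 'helper', 7): [1]}
]]></programlisting>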

<para>Each <computeroutput>file_line</computeroutput> will normally be
immediately followed by a <computeroutput>fn_line</computeroutput>, but it
doesn't have to be.</para>


</sect2>

</sect1>
</chapter>