Blame - cachegrind/docs/cg-manual.xml - platform/external/valgrind

njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1	<?xml version="1.0"?> <!-- -*- sgml -*- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
sewardj	7aeb10f	2006-12-10 02:59:16 +0000	[diff] [blame]	3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
				4	[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	5
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	6
njn	05a8917	2009-07-29 02:36:21 +0000	[diff] [blame]	7	<chapter id="cg-manual" xreflabel="Cachegrind: a cache and branch-prediction profiler">
				8	<title>Cachegrind: a cache and branch-prediction profiler</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	9
				10	<para>To use this tool, you must specify
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	11	<option>--tool=cachegrind</option> on the
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	12	Valgrind command line.</para>
				13
njn	05a8917	2009-07-29 02:36:21 +0000	[diff] [blame]	14	<sect1 id="cg-manual.overview" xreflabel="Overview">
				15	<title>Overview</title>
				16
				17	<para>Cachegrind simulates how your program interacts with a machine's cache
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	18	hierarchy and (optionally) branch predictor. It simulates a machine with
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	19	independent first-level instruction and data caches (I1 and D1), backed by a
				20	unified second-level cache (L2). This exactly matches the configuration of
				21	many modern machines.</para>
				22
				23	<para>However, some modern machines have three levels of cache. For these
				24	machines (in the cases where Cachegrind can auto-detect the cache
				25	configuration) Cachegrind simulates the first-level and third-level caches.
				26	The reason for this choice is that the L3 cache has the most influence on
				27	runtime, as it masks accesses to main memory. Furthermore, the L1 caches
				28	often have low associativity, so simulating them can detect cases where the
				29	code interacts badly with this cache (eg. traversing a matrix column-wise
				30	with the row length being a power of 2).</para>
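
<para>The column-wise case can be illustrated with a small calculation. The
following Python sketch is not part of Cachegrind. It assumes a simple
set-associative cache in which the set index is the line address modulo the
number of sets; the helper name <computeroutput>sets_touched</computeroutput>
and the cache parameters are chosen purely for illustration. It counts how
many distinct cache sets are touched when walking down one column of a
matrix.</para>

<programlisting><![CDATA[
def sets_touched(row_bytes, rows, cache_bytes=65536, line_bytes=64, ways=2):
    """Count the distinct cache sets hit when reading one element per row."""
    num_sets = cache_bytes // (line_bytes * ways)
    touched = set()
    for r in range(rows):
        addr = r * row_bytes                      # address of column 0 in row r
        touched.add((addr // line_bytes) % num_sets)
    return len(touched)

# Row length that is a power of 2, versus the same row padded by one cache line.
power_of_two = sets_touched(row_bytes=4096, rows=512)
padded = sets_touched(row_bytes=4096 + 64, rows=512)
print(power_of_two, padded)]]></programlisting>

<para>Under these assumptions the power-of-2 row length maps all 512 rows
onto only 8 sets, whereas the padded row length spreads them over all 512
sets. With 2-way associativity, 8 sets can hold just 16 lines, so the
column walk keeps evicting its own data.</para>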
				31
				32	<para>Therefore, Cachegrind always refers to the I1, D1 and LL (last-level)
				33	caches.</para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	34
				35	<para>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	36	Cachegrind gathers the following statistics (the abbreviation used for each
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	37	statistic is given in parentheses):</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	38	<itemizedlist>
				39	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	40	<para>I cache reads (<computeroutput>Ir</computeroutput>,
				41	which equals the number of instructions executed),
				42	I1 cache read misses (<computeroutput>I1mr</computeroutput>) and
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	43	LL cache instruction read misses (<computeroutput>ILmr</computeroutput>).
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	44	</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	45	</listitem>
				46	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	47	<para>D cache reads (<computeroutput>Dr</computeroutput>, which
				48	equals the number of memory reads),
				49	D1 cache read misses (<computeroutput>D1mr</computeroutput>), and
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	50	LL cache data read misses (<computeroutput>DLmr</computeroutput>).
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	51	</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	52	</listitem>
				53	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	54	<para>D cache writes (<computeroutput>Dw</computeroutput>, which equals
				55	the number of memory writes),
				56	D1 cache write misses (<computeroutput>D1mw</computeroutput>), and
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	57	LL cache data write misses (<computeroutput>DLmw</computeroutput>).
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	58	</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	59	</listitem>
sewardj	8badbaa	2007-05-08 09:20:25 +0000	[diff] [blame]	60	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	61	<para>Conditional branches executed (<computeroutput>Bc</computeroutput>) and
				62	conditional branches mispredicted (<computeroutput>Bcm</computeroutput>).
				63	</para>
sewardj	8badbaa	2007-05-08 09:20:25 +0000	[diff] [blame]	64	</listitem>
				65	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	66	<para>Indirect branches executed (<computeroutput>Bi</computeroutput>) and
				67	indirect branches mispredicted (<computeroutput>Bim</computeroutput>).
				68	</para>
sewardj	8badbaa	2007-05-08 09:20:25 +0000	[diff] [blame]	69	</listitem>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	70	</itemizedlist>
				71
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	72	<para>Note that D1 total misses is given by
				73	<computeroutput>D1mr</computeroutput> +
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	74	<computeroutput>D1mw</computeroutput>, and that LL total
				75	misses is given by <computeroutput>ILmr</computeroutput> +
				76	<computeroutput>DLmr</computeroutput> +
				77	<computeroutput>DLmw</computeroutput>.
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	78	</para>
				79
njn	05a8917	2009-07-29 02:36:21 +0000	[diff] [blame]	80	<para>These statistics are presented for the entire program and for each
				81	function in the program. You can also annotate each line of source code in
				82	the program with the counts that were caused directly by it.</para>
				83
njn	c8cccb1	2005-07-25 23:30:24 +0000	[diff] [blame]	84	<para>On a modern machine, an L1 miss will typically cost
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	85	around 10 cycles, an LL miss can cost as much as 200
sewardj	8badbaa	2007-05-08 09:20:25 +0000	[diff] [blame]	86	cycles, and a mispredicted branch costs in the region of 10
				87	to 30 cycles. Detailed cache and branch profiling can be very useful
njn	05a8917	2009-07-29 02:36:21 +0000	[diff] [blame]	88	for understanding how your program interacts with the machine and thus how
				89	to make it faster.</para>
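
<para>These rough costs can be turned into a back-of-the-envelope estimate.
The Python sketch below is only an illustration, not something Cachegrind
computes: the function name
<computeroutput>estimated_stall_cycles</computeroutput> is hypothetical, the
per-event costs are the approximate figures quoted above (with 20 cycles
assumed as a midpoint for a mispredicted branch), and the miss counts are
taken from the sample summary shown later in this chapter. It treats every
cost as additive, which real hardware does not guarantee.</para>

<programlisting><![CDATA[
def estimated_stall_cycles(l1_misses, ll_misses, mispredicts=0,
                           l1_cost=10, ll_cost=200, mispredict_cost=20):
    """Crude additive estimate of cycles lost to misses and mispredictions."""
    return (l1_misses * l1_cost
            + ll_misses * ll_cost
            + mispredicts * mispredict_cost)

# I1 misses + D1 misses, and LL misses, from the sample summary output.
l1_misses = 276 + 41185
ll_misses = 23360
stall = estimated_stall_cycles(l1_misses, ll_misses)
print(stall)]]></programlisting>

<para>Because 200 cycles is quoted as an upper figure for an LL miss, the
result is best read as a pessimistic bound rather than a prediction.</para>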
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	90
				91	<para>Also, since one instruction cache read is performed per
				92	instruction executed, you can find out how many instructions are
njn	05a8917	2009-07-29 02:36:21 +0000	[diff] [blame]	93	executed per line, which can be useful for traditional profiling.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	94
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	95	</sect1>
				96
				97
				98
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	99	<sect1 id="cg-manual.profile"
				100	xreflabel="Using Cachegrind, cg_annotate and cg_merge">
				101	<title>Using Cachegrind, cg_annotate and cg_merge</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	102
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	103	<para>First off, as for normal Valgrind use, you probably want to
				104	compile with debugging info (the
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	105	<option>-g</option> option). But by contrast with
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	106	normal Valgrind use, you probably do want to turn
				107	optimisation on, since you should profile your program as it will
				108	be normally run.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	109
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	110	<para>Then, you need to run Cachegrind itself to gather the profiling
				111	information, and then run cg_annotate to get a detailed presentation of that
				112	information. As an optional intermediate step, you can use cg_merge to sum
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	113	together the outputs of multiple Cachegrind runs into a single file which
				114	you then use as the input for cg_annotate. Alternatively, you can use
				115	cg_diff to difference the outputs of two Cachegrind runs into a single file
				116	which you then use as the input for cg_annotate.</para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	117
				118
				119	<sect2 id="cg-manual.running-cachegrind" xreflabel="Running Cachegrind">
				120	<title>Running Cachegrind</title>
				121
				122	<para>To run Cachegrind on a program <filename>prog</filename>, run:</para>
				123	<screen><![CDATA[
				124	valgrind --tool=cachegrind prog
				125	]]></screen>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	126
				127	<para>The program will execute (slowly). Upon completion,
				128	summary statistics that look like this will be printed:</para>
				129
				130	<programlisting><![CDATA[
				131	==31751== I refs: 27,742,716
				132	==31751== I1 misses: 276
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	133	==31751== LLi misses: 275
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	134	==31751== I1 miss rate: 0.0%
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	135	==31751== LLi miss rate: 0.0%
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	136	==31751==
				137	==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
				138	==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	139	==31751== LLd misses: 23,085 ( 3,987 rd + 19,098 wr)
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	140	==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	141	==31751== LLd miss rate: 0.1% ( 0.0% + 0.4%)
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	142	==31751==
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	143	==31751== LL misses: 23,360 ( 4,262 rd + 19,098 wr)
				144	==31751== LL miss rate: 0.0% ( 0.0% + 0.4%)]]></programlisting>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	145
				146	<para>Cache accesses for instruction fetches are summarised
				147	first, giving the number of fetches made (this is the number of
				148	instructions executed, which can be useful to know in its own
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	149	right), the number of I1 misses, and the number of LL instruction
				150	(<computeroutput>LLi</computeroutput>) misses.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	151
				152	<para>Cache accesses for data follow. The information is similar
				153	to that of the instruction fetches, except that the values are
				154	also shown split between reads and writes (note each row's
				155	<computeroutput>rd</computeroutput> and
				156	<computeroutput>wr</computeroutput> values add up to the row's
				157	total).</para>
				158
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	159	<para>Combined instruction and data figures for the LL cache
				160	follow that. Note that the LL miss rate is computed relative to the total
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	161	number of memory accesses, not the number of L1 misses. I.e. it is
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	162	<computeroutput>(ILmr + DLmr + DLmw) / (Ir + Dr + Dw)</computeroutput>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	163	not
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	164	<computeroutput>(ILmr + DLmr + DLmw) / (I1mr + D1mr + D1mw)</computeroutput>.
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	165	</para>
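
<para>The difference between the two formulas is easy to see with the
figures from the sample summary above. The following Python sketch simply
evaluates both expressions; the variable names are illustrative and the
second ratio is shown only for contrast, since Cachegrind does not print
it.</para>

<programlisting><![CDATA[
# Event counts copied from the sample summary output.
c = {"Ir": 27742716, "I1mr": 276, "ILmr": 275,
     "Dr": 10955517, "D1mr": 21905, "DLmr": 3987,
     "Dw": 4474773, "D1mw": 19280, "DLmw": 19098}

ll_misses = c["ILmr"] + c["DLmr"] + c["DLmw"]
all_accesses = c["Ir"] + c["Dr"] + c["Dw"]
l1_misses = c["I1mr"] + c["D1mr"] + c["D1mw"]

reported_rate = ll_misses / all_accesses      # what Cachegrind reports
per_l1_miss_rate = ll_misses / l1_misses      # NOT what Cachegrind reports

print(f"LL miss rate (reported form): {reported_rate:.4%}")
print(f"LL misses per L1 miss:        {per_l1_miss_rate:.1%}")]]></programlisting>

<para>The reported rate is a small fraction of one percent and is shown as
<computeroutput>0.0%</computeroutput> in the summary, whereas more than half
of the L1 misses in this run also miss in the LL cache.</para>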
				166
				167	<para>Branch prediction statistics are not collected by default.
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	168	To do so, add the option <option>--branch-sim=yes</option>.</para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	169
				170	</sect2>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	171
				172
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	173	<sect2 id="cg-manual.outputfile" xreflabel="Output File">
				174	<title>Output File</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	175
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	176	<para>As well as printing summary information, Cachegrind also writes
				177	more detailed profiling information to a file. By default this file is named
				178	<filename>cachegrind.out.&lt;pid&gt;</filename> (where
				179	<filename>&lt;pid&gt;</filename> is the program's process ID), but its name
				180	can be changed with the <option>--cachegrind-out-file</option> option. This
				181	file is human-readable, but is intended to be interpreted by the
				182	accompanying program cg_annotate, described in the next section.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	183
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	184	<para>The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix
de	7e109d1	2005-11-18 22:09:58 +0000	[diff] [blame]	185	on the output file name serves two purposes. Firstly, it means you
				186	don't have to rename old log files that you don't want to overwrite.
				187	Secondly, and more importantly, it allows correct profiling with the
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	188	<option>--trace-children=yes</option> option of
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	189	programs that spawn child processes.</para>
				190
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	191	<para>The output file can be big, many megabytes for large applications
				192	built with full debugging information.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	193
				194	</sect2>
				195
				196
				197
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	198	<sect2 id="cg-manual.running-cg_annotate" xreflabel="Running cg_annotate">
				199	<title>Running cg_annotate</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	200
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	201	<para>Before using cg_annotate,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	202	it is worth widening your window to be at least 120 characters
				203	wide if possible, as the output lines can be quite long.</para>
				204
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	205	<para>To get a function-by-function summary, run:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	206
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	207	<screen>cg_annotate &lt;filename&gt;</screen>
				208
				209	<para>on a Cachegrind output file.</para>
				210
				211	</sect2>
				212
				213
				214	<sect2 id="cg-manual.the-output-preamble" xreflabel="The Output Preamble">
				215	<title>The Output Preamble</title>
				216
				217	<para>The first part of the output looks like this:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	218
				219	<programlisting><![CDATA[
				220	--------------------------------------------------------------------------------
				221	I1 cache: 65536 B, 64 B, 2-way associative
				222	D1 cache: 65536 B, 64 B, 2-way associative
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	223	LL cache: 262144 B, 64 B, 8-way associative
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	224	Command: concord vg_to_ucode.c
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	225	Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
				226	Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
				227	Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	228	Threshold: 99%
				229	Chosen for annotation:
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	230	Auto-annotation: off
				231	]]></programlisting>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	232
				233
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	234	<para>This is a summary of the annotation options:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	235
				236	<itemizedlist>
				237
				238	<listitem>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	239	<para>I1 cache, D1 cache, LL cache: the cache configuration, so
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	240	you know the configuration with which these results were
				241	obtained.</para>
				242	</listitem>
				243
				244	<listitem>
				245	<para>Command: the command line invocation of the program
				246	under examination.</para>
				247	</listitem>
				248
				249	<listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	250	<para>Events recorded: which events were recorded.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	251
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	252	</listitem>
				253
				254	<listitem>
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	255	<para>Events shown: the events shown, which is a subset of the events
				256	gathered. This can be adjusted with the
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	257	<option>--show</option> option.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	258	</listitem>
				259
				260	<listitem>
				261	<para>Event sort order: the sort order in which functions are
				262	shown. For example, in this case the functions are sorted
				263	from highest <computeroutput>Ir</computeroutput> counts to
				264	lowest. If two functions have identical
				265	<computeroutput>Ir</computeroutput> counts, they will then be
				266	sorted by <computeroutput>I1mr</computeroutput> counts, and
				267	so on. This order can be adjusted with the
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	268	<option>--sort</option> option.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	269
				270	<para>Note that this dictates the order the functions appear.
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	271	It is <emphasis>not</emphasis> the order in which the columns
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	272	appear; that is dictated by the "events shown" line (and can
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	273	be changed with the <option>--show</option>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	274	option).</para>
				275	</listitem>
				276
				277	<listitem>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	278	<para>Threshold: cg_annotate
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	279	by default omits functions that cause very low counts
				280	to avoid drowning you in information. In this case,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	281	cg_annotate shows summaries for the functions that account for
				282	99% of the <computeroutput>Ir</computeroutput> counts;
				283	<computeroutput>Ir</computeroutput> is chosen as the
				284	threshold event since it is the primary sort event. The
				285	threshold can be adjusted with the
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	286	<option>--threshold</option>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	287	option.</para>
				288	</listitem>
				289
				290	<listitem>
				291	<para>Chosen for annotation: names of files specified
				292	manually for annotation; in this case none.</para>
				293	</listitem>
				294
				295	<listitem>
				296	<para>Auto-annotation: whether auto-annotation was requested
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	297	via the <option>--auto=yes</option>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	298	option. In this case no.</para>
				299	</listitem>
				300
				301	</itemizedlist>
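
<para>The interaction between the sort order and the threshold can be
sketched in a few lines of Python. This is one plausible reading of the
description above, not cg_annotate's actual code: the function name
<computeroutput>sort_and_cut</computeroutput>, the sample function names and
their counts are all made up for illustration. Functions are ordered by
their tuple of event counts (first event, then second, and so on), and
functions are listed until those already shown account for the threshold
percentage of the primary sort event.</para>

<programlisting><![CDATA[
def sort_and_cut(functions, threshold_pct):
    """functions: list of (name, (primary_count, secondary_count, ...))."""
    # Tuples compare element by element, so ties on the first event
    # are broken by the second event, and so on.
    ordered = sorted(functions, key=lambda item: item[1], reverse=True)
    total = sum(counts[0] for _, counts in functions)
    shown, running = [], 0
    for name, counts in ordered:
        if running * 100 >= threshold_pct * total:
            break                     # threshold already reached
        shown.append(name)
        running += counts[0]
    return [name for name, _ in ordered], shown

# Hypothetical (Ir, I1mr) counts for three functions.
funcs = [("f_a", (50, 2)), ("f_b", (900, 1)), ("f_c", (50, 7))]
order, shown = sort_and_cut(funcs, threshold_pct=95)
print(order, shown)]]></programlisting>

<para>Here <computeroutput>f_c</computeroutput> sorts ahead of
<computeroutput>f_a</computeroutput> because the two tie on the first event
and <computeroutput>f_c</computeroutput> has the higher second count, and
<computeroutput>f_a</computeroutput> is omitted because the first two
functions already cover 95% of the primary event.</para>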
				302
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	303	</sect2>
				304
				305
				306	<sect2 id="cg-manual.the-global"
				307	xreflabel="The Global and Function-level Counts">
				308	<title>The Global and Function-level Counts</title>
				309
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	310	<para>Then follow summary statistics for the whole
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	311	program:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	312
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	313	<programlisting><![CDATA[
				314	--------------------------------------------------------------------------------
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	315	Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	316	--------------------------------------------------------------------------------
				317	27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS]]></programlisting>
				318
				319	<para>
				320	These are similar to the summary provided when Cachegrind finishes running.
				321	</para>
				322
				323	<para>Then come function-by-function statistics:</para>
				324
				325	<programlisting><![CDATA[
				326	--------------------------------------------------------------------------------
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	327	Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	328	--------------------------------------------------------------------------------
				329	8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
				330	5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
				331	2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
				332	2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
				333	2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
				334	1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
				335	897,991 51 51 897,831 95 30 62 1 1 ???:???
				336	598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
				337	598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
				338	598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
				339	446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
				340	341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
				341	320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
				342	298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
				343	149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
				344	149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
				345	95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
				346	85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue]]></programlisting>
				347
				348	<para>Each function
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	349	is identified by a
				350	<computeroutput>file_name:function_name</computeroutput> pair. If
				351	a column contains only a dot it means the function never performs
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	352	that event (e.g. the third row shows that
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	353	<computeroutput>strcmp()</computeroutput> contains no
				354	instructions that write to memory). The name
				355	<computeroutput>???</computeroutput> is used if the file name
				356	and/or function name could not be determined from debugging
				357	information. If most of the entries have the form
				358	<computeroutput>???:???</computeroutput> the program probably
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	359	wasn't compiled with <option>-g</option>.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	360
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	361	<para>It is worth noting that functions will come both from
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	362	the profiled program (e.g. <filename>concord.c</filename>)
				363	and from libraries (e.g. <filename>getc.c</filename>).</para>
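
<para>If you post-process this table yourself, the distinction between a dot
and a zero is worth preserving. The following Python sketch is an
illustration only: it assumes the nine default event columns shown above, in
that order, and the helper name
<computeroutput>parse_row</computeroutput> is hypothetical. It reads one row
of the function-by-function table, mapping a dot to
<computeroutput>None</computeroutput> and a number to an integer.</para>

<programlisting><![CDATA[
EVENTS = "Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw".split()

def parse_row(line):
    """Split one table row into (file_name, function_name, counts)."""
    fields = line.split()
    values = fields[:len(EVENTS)]
    location = " ".join(fields[len(EVENTS):])
    counts = {
        event: None if value == "." else int(value.replace(",", ""))
        for event, value in zip(EVENTS, values)
    }
    # Split at the first colon, so a function name containing "::" stays whole.
    file_name, _, function_name = location.partition(":")
    return file_name, function_name, counts

row = "2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp"
file_name, function_name, counts = parse_row(row)
print(file_name, function_name, counts["Dr"], counts["Dw"])]]></programlisting>

<para>For the <computeroutput>strcmp()</computeroutput> row the write events
come back as <computeroutput>None</computeroutput> (the function never
writes), which is different from a count of zero.</para>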
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	364
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	365	</sect2>
				366
				367
				368	<sect2 id="cg-manual.line-by-line" xreflabel="Line-by-line Counts">
				369	<title>Line-by-line Counts</title>
				370
				371	<para>There are two ways to annotate source files -- by specifying them
				372	manually as arguments to cg_annotate, or with the
				373	<option>--auto=yes</option> option. For example, running
				374	<filename>cg_annotate &lt;filename&gt; concord.c</filename> for our example
				375	produces the same output as above followed by an annotated version of
				376	<filename>concord.c</filename>, a section of which looks like:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	377
				378	<programlisting><![CDATA[
				379	--------------------------------------------------------------------------------
				380	-- User-annotated source: concord.c
				381	--------------------------------------------------------------------------------
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	382	Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	383
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	384	. . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
				385	3 1 1 . . . 1 0 0 {
				386	. . . . . . . . . FILE *file_ptr;
				387	. . . . . . . . . Word_Info *data;
				388	1 0 0 . . . 1 1 1 int line = 1, i;
				389	. . . . . . . . .
				390	5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
				391	. . . . . . . . .
				392	4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
				393	3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
				394	. . . . . . . . .
				395	. . . . . . . . . /* Open file, check it. */
				396	6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
				397	2 0 0 1 0 0 . . . if (!(file_ptr)) {
				398	. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
				399	1 1 1 . . . . . . exit(EXIT_FAILURE);
				400	. . . . . . . . . }
				401	. . . . . . . . .
				402	165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
				403	146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
				404	. . . . . . . . .
				405	4 0 0 1 0 0 2 0 0 free(data);
				406	4 0 0 1 0 0 2 0 0 fclose(file_ptr);
				407	3 0 0 2 0 0 . . . }]]></programlisting>
				408
				409	<para>(Although column widths are automatically minimised, a wide
				410	terminal is clearly useful.)</para>
				411
				412	<para>Each source file is clearly marked
				413	(<computeroutput>User-annotated source</computeroutput>) as
				414	having been chosen manually for annotation. If the file was
				415	found in one of the directories specified with the
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	416	<option>-I</option>/<option>--include</option> option, the directory
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	417	and file are both given.</para>
				418
				419	<para>Each line is annotated with its event counts. Events not
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	420	applicable for a line are represented by a dot. This is useful
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	421	for distinguishing between an event which cannot happen, and one
				422	which can but did not.</para>
				423
				424	<para>Sometimes only a small section of a source file is
sewardj	8d9fec5	2005-11-15 20:56:23 +0000	[diff] [blame]	425	executed. To minimise uninteresting output, cg_annotate only shows
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	426	annotated lines and lines within a small distance of annotated
				427	lines. Gaps are marked with the line numbers so you know which
				428	part of a file the shown code comes from, eg:</para>
				429
				430	<programlisting><![CDATA[
				431	(figures and code for line 704)
				432	-- line 704 ----------------------------------------
				433	-- line 878 ----------------------------------------
				434	(figures and code for line 878)]]></programlisting>
				435
				436	<para>The amount of context to show around annotated lines is
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	437	controlled by the <option>--context</option>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	438	option.</para>
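
<para>The effect of the context setting can be modelled as follows. This
Python sketch is a simplification under stated assumptions, not
cg_annotate's real logic: the helper names
<computeroutput>lines_to_show</computeroutput> and
<computeroutput>gaps</computeroutput>, the annotated line numbers and the
context value of 4 are all hypothetical. It shows every annotated line plus
a fixed number of lines either side, then reports where gap markers would
fall.</para>

<programlisting><![CDATA[
def lines_to_show(annotated, context, last_line):
    """Annotated lines plus `context` lines either side, clipped to the file."""
    shown = set()
    for n in annotated:
        first = max(1, n - context)
        last = min(last_line, n + context)
        shown.update(range(first, last + 1))
    return sorted(shown)

def gaps(shown):
    """Pairs (last line before a gap, first line after it)."""
    return [(a, b) for a, b in zip(shown, shown[1:]) if b - a > 1]

shown = lines_to_show(annotated=[700, 882], context=4, last_line=1000)
print(gaps(shown))]]></programlisting>

<para>With these inputs the only gap lies between lines 704 and 878, which
corresponds to the pair of markers in the example above.</para>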
				439
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	440	<para>To get automatic annotation, use the <option>--auto=yes</option> option.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	441	cg_annotate will automatically annotate every source file it can
				442	find that is mentioned in the function-by-function summary.
				443	Therefore, the files chosen for auto-annotation are affected by
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	444	the <option>--sort</option> and
				445	<option>--threshold</option> options. Each
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	446	source file is clearly marked (<computeroutput>Auto-annotated
				447	source</computeroutput>) as being chosen automatically. Any
				448	files that could not be found are mentioned at the end of the
				449	output, eg:</para>
				450
				451	<programlisting><![CDATA[
				452	------------------------------------------------------------------
				453	The following files chosen for auto-annotation could not be found:
				454	------------------------------------------------------------------
				455	getc.c
				456	ctype.c
				457	../sysdeps/generic/lockfile.c]]></programlisting>
				458
				459	<para>This is quite common for library files, since libraries are
				460	usually compiled with debugging information, but the source files
				461	are often not present on a system. If a file is chosen for
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	462	annotation both manually and automatically, it
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	463	is marked as <computeroutput>User-annotated
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	464	source</computeroutput>. Use the
				465	<option>-I</option>/<option>--include</option> option to tell Valgrind where
				466	to look for source files if the filenames found from the debugging
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	467	information aren't specific enough.</para>
				468
				469	<para>Beware that cg_annotate can take some time to digest large
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	470	<filename>cachegrind.out.&lt;pid&gt;</filename> files,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	471	e.g. 30 seconds or more. Also beware that auto-annotation can
				472	produce a lot of output if your program is large!</para>
				473
				474	</sect2>
				475
				476
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	477	<sect2 id="cg-manual.assembler" xreflabel="Annotating Assembly Code Programs">
				478	<title>Annotating Assembly Code Programs</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	479
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	480	<para>Valgrind can annotate assembly code programs too, or annotate
				481	the assembly code generated for your C program. Sometimes this is
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	482	useful for understanding what is really happening when an
				483	interesting line of C code is translated into multiple
				484	instructions.</para>
				485
				486	<para>To do this, you just need to assemble your
njn	85a38bc	2008-10-30 02:41:13 +0000	[diff] [blame]	487	<computeroutput>.s</computeroutput> files with assembly-level debug
njn	7316df2	2009-08-04 01:16:01 +0000	[diff] [blame]	488	information. You can compile with the <option>-S</option> option to translate C/C++
				489	programs to assembly code, and then assemble the assembly code files with
				490	<option>-g</option> to achieve this. You can then profile and annotate the
				491	assembly code source files in the same way as C/C++ source files.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	492
				493	</sect2>
				494
njn	7064fb2	2008-05-29 23:09:52 +0000	[diff] [blame]	495	<sect2 id="ms-manual.forkingprograms" xreflabel="Forking Programs">
				496	<title>Forking Programs</title>
				497	<para>If your program forks, the child will inherit all the profiling data that
				498	has been gathered for the parent.</para>
				499
				500	<para>If the output file format string (controlled by
				501	<option>--cachegrind-out-file</option>) does not contain <option>%p</option>,
				502	then the outputs from the parent and child will be intermingled in a single
				503	output file, which will almost certainly make it unreadable by
				504	cg_annotate.</para>
				505	</sect2>
				506
				507
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	508	<sect2 id="cg-manual.annopts.warnings" xreflabel="cg_annotate Warnings">
				509	<title>cg_annotate Warnings</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	510
				511	<para>There are a couple of situations in which
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	512	cg_annotate issues warnings.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	513
				514	<itemizedlist>
				515	<listitem>
				516	<para>If a source file is more recent than the
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	517	<filename>cachegrind.out.&lt;pid&gt;</filename> file.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	518	This is because the information in
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	519	<filename>cachegrind.out.&lt;pid&gt;</filename> is only
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	520	recorded with line numbers, so if the line numbers change at
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	521	all in the source (e.g. lines added, deleted, swapped), any
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	522	annotations will be incorrect.</para>
				523	</listitem>
				524	<listitem>
				525	<para>If information is recorded about line numbers past the
				526	end of a file. This can be caused by the above problem,
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	527	i.e. shortening the source file while using an old
				528	<filename>cachegrind.out.&lt;pid&gt;</filename> file. If
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	529	this happens, the figures for the bogus lines are printed
				530	anyway (clearly marked as bogus) in case they are
				531	important.</para>
				532	</listitem>
				533	</itemizedlist>
				534
				535	</sect2>
				536
				537
				538
sewardj	778d783	2007-11-22 01:21:56 +0000	[diff] [blame]	539	<sect2 id="cg-manual.annopts.things-to-watch-out-for"
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	540	xreflabel="Unusual Annotation Cases">
				541	<title>Unusual Annotation Cases</title>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	542
				543	<para>Some odd things that can occur during annotation:</para>
				544
				545	<itemizedlist>
				546	<listitem>
				547	<para>If annotating at the assembler level, you might see
				548	something like this:</para>
				549	<programlisting><![CDATA[
				550	1 0 0 . . . . . . leal -12(%ebp),%eax
				551	1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
				552	2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
				553	. . . . . . . . . .align 4,0x90
				554	1 0 0 . . . . . . movl $.LnrB,%eax
				555	1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)]]></programlisting>
				556
				557	<para>How can the third instruction be executed twice when
				558	the others are executed only once? As it turns out, it
				559	isn't. Here's a dump of the executable, using
				560	<computeroutput>objdump -d</computeroutput>:</para>
				561	<programlisting><![CDATA[
				562	8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
				563	8048f28: 89 43 54 mov %eax,0x54(%ebx)
				564	8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
				565	8048f32: 89 f6 mov %esi,%esi
				566	8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
				567	8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)]]></programlisting>
				568
				569	<para>Notice the extra <computeroutput>mov
				570	%esi,%esi</computeroutput> instruction. Where did this come
				571	from? The GNU assembler inserted it to serve as the two
				572	bytes of padding needed to align the <computeroutput>movl
				573	$.LnrB,%eax</computeroutput> instruction on a four-byte
				574	boundary, but pretended it didn't exist when adding debug
				575	information. Thus when Valgrind reads the debug info it
				576	thinks that the <computeroutput>movl
				577	$0x1,0xffffffec(%ebp)</computeroutput> instruction covers the
				578	address range 0x8048f2b--0x8048f33 by itself, and attributes
				579	the counts for the <computeroutput>mov
				580	%esi,%esi</computeroutput> to it.</para>
				581	</listitem>
				582
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	583	<!--
				584	I think this isn't true any more, not since cost centres were moved from
				585	being associated with instruction addresses to being associated with
				586	source line numbers.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	587	<listitem>
				588	<para>Inlined functions can cause strange results in the
				589	function-by-function summary. If a function
				590	<computeroutput>inline_me()</computeroutput> is defined in
				591	<filename>foo.h</filename> and inlined in the functions
				592	<computeroutput>f1()</computeroutput>,
				593	<computeroutput>f2()</computeroutput> and
				594	<computeroutput>f3()</computeroutput> in
				595	<filename>bar.c</filename>, there will not be a
				596	<computeroutput>foo.h:inline_me()</computeroutput> function
				597	entry. Instead, there will be separate function entries for
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	598	each inlining site, i.e.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	599	<computeroutput>foo.h:f1()</computeroutput>,
				600	<computeroutput>foo.h:f2()</computeroutput> and
				601	<computeroutput>foo.h:f3()</computeroutput>. To find the
				602	total counts for
				603	<computeroutput>foo.h:inline_me()</computeroutput>, add up
				604	the counts from each entry.</para>
				605
				606	<para>The reason for this is that although the debug info
njn	7316df2	2009-08-04 01:16:01 +0000	[diff] [blame]	607	output by GCC indicates the switch from
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	608	<filename>bar.c</filename> to <filename>foo.h</filename>, it
				609	doesn't indicate the name of the function in
				610	<filename>foo.h</filename>, so Valgrind keeps using the old
				611	one.</para>
				612	</listitem>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	613	-->
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	614
				615	<listitem>
				616	<para>Sometimes, the same filename might be represented with
				617	a relative name and with an absolute name in different parts
				618	of the debug info, e.g.
				619	<filename>/home/user/proj/proj.h</filename> and
				620	<filename>../proj.h</filename>. In this case, if you use
				621	auto-annotation, the file will be annotated twice with the
				622	counts split between the two.</para>
				623	</listitem>
				624
				625	<listitem>
				626	<para>Files with more than 65,535 lines cause difficulties
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	627	for the Stabs-format debug info reader. This is because the line
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	628	number in the <computeroutput>struct nlist</computeroutput>
				629	defined in <filename>a.out.h</filename> under Linux is only a
				630	16-bit value. Valgrind can handle some files with more than
				631	65,535 lines correctly by making some guesses to identify
				632	line number overflows. But some cases are beyond it, in
				633	which case you'll get a warning message explaining that
				634	annotations for the file might be incorrect.</para>
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	635
njn	7316df2	2009-08-04 01:16:01 +0000	[diff] [blame]	636	<para>If you are using GCC 3.1 or later, this is most likely
				637	irrelevant, since GCC switched to using the more modern DWARF2
sewardj	08e31e2	2007-05-23 21:58:33 +0000	[diff] [blame]	638	format by default at version 3.1. DWARF2 does not have any such
				639	limitations on line numbers.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	640	</listitem>
				641
				642	<listitem>
				643	<para>If you compile some files with
njn	7e5d4ed	2009-07-30 02:57:52 +0000	[diff] [blame]	644	<option>-g</option> and some without, some
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	645	events that take place in a file without debug info could be
				646	attributed to the last line of a file with debug info
				647	(whichever one gets placed before the non-debug-info file in
				648	the executable).</para>
				649	</listitem>
				650
				651	</itemizedlist>
				652
				653	<para>This list looks long, but these cases should be fairly
				654	rare.</para>
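<para>As a rough illustration of the Stabs limit described above, the
following C sketch shows what a 16-bit field retains of a larger line number.
The function name <computeroutput>stored_line</computeroutput> is made up for
this example, and it assumes the field is treated as unsigned with the high
bits simply discarded; it is not taken from Valgrind's debug info
reader.</para>

```c
#include <stdint.h>

/* Value left after storing `line` in a 16-bit field, assuming the field is
 * treated as unsigned and any higher bits are simply dropped.
 * (Hypothetical helper, for illustration only.) */
uint16_t stored_line(unsigned long line)
{
    return (uint16_t)(line & 0xFFFFu);
}
```

<para>Under those assumptions, line 70,000 would be recorded as 4,464, which
is why the reader has to guess where an overflow occurred.</para>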
				655
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	656	</sect2>
				657
				658
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	659	<sect2 id="cg-manual.cg_merge" xreflabel="cg_merge">
				660	<title>Merging Profiles with cg_merge</title>
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	661
				662	<para>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	663	cg_merge is a simple program which
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	664	reads multiple profile files, as created by Cachegrind, merges them
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	665	together, and writes the results into another file in the same format.
				666	You can then examine the merged results using
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	667	<computeroutput>cg_annotate &lt;filename&gt;</computeroutput>, as
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	668	described above. The merging functionality might be useful if you
				669	want to aggregate costs over multiple runs of the same program, or
				670	from a single parallel run with multiple instances of the same
				671	program.</para>
				672
				673	<para>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	674	cg_merge is invoked as follows:
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	675	</para>
				676
				677	<programlisting><![CDATA[
				678	cg_merge -o outputfile file1 file2 file3 ...]]></programlisting>
				679
				680	<para>
				681	It reads and checks <computeroutput>file1</computeroutput>, then reads
				682	and checks <computeroutput>file2</computeroutput> and merges it into
				683	the running totals, then the same with
				684	<computeroutput>file3</computeroutput>, etc. The final results are
				685	written to <computeroutput>outputfile</computeroutput>, or to standard
				686	out if no output file is specified.</para>
				687
				688	<para>
				689	Costs are summed on a per-function, per-line and per-instruction
				690	basis. Because of this, the order of the input files does not
				691	matter, although you should take care to only mention each file once,
				692	since any file mentioned twice will be added in twice.</para>
				693
				694	<para>
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	695	cg_merge does not attempt to check
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	696	that the input files come from runs of the same executable. It will
				697	happily merge together profile files from completely unrelated
				698	programs. It does however check that the
				699	<computeroutput>Events:</computeroutput> lines of all the inputs are
				700	identical, so as to ensure that the addition of costs makes sense.
				701	For example, it would be nonsensical for it to add a number indicating
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	702	D1 read references to a number from a different file indicating LL
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	703	write misses.</para>
				704
				705	<para>
				706	A number of other syntax and sanity checks are done whilst reading the
njn	374a36d	2007-11-23 01:41:32 +0000	[diff] [blame]	707	inputs. cg_merge will stop and
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	708	attempt to print a helpful error message if any of the input files
				709	fail these checks.</para>
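<para>The merging rules described in this section can be sketched in C as
follows. This is only an illustration, not cg_merge's actual code: the
<computeroutput>struct profile</computeroutput> layout, the
<computeroutput>MAX_EVENTS</computeroutput> limit and the
<computeroutput>merge_into</computeroutput> function are all hypothetical, and
a real profile also carries per-file, per-function and per-line records that
are omitted here.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 16   /* arbitrary limit chosen for this sketch */

/* Hypothetical in-memory form of one profile: the text of its "Events:"
 * line and one running total per event. */
struct profile {
    const char *events;             /* e.g. "Ir I1mr ILmr" */
    size_t n_events;
    long long counts[MAX_EVENTS];
};

/* Adds src into dst. Returns 0 on success, or -1 (leaving dst unchanged)
 * if the event lists differ, since summing unlike events is meaningless. */
int merge_into(struct profile *dst, const struct profile *src)
{
    assert(dst->n_events <= MAX_EVENTS && src->n_events <= MAX_EVENTS);
    if (dst->n_events != src->n_events ||
        strcmp(dst->events, src->events) != 0)
        return -1;
    for (size_t i = 0; i < dst->n_events; i++)
        dst->counts[i] += src->counts[i];
    return 0;
}
```

<para>Because the operation is plain addition, merging the inputs in a
different order gives the same totals, and merging the same input twice counts
it twice, as noted above.</para>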
				710
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	711	</sect2>
				712
				713
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	714	<sect2 id="cg-manual.cg_diff" xreflabel="cg_diff">
				715	<title>Differencing Profiles with cg_diff</title>
				716
				717	<para>
				718	cg_diff is a simple program which
				719	reads two profile files, as created by Cachegrind, finds the difference
				720	between them, and writes the results into another file in the same format.
				721	You can then examine the differences using
				722	<computeroutput>cg_annotate &lt;filename&gt;</computeroutput>, as
				723	described above. This is very useful if you want to measure how a change to
				724	a program affected its performance.
				725	</para>
				726
				727	<para>
				728	cg_diff is invoked as follows:
				729	</para>
				730
				731	<programlisting><![CDATA[
				732	cg_diff file1 file2]]></programlisting>
				733
				734	<para>
				735	It reads and checks <computeroutput>file1</computeroutput>, then reads
				736	and checks <computeroutput>file2</computeroutput>, then computes the
				737	difference (effectively <computeroutput>file1</computeroutput> -
				738	<computeroutput>file2</computeroutput>). The final results are written to
				739	standard output.</para>
				740
				741	<para>
				742	Costs are summed on a per-function basis. Per-line costs are not summed,
				743	because doing so is too difficult. For example, consider differencing two
				744	profiles, one from a single-file program A, and one from the same program A
				745	where a single blank line was inserted at the top of the file. Every single
				746	per-line count has changed. In comparison, the per-function counts have not
				747	changed. The per-function count differences are still very useful for
				748	determining differences between programs. Note that because the result is
				749	the difference of two profiles, many of the counts will be negative; this
				750	indicates that the counts for the relevant function are fewer in the second
				751	version than those in the first version.</para>
				752
				753	<para>
				754	cg_diff does not attempt to check
				755	that the input files come from runs of the same executable. It will
				756	happily difference profile files from completely unrelated
				757	programs. It does however check that the
				758	<computeroutput>Events:</computeroutput> lines of both inputs are
				759	identical, so as to ensure that the subtraction of costs makes sense.
				760	For example, it would be nonsensical for it to subtract a number indicating
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	761	LL write misses in one file from a number indicating D1 read references
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	762	in the other.</para>
				763
				764	<para>
				765	A number of other syntax and sanity checks are done whilst reading the
				766	inputs. cg_diff will stop and
				767	attempt to print a helpful error message if any of the input files
				768	fail these checks.</para>
				769
				770	<para>
				771	Sometimes you will want to compare Cachegrind profiles of two versions of a
				772	program that you have sitting side-by-side. For example, you might have
				773	<computeroutput>version1/prog.c</computeroutput> and
				774	<computeroutput>version2/prog.c</computeroutput>, where the second is
				775	slightly different to the first. A straight comparison of the two will not
				776	be useful -- because functions are qualified with filenames, a function
				777	<function>f</function> will be listed as
				778	<computeroutput>version1/prog.c:f</computeroutput> for the first version but
				779	<computeroutput>version2/prog.c:f</computeroutput> for the second
				780	version.</para>
				781
				782	<para>
				783	When this happens, you can use the <option>--mod-filename</option> option.
				784	Its argument is a Perl search-and-replace expression that will be applied
				785	to all the filenames in both Cachegrind output files. It can be used to
				786	remove minor differences in filenames. For example, the option
				787	<option>--mod-filename='s/version[0-9]/versionN/'</option> will suffice for
				788	this case.</para>
				789
njn	e5930da	2010-12-17 00:45:19 +0000	[diff] [blame]	790	<para>
				791	Similarly, sometimes compilers auto-generate certain functions and give them
				792	randomized names. For example, GCC sometimes auto-generates functions with
				793	names like <function>T.1234</function>, and the suffixes vary from build to
				794	build. You can use the <option>--mod-funcname</option> option to remove
				795	small differences like these; it works in the same way as
				796	<option>--mod-filename</option>.</para>
				797
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	798	</sect2>
				799
				800
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	801	</sect1>
				802
				803
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	804
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	805	<sect1 id="cg-manual.cgopts" xreflabel="Cachegrind Command-line Options">
				806	<title>Cachegrind Command-line Options</title>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	807
				808	<!-- start of xi:include in the manpage -->
				809	<para>Cachegrind-specific options are:</para>
				810
				811	<variablelist id="cg.opts.list">
				812
				813	<varlistentry id="opt.I1" xreflabel="--I1">
				814	<term>
				815	<option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
				816	</term>
				817	<listitem>
				818	<para>Specify the size, associativity and line size of the level 1
				819	instruction cache. </para>
				820	</listitem>
				821	</varlistentry>
				822
				823	<varlistentry id="opt.D1" xreflabel="--D1">
				824	<term>
				825	<option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
				826	</term>
				827	<listitem>
				828	<para>Specify the size, associativity and line size of the level 1
				829	data cache.</para>
				830	</listitem>
				831	</varlistentry>
				832
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	833	<varlistentry id="opt.LL" xreflabel="--LL">
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	834	<term>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	835	<option><![CDATA[--LL=<size>,<associativity>,<line size> ]]></option>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	836	</term>
				837	<listitem>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	838	<para>Specify the size, associativity and line size of the last-level
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	839	cache.</para>
				840	</listitem>
				841	</varlistentry>
				842
				843	<varlistentry id="opt.cache-sim" xreflabel="--cache-sim">
				844	<term>
				845	<option><![CDATA[--cache-sim=no|yes [yes] ]]></option>
				846	</term>
				847	<listitem>
				848	<para>Enables or disables collection of cache access and miss
				849	counts.</para>
				850	</listitem>
				851	</varlistentry>
				852
				853	<varlistentry id="opt.branch-sim" xreflabel="--branch-sim">
				854	<term>
				855	<option><![CDATA[--branch-sim=no|yes [no] ]]></option>
				856	</term>
				857	<listitem>
				858	<para>Enables or disables collection of branch instruction and
				859	misprediction counts. By default this is disabled as it
				860	slows Cachegrind down by approximately 25%. Note that you
				861	cannot specify <option>--cache-sim=no</option>
				862	and <option>--branch-sim=no</option>
				863	together, as that would leave Cachegrind with no
				864	information to collect.</para>
				865	</listitem>
				866	</varlistentry>
				867
				868	<varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
				869	<term>
				870	<option><![CDATA[--cachegrind-out-file=<file> ]]></option>
				871	</term>
				872	<listitem>
				873	<para>Write the profile data to
				874	<computeroutput>file</computeroutput> rather than to the default
				875	output file,
				876	<filename>cachegrind.out.&lt;pid&gt;</filename>. The
				877	<option>%p</option> and <option>%q</option> format specifiers
				878	can be used to embed the process ID and/or the contents of an
				879	environment variable in the name, as is the case for the core
				880	option <option><xref linkend="opt.log-file"/></option>.
				881	</para>
				882	</listitem>
				883	</varlistentry>
				884
				885	</variablelist>
				886	<!-- end of xi:include in the manpage -->
				887
				888	</sect1>
				889
				890
				891
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	892	<sect1 id="cg-manual.annopts" xreflabel="cg_annotate Command-line Options">
				893	<title>cg_annotate Command-line Options</title>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	894
njn	c206a81	2009-08-07 07:56:20 +0000	[diff] [blame]	895	<!-- start of xi:include in the manpage -->
				896	<variablelist id="cg_annotate.opts.list">
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	897
				898	<varlistentry>
				899	<term>
				900	<option><![CDATA[-h --help ]]></option>
				901	</term>
				902	<listitem>
				903	<para>Show the help message.</para>
				904	</listitem>
				905	</varlistentry>
				906
				907	<varlistentry>
				908	<term>
				909	<option><![CDATA[--version ]]></option>
				910	</term>
				911	<listitem>
				912	<para>Show the version number.</para>
				913	</listitem>
				914	</varlistentry>
				915
				916	<varlistentry>
				917	<term>
				918	<option><![CDATA[--show=A,B,C [default: all, using order in
				919	cachegrind.out.<pid>] ]]></option>
				920	</term>
				921	<listitem>
				922	<para>Specifies which events to show (and the column
				923	order). Default is to use all present in the
				924	<filename>cachegrind.out.&lt;pid&gt;</filename> file (and
				925	use the order in the file). Useful if you want to concentrate on, for
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	926	example, I cache misses (<option>--show=I1mr,ILmr</option>), or data
				927	read misses (<option>--show=D1mr,DLmr</option>), or LL data misses
				928	(<option>--show=DLmr,DLmw</option>). Best used in conjunction with
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	929	<option>--sort</option>.</para>
				930	</listitem>
				931	</varlistentry>
				932
				933	<varlistentry>
				934	<term>
				935	<option><![CDATA[--sort=A,B,C [default: order in
				936	cachegrind.out.<pid>] ]]></option>
				937	</term>
				938	<listitem>
				939	<para>Specifies the events upon which the sorting of the
				940	function-by-function entries will be based.</para>
				941	</listitem>
				942	</varlistentry>
				943
				944	<varlistentry>
				945	<term>
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	946	<option><![CDATA[--threshold=X [default: 0.1%] ]]></option>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	947	</term>
				948	<listitem>
				949	<para>Sets the threshold for the function-by-function
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	950	summary. A function is shown if it accounts for more than X%
				951	of the counts for the primary sort event. If auto-annotating, also
				952	affects which files are annotated.</para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	953
				954	<para>Note: thresholds can be set for more than one of the
				955	events by appending any events for the
				956	<option>--sort</option> option with a colon
				957	and a number (no spaces, though). E.g. if you want to see
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	958	each function that covers more than 1% of LL read misses or 1% of LL
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	959	write misses, use this option:</para>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	960	<para><option>--sort=DLmr:1,DLmw:1</option></para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	961	</listitem>
				962	</varlistentry>
				963
				964	<varlistentry>
				965	<term>
				966	<option><![CDATA[--auto=<no|yes> [default: no] ]]></option>
				967	</term>
				968	<listitem>
				969	<para>When enabled, automatically annotates every file that
				970	is mentioned in the function-by-function summary that can be
				971	found. Also gives a list of those that couldn't be found.</para>
				972	</listitem>
				973	</varlistentry>
				974
				975	<varlistentry>
				976	<term>
				977	<option><![CDATA[--context=N [default: 8] ]]></option>
				978	</term>
				979	<listitem>
				980	<para>Print N lines of context before and after each
				981	annotated line. Avoids printing large sections of source
				982	files that were not executed. Use a large number
				983	(e.g. 100000) to show all source lines.</para>
				984	</listitem>
				985	</varlistentry>
				986
				987	<varlistentry>
				988	<term>
				989	<option><![CDATA[-I<dir> --include=<dir> [default: none] ]]></option>
				990	</term>
				991	<listitem>
				992	<para>Adds a directory to the list in which to search for
				993	files. Multiple <option>-I</option>/<option>--include</option>
				994	options can be given to add multiple directories.</para>
				995	</listitem>
				996	</varlistentry>
				997
				998	</variablelist>
njn	c206a81	2009-08-07 07:56:20 +0000	[diff] [blame]	999	<!-- end of xi:include in the manpage -->
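<para>The <option>--threshold</option> rule described above amounts to a
simple percentage test. The following C sketch shows that test under the
assumption that the threshold is compared against each function's share of the
program-wide total for the primary sort event; the function name
<computeroutput>above_threshold</computeroutput> is hypothetical and this is
not cg_annotate's own code.</para>

```c
#include <assert.h>

/* Returns 1 if a function with `count` events would be listed, given the
 * program-wide `total` for the primary sort event and a threshold expressed
 * in percent (0.1 means 0.1%). Hypothetical helper for illustration. */
int above_threshold(unsigned long long count, unsigned long long total,
                    double threshold_pct)
{
    assert(total > 0 && count <= total);
    return 100.0 * (double)count / (double)total > threshold_pct;
}
```

<para>With the default threshold of 0.1%, a function responsible for 5 out of
1,000 events (0.5%) would be listed, while one responsible for 1 out of 10,000
(0.01%) would not.</para>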
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1000
				1001	</sect1>
				1002
				1003
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	1004	<sect1 id="cg-manual.diffopts" xreflabel="cg_diff Command-line Options">
				1005	<title>cg_diff Command-line Options</title>
				1006
				1007	<!-- start of xi:include in the manpage -->
				1008	<variablelist id="cg_diff.opts.list">
				1009
				1010	<varlistentry>
				1011	<term>
				1012	<option><![CDATA[-h --help ]]></option>
				1013	</term>
				1014	<listitem>
				1015	<para>Show the help message.</para>
				1016	</listitem>
				1017	</varlistentry>
				1018
				1019	<varlistentry>
				1020	<term>
				1021	<option><![CDATA[--version ]]></option>
				1022	</term>
				1023	<listitem>
				1024	<para>Show the version number.</para>
				1025	</listitem>
				1026	</varlistentry>
				1027
				1028	<varlistentry>
				1029	<term>
				1030	<option><![CDATA[--mod-filename=<expr> [default: none]]]></option>
				1031	</term>
				1032	<listitem>
				1033	<para>Specifies a Perl search-and-replace expression that is applied
				1034	to all filenames. Useful for removing minor differences in paths
				1035	between two different versions of a program that are sitting in
				1036	different directories.</para>
				1037	</listitem>
				1038	</varlistentry>
				1039
njn	e5930da	2010-12-17 00:45:19 +0000	[diff] [blame]	1040	<varlistentry>
				1041	<term>
				1042	<option><![CDATA[--mod-funcname=<expr> [default: none]]]></option>
				1043	</term>
				1044	<listitem>
				1045	<para>Like <option>--mod-filename</option>, but for function names.
				1046	Useful for removing minor differences in randomized names of
				1047	auto-generated functions generated by some compilers.</para>
				1048	</listitem>
				1049	</varlistentry>
				1050
njn	69d495d	2010-06-30 05:23:34 +0000	[diff] [blame]	1051	</variablelist>
				1052	<!-- end of xi:include in the manpage -->
				1053
				1054	</sect1>
				1055
				1056
				1057
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1058
sewardj	778d783	2007-11-22 01:21:56 +0000	[diff] [blame]	1059	<sect1 id="cg-manual.acting-on"
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1060	xreflabel="Acting on Cachegrind's Information">
				1061	<title>Acting on Cachegrind's Information</title>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1062	<para>
njn	a31dac2	2009-07-30 03:21:42 +0000	[diff] [blame]	1063	Cachegrind gives you lots of information, but acting on that information
				1064	isn't always easy. Here are some rules of thumb that we have found to be
njn	07f9656	2007-09-17 22:28:21 +0000	[diff] [blame]	1065	useful.</para>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1066
				1067	<para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1068	First of all, the global hit/miss counts and miss rates are not that useful.
				1069	If you have multiple programs or multiple runs of a program, comparing the
				1070	numbers might identify if any are outliers and worthy of closer
				1071	investigation. Otherwise, they're not enough to act on.</para>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1072
				1073	<para>
njn	a31dac2	2009-07-30 03:21:42 +0000	[diff] [blame]	1074	The function-by-function counts are more useful to look at, as they pinpoint
				1075	which functions are causing large numbers of counts. However, beware that
				1076	inlining can make these counts misleading. If a function
				1077	<function>f</function> is always inlined, counts will be attributed to the
				1078	functions it is inlined into, rather than itself. However, if you look at
				1079	the line-by-line annotations for <function>f</function> you'll see the
				1080	counts that belong to <function>f</function>. (This is hard to avoid; it's
				1081	how the debug info is structured.) So it's worth looking for large numbers
				1082	in the line-by-line annotations.</para>
				1083
				1084	<para>
njn	07f9656	2007-09-17 22:28:21 +0000	[diff] [blame]	1085	The line-by-line source code annotations are much more useful. In our
				1086	experience, the best place to start is by looking at the
				1087	<computeroutput>Ir</computeroutput> numbers. They simply measure how many
				1088	instructions were executed for each line, and don't include any cache
				1089	information, but they can still be very useful for identifying
				1090	bottlenecks.</para>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1091
				1092	<para>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1093	After that, we have found that LL misses are typically a much bigger source
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1094	of slow-downs than L1 misses. So it's worth looking for any snippets of
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1095	code with high <computeroutput>DLmr</computeroutput> or
				1096	<computeroutput>DLmw</computeroutput> counts. (You can use
				1097	<option>--show=DLmr
				1098	--sort=DLmr</option> with cg_annotate to focus just on
				1099	<literal>DLmr</literal> counts, for example.) If you find any, it's still
njn	a31dac2	2009-07-30 03:21:42 +0000	[diff] [blame]	1100	not always easy to work out how to improve things. You need to have a
njn	07f9656	2007-09-17 22:28:21 +0000	[diff] [blame]	1101	reasonable understanding of how caches work, the principles of locality, and
				1102	your program's data access patterns. Improving things may require
				1103	redesigning a data structure, for example.</para>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1104
				1105	<para>
njn	a31dac2	2009-07-30 03:21:42 +0000	[diff] [blame]	1106	Looking at the <computeroutput>Bcm</computeroutput> and
				1107	<computeroutput>Bim</computeroutput> misses can also be helpful.
				1108	In particular, <computeroutput>Bim</computeroutput> misses are often caused
				1109	by <literal>switch</literal> statements, and in some cases these
				1110	<literal>switch</literal> statements can be replaced with table-driven code.
				1111	For example, you might replace code like this:</para>
				1112
				1113	<programlisting><![CDATA[
				1114	enum E { A, B, C };
				1115	enum E e;
				1116	int i;
				1117	...
				1118	switch (e)
				1119	{
tom	270e2a3	2011-08-15 11:11:41 +0000	[diff] [blame]	1120	case A: i += 1; break;
				1121	case B: i += 2; break;
				1122	case C: i += 3; break;
njn	a31dac2	2009-07-30 03:21:42 +0000	[diff] [blame]	1123	}
				1124	]]></programlisting>
				1125
				1126	<para>with code like this:</para>
				1127
				1128	<programlisting><![CDATA[
				1129	enum E { A, B, C };
				1130	enum E e;
				1131	int table[] = { 1, 2, 3 };
				1132	int i;
				1133	...
				1134	i += table[e];
				1135	]]></programlisting>
				1136
				1137	<para>
				1138	This is obviously a contrived example, but the basic principle applies in a
				1139	wide variety of situations.</para>
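<para>To experiment with the two forms under Cachegrind, it can help to wrap
each fragment in a function so that they can be compared directly. The
following is a minimal sketch; the function names
<computeroutput>add_switch</computeroutput> and
<computeroutput>add_table</computeroutput> are made up for this example.
Whether the <literal>switch</literal> version really produces an indirect
branch depends on the compiler and optimisation level, so check the
<computeroutput>Bim</computeroutput> counts rather than assuming a
difference.</para>

```c
#include <assert.h>

enum E { A, B, C };

/* Switch-based version: may be compiled to an indirect jump. */
int add_switch(enum E e, int i)
{
    switch (e) {
    case A: i += 1; break;
    case B: i += 2; break;
    case C: i += 3; break;
    }
    return i;
}

/* Table-driven version: one indexed load instead of a branch on e. */
int add_table(enum E e, int i)
{
    static const int table[] = { 1, 2, 3 };
    assert((unsigned)e <= (unsigned)C);   /* index must stay in range */
    return i + table[e];
}
```

<para>Both functions return the same result for every value of
<computeroutput>enum E</computeroutput>, so one can be swapped for the other
and the two profiles compared.</para>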
				1140
				1141	<para>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1142	In short, Cachegrind can tell you where some of the bottlenecks in your code
				1143	are, but it can't tell you how to fix them. You have to work that out for
				1144	yourself. But at least you have the information!
				1145	</para>
				1146
				1147	</sect1>
sewardj	94dc508	2007-02-08 11:31:03 +0000	[diff] [blame]	1148
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1149
				1150	<sect1 id="cg-manual.sim-details"
				1151	xreflabel="Simulation Details">
				1152	<title>Simulation Details</title>
				1153	<para>
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
				1156	</para>
				1157
				1158	<sect2 id="cache-sim" xreflabel="Cache Simulation Specifics">
				1159	<title>Cache Simulation Specifics</title>
				1160
				1161	<para>Specific characteristics of the cache simulation are as
				1162	follows:</para>
				1163
				1164	<itemizedlist>
				1165
				1166	<listitem>
				1167	<para>Write-allocate: when a write miss occurs, the block
				1168	written to is brought into the D1 cache. Most modern caches
				1169	have this property.</para>
				1170	</listitem>
				1171
				1172	<listitem>
				1173	<para>Bit-selection hash function: the set of line(s) in the cache
				1174	to which a memory block maps is chosen by the middle bits
				1175	M--(M+N-1) of the byte address, where:</para>
				1176	<itemizedlist>
				1177	<listitem>
				1178	<para>line size = 2^M bytes</para>
				1179	</listitem>
				1180	<listitem>
<para>(cache size / line size / associativity) = 2^N sets</para>
				1182	</listitem>
				1183	</itemizedlist>
				1184	</listitem>
				1185
				1186	<listitem>
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1187	<para>Inclusive LL cache: the LL cache typically replicates all
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1188	the entries of the L1 caches, because fetching into L1 involves
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1189	fetching into LL first (this does not guarantee strict inclusiveness,
				1190	as lines evicted from LL still could reside in L1). This is
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1191	standard on Pentium chips, but AMD Opterons, Athlons and Durons
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1192	use an exclusive LL cache that only holds
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1193	blocks evicted from L1. Ditto most modern VIA CPUs.</para>
				1194	</listitem>
				1195
				1196	</itemizedlist>
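
<para>As a rough illustration of the bit-selection scheme in the list above,
the following C sketch computes the set index for a byte address. The
function names and the example cache parameters are assumptions made for
this illustration; it is not taken from Cachegrind's sources.</para>

<programlisting><![CDATA[
#include <stdint.h>
#include <stdio.h>

/* Return log2(x) for a power of two x (hypothetical helper). */
static unsigned log2_pow2(uint64_t x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Select the set for `addr` using bits M..(M+N-1) of the byte address. */
static uint64_t set_index(uint64_t addr, uint64_t cache_size,
                          uint64_t line_size, uint64_t assoc)
{
    uint64_t n_sets = cache_size / line_size / assoc;  /* 2^N sets */
    unsigned m = log2_pow2(line_size);                 /* line size = 2^M */
    return (addr >> m) & (n_sets - 1);
}

int main(void)
{
    /* Example: 32 KiB, 8-way, 64-byte lines gives 64 sets (M = 6, N = 6). */
    uint64_t addr = 0x12345;
    printf("set = %llu\n",
           (unsigned long long)set_index(addr, 32768, 64, 8));
    return 0;
}
]]></programlisting>

<para>Because only the middle bits are used, two addresses that differ by a
multiple of (line size * number of sets) map to the same set.</para>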
				1197
				1198	<para>The cache configuration simulated (cache size,
				1199	associativity and line size) is determined automatically using
				1200	the x86 CPUID instruction. If you have a machine that (a)
				1201	doesn't support the CPUID instruction, or (b) supports it in an
				1202	early incarnation that doesn't give any cache information, then
				1203	Cachegrind will fall back to using a default configuration (that
				1204	of a model 3/4 Athlon). Cachegrind will tell you if this
				1205	happens. You can manually specify one, two or all three levels
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1206	(I1/D1/LL) of the cache from the command line using the
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1207	<option>--I1</option>,
				1208	<option>--D1</option> and
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1209	<option>--LL</option> options.
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1210	For cache parameters to be valid for simulation, the number
				1211	of sets (with associativity being the number of cache lines in
				1212	each set) has to be a power of two.</para>
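
<para>The power-of-two requirement can be checked with a few lines of C. This
is only a sketch: the function names are hypothetical, and the extra
divisibility test is an assumption of this example rather than a documented
Cachegrind rule.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical check mirroring the rule in the text: the number of sets,
   size / line_size / assoc, must be a power of two. */
static bool sets_are_power_of_two(uint64_t size, uint64_t assoc,
                                  uint64_t line_size)
{
    if (assoc == 0 || line_size == 0) return false;
    /* Assumption of this sketch: the size must divide evenly into sets. */
    if (size % (assoc * line_size) != 0) return false;
    return is_power_of_two(size / line_size / assoc);
}

int main(void)
{
    /* 32768 / 64 / 8 = 64 sets; 49152 / 64 / 8 = 96 sets. */
    printf("%d\n", sets_are_power_of_two(32768, 8, 64));
    printf("%d\n", sets_are_power_of_two(49152, 8, 64));
    return 0;
}
]]></programlisting>

<para>Note that the cache size itself need not be a power of two: in this
sketch a 49152-byte, 12-way cache with 64-byte lines has 64 sets and passes
the check.</para>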
				1213
				1214	<para>On PowerPC platforms
				1215	Cachegrind cannot automatically
				1216	determine the cache configuration, so you will
				1217	need to specify it with the
				1218	<option>--I1</option>,
				1219	<option>--D1</option> and
njn	2d853a1	2010-10-06 22:46:31 +0000	[diff] [blame]	1220	<option>--LL</option> options.</para>
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1221
				1222
				1223	<para>Other noteworthy behaviour:</para>
				1224
				1225	<itemizedlist>
				1226	<listitem>
				1227	<para>References that straddle two cache lines are treated as
				1228	follows:</para>
				1229	<itemizedlist>
				1230	<listitem>
				1231	<para>If both blocks hit --> counted as one hit</para>
				1232	</listitem>
				1233	<listitem>
				1234	<para>If one block hits, the other misses --> counted
				1235	as one miss.</para>
				1236	</listitem>
				1237	<listitem>
				1238	<para>If both blocks miss --> counted as one miss (not
				1239	two)</para>
				1240	</listitem>
				1241	</itemizedlist>
				1242	</listitem>
				1243
				1244	<listitem>
				1245	<para>Instructions that modify a memory location
				1246	(e.g. <computeroutput>inc</computeroutput> and
				1247	<computeroutput>dec</computeroutput>) are counted as doing
				1248	just a read, i.e. a single data reference. This may seem
				1249	strange, but since the write can never cause a miss (the read
				1250	guarantees the block is in the cache) it's not very
				1251	interesting.</para>
				1252
				1253	<para>Thus it measures not the number of times the data cache
				1254	is accessed, but the number of times a data cache miss could
				1255	occur.</para>
				1256	</listitem>
				1257
				1258	</itemizedlist>
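
<para>The treatment of references that straddle two cache lines can be
sketched in C as follows. The 64-byte line size and both function names are
assumptions made for this example; the sketch only restates the counting
rule given above.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u   /* assumed line size for this sketch */

/* Does an access of `size` bytes at `addr` touch two cache lines? */
static bool straddles(uint64_t addr, unsigned size)
{
    return (addr / LINE_SIZE) != ((addr + size - 1) / LINE_SIZE);
}

/* Combine the per-line outcomes into a single event: any miss makes the
   whole reference count as exactly one miss, never two. */
static unsigned misses_for_reference(bool first_line_hit,
                                     bool second_line_hit)
{
    return (first_line_hit && second_line_hit) ? 0u : 1u;
}

int main(void)
{
    /* An 8-byte access at offset 60 covers bytes 60..67: two lines. */
    printf("straddles: %d\n", straddles(60, 8));
    printf("misses (hit, miss):  %u\n", misses_for_reference(true, false));
    printf("misses (miss, miss): %u\n", misses_for_reference(false, false));
    return 0;
}
]]></programlisting>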
				1259
				1260	<para>If you are interested in simulating a cache with different
				1261	properties, it is not particularly hard to write your own cache
				1262	simulator, or to modify the existing ones in
				1263	<computeroutput>cg_sim.c</computeroutput>. We'd be
				1264	interested to hear from anyone who does.</para>
				1265
				1266	</sect2>
				1267
				1268
				1269	<sect2 id="branch-sim" xreflabel="Branch Simulation Specifics">
				1270	<title>Branch Simulation Specifics</title>
				1271
				1272	<para>Cachegrind simulates branch predictors intended to be
				1273	typical of mainstream desktop/server processors of around 2004.</para>
				1274
				1275	<para>Conditional branches are predicted using an array of 16384 2-bit
				1276	saturating counters. The array index used for a branch instruction is
				1277	computed partly from the low-order bits of the branch instruction's
				1278	address and partly using the taken/not-taken behaviour of the last few
				1279	conditional branches. As a result the predictions for any specific
				1280	branch depend both on its own history and the behaviour of previous
				1281	branches. This is a standard technique for improving prediction
				1282	accuracy.</para>
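
<para>The following C sketch shows one way such a predictor could be
organised. The table size comes from the description above, but the split
of the 14 index bits into 8 history bits and 6 address bits, and all the
names, are assumptions made for this example; Cachegrind's actual indexing
may differ.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N_COUNTERS  16384u   /* from the text */
#define N_HIST_BITS 8u       /* assumed: 8 bits of global history ... */
#define N_ADDR_BITS 6u       /* ... plus 6 address bits = 14 index bits */

static uint8_t  counters[N_COUNTERS];  /* 2-bit counters, values 0..3 */
static uint32_t history;               /* recent outcomes, 1 = taken */

/* Predict one conditional branch, update the predictor state, and
   return true if the prediction was wrong. */
static bool cond_branch_mispredicted(uint64_t instr_addr, bool taken)
{
    uint32_t addr_bits = (uint32_t)(instr_addr & ((1u << N_ADDR_BITS) - 1));
    uint32_t hist_bits = history & ((1u << N_HIST_BITS) - 1);
    uint32_t idx = (hist_bits << N_ADDR_BITS) | addr_bits;

    bool predicted_taken = counters[idx] >= 2;

    /* Saturating update: move towards 3 if taken, towards 0 otherwise. */
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;

    history = (history << 1) | (taken ? 1u : 0u);
    return predicted_taken != taken;
}

int main(void)
{
    unsigned mispredicts = 0;
    /* A branch at one address that is taken 100 times in a row. */
    for (int i = 0; i < 100; i++)
        mispredicts += cond_branch_mispredicted(0x400123, true);
    printf("mispredicts: %u\n", mispredicts);
    return 0;
}
]]></programlisting>

<para>Because the history is part of the index, the same branch uses
different counters depending on what the preceding branches did, which is
why a branch's predictions depend on more than its own behaviour.</para>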
				1283
				1284	<para>For indirect branches (that is, jumps to unknown destinations)
				1285	Cachegrind uses a simple branch target address predictor. Targets are
				1286	predicted using an array of 512 entries indexed by the low order 9
				1287	bits of the branch instruction's address. Each branch is predicted to
				1288	jump to the same address it did last time. Any other behaviour causes
				1289	a mispredict.</para>
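
<para>A minimal C sketch of this "same target as last time" scheme is shown
below. The table size and the 9-bit index come from the description above;
the names and the example addresses are hypothetical.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N_TARGETS 512u   /* from the text: 512 entries, 9 index bits */

static uint64_t last_target[N_TARGETS];

/* Predict that the branch jumps where it went last time; record the
   actual target and return true if the prediction was wrong. */
static bool indirect_branch_mispredicted(uint64_t instr_addr,
                                         uint64_t actual_target)
{
    uint32_t idx = (uint32_t)(instr_addr & (N_TARGETS - 1)); /* low 9 bits */
    bool mispredict = last_target[idx] != actual_target;
    last_target[idx] = actual_target;
    return mispredict;
}

int main(void)
{
    unsigned mispredicts = 0;
    /* One indirect call site that alternates between two targets. */
    for (int i = 0; i < 10; i++) {
        uint64_t target = (i % 2 == 0) ? 0x401000 : 0x402000;
        mispredicts += indirect_branch_mispredicted(0x400500, target);
    }
    printf("mispredicts: %u\n", mispredicts);
    return 0;
}
]]></programlisting>

<para>A call site that alternates between targets, as in this example,
defeats the scheme entirely, which is one reason indirect branches can show
up prominently in the <computeroutput>Bim</computeroutput> counts.</para>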
				1290
				1291	<para>More recent processors have better branch predictors, in
				1292	particular better indirect branch predictors. Cachegrind's predictor
				1293	design is deliberately conservative so as to be representative of the
				1294	large installed base of processors which pre-date widespread
				1295	deployment of more sophisticated indirect branch predictors. In
				1296	particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
				1297	2 have more sophisticated indirect branch predictors than modelled by
				1298	Cachegrind. </para>
				1299
				1300	<para>Cachegrind does not simulate a return stack predictor. It
				1301	assumes that processors perfectly predict function return addresses,
				1302	an assumption which is probably close to being true.</para>
				1303
				1304	<para>See Hennessy and Patterson's classic text "Computer
				1305	Architecture: A Quantitative Approach", 4th edition (2007), Section
				1306	2.3 (pages 80-89) for background on modern branch predictors.</para>
				1307
				1308	</sect2>
				1309
				1310	<sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
				1311	<title>Accuracy</title>
				1312
				1313	<para>Valgrind's cache profiling has a number of
				1314	shortcomings:</para>
				1315
				1316	<itemizedlist>
				1317	<listitem>
				1318	<para>It doesn't account for kernel activity -- the effect of system
				1319	calls on the cache and branch predictor contents is ignored.</para>
				1320	</listitem>
				1321
				1322	<listitem>
				1323	<para>It doesn't account for other process activity.
				1324	This is probably desirable when considering a single
				1325	program.</para>
				1326	</listitem>
				1327
				1328	<listitem>
				1329	<para>It doesn't account for virtual-to-physical address
				1330	mappings. Hence the simulation is not a true
				1331	representation of what's happening in the
				1332	cache. Most caches and branch predictors are physically indexed, but
				1333	Cachegrind simulates caches using virtual addresses.</para>
				1334	</listitem>
				1335
				1336	<listitem>
				1337	<para>It doesn't account for cache misses not visible at the
				1338	instruction level, e.g. those arising from TLB misses, or
				1339	speculative execution.</para>
				1340	</listitem>
				1341
				1342	<listitem>
				1343	<para>Valgrind will schedule
				1344	threads differently from how they would be when running natively.
				1345	This could warp the results for threaded programs.</para>
				1346	</listitem>
				1347
				1348	<listitem>
				1349	<para>The x86/amd64 instructions <computeroutput>bts</computeroutput>,
				1350	<computeroutput>btr</computeroutput> and
				1351	<computeroutput>btc</computeroutput> will incorrectly be
				1352	counted as doing a data read if both the arguments are
registers, e.g.:</para>
				1354	<programlisting><![CDATA[
				1355	btsl %eax, %edx]]></programlisting>
				1356
				1357	<para>This should only happen rarely.</para>
				1358	</listitem>
				1359
				1360	<listitem>
				1361	<para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
				1362	(e.g. <computeroutput>fsave</computeroutput>) are treated as
				1363	though they only access 16 bytes. These instructions seem to
				1364	be rare so hopefully this won't affect accuracy much.</para>
				1365	</listitem>
				1366
				1367	</itemizedlist>
				1368
				1369	<para>Another thing worth noting is that results are very sensitive.
Changing the size of the executable being profiled, or the sizes
				1371	of any of the shared libraries it uses, or even the length of their
				1372	file names, can perturb the results. Variations will be small, but
				1373	don't expect perfectly repeatable results if your program changes at
				1374	all.</para>
				1375
				1376	<para>More recent GNU/Linux distributions do address space
				1377	randomisation, in which identical runs of the same program have their
				1378	shared libraries loaded at different locations, as a security measure.
				1379	This also perturbs the results.</para>
				1380
				1381	<para>While these factors mean you shouldn't trust the results to
				1382	be super-accurate, they should be close enough to be useful.</para>
				1383
				1384	</sect2>
				1385
				1386	</sect1>
				1387
				1388
				1389
sewardj	778d783	2007-11-22 01:21:56 +0000	[diff] [blame]	1390	<sect1 id="cg-manual.impl-details"
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1391	xreflabel="Implementation Details">
				1392	<title>Implementation Details</title>
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1393	<para>
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1396	</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1397
sewardj	778d783	2007-11-22 01:21:56 +0000	[diff] [blame]	1398	<sect2 id="cg-manual.impl-details.how-cg-works"
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1399	xreflabel="How Cachegrind Works">
				1400	<title>How Cachegrind Works</title>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1401	<para>The best reference for understanding how Cachegrind works is chapter 3 of
				1402	"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
njn	25ac384	2009-08-07 02:58:11 +0000	[diff] [blame]	1403	is available on the <ulink url="&vg-pubs-url;">Valgrind publications
njn	011215f	2006-10-21 23:00:59 +0000	[diff] [blame]	1404	page</ulink>.</para>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1405	</sect2>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1406
sewardj	778d783	2007-11-22 01:21:56 +0000	[diff] [blame]	1407	<sect2 id="cg-manual.impl-details.file-format"
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1408	xreflabel="Cachegrind Output File Format">
				1409	<title>Cachegrind Output File Format</title>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1410	<para>The file format is fairly straightforward, basically giving the
				1411	cost centre for every line, grouped by files and
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1412	functions. It's also totally generic and self-describing, in the sense that
				1413	it can be used for any events that can be counted on a line-by-line basis,
				1414	not just cache and branch predictor events. For example, earlier versions
				1415	of Cachegrind didn't have a branch predictor simulation. When this was
				1416	added, the file format didn't need to change at all. So the format (and
				1417	consequently, cg_annotate) could be used by other tools.</para>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1418
				1419	<para>The file format:</para>
				1420	<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>
				1431
				1432	<para>Where:</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1433	<itemizedlist>
				1434	<listitem>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1435	<para><computeroutput>non_nl_string</computeroutput> is any
				1436	string not containing a newline.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1437	</listitem>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1438	<listitem>
				1439	<para><computeroutput>cmd</computeroutput> is a string holding the
				1440	command line of the profiled program.</para>
				1441	</listitem>
				1442	<listitem>
njn	2624212	2007-01-22 03:21:27 +0000	[diff] [blame]	1443	<para><computeroutput>event</computeroutput> is a string containing
				1444	no whitespace.</para>
				1445	</listitem>
				1446	<listitem>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1447	<para><computeroutput>filename</computeroutput> and
				1448	<computeroutput>fn_name</computeroutput> are strings.</para>
				1449	</listitem>
				1450	<listitem>
				1451	<para><computeroutput>num</computeroutput> and
				1452	<computeroutput>line_num</computeroutput> are decimal
				1453	numbers.</para>
				1454	</listitem>
				1455	<listitem>
				1456	<para><computeroutput>ws</computeroutput> is whitespace.</para>
				1457	</listitem>
				1458	</itemizedlist>
				1459
				1460	<para>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing simulation-specific
information, e.g. for giving the cache configuration for
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1463	cache simulation.</para>
				1464
				1465	<para>More than one line of info can be presented for each file/fn/line number.
				1466	In such cases, the counts for the named events will be accumulated.</para>
				1467
njn	3a9d5dc	2007-09-17 22:19:01 +0000	[diff] [blame]	1468	<para>Counts can be "." to represent zero. This makes the files easier for
				1469	humans to read.</para>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1470
<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If a
<computeroutput>count_line</computeroutput> has fewer counts, cg_annotate
treats the missing ones as though they were "." entries. This saves space.
</para>
njn	534f781	2006-10-21 22:22:59 +0000	[diff] [blame]	1479
				1480	<para>A <computeroutput>file_line</computeroutput> changes the
				1481	current file name. A <computeroutput>fn_line</computeroutput>
				1482	changes the current function name. A
				1483	<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. A
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>
				1489
				1490	<para>Each <computeroutput>file_line</computeroutput> will normally be
				1491	immediately followed by a <computeroutput>fn_line</computeroutput>. But it
				1492	doesn't have to be.</para>
				1493
njn	3da8196	2009-08-07 00:18:25 +0000	[diff] [blame]	1494	<para>The summary line is redundant, because it just holds the total counts
				1495	for each event. But this serves as a useful sanity check of the data; if
				1496	the totals for each event don't match the summary line, something has gone
				1497	wrong.</para>
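
<para>The sanity check described above can be sketched in C as follows. The
sketch follows only the grammar given in this section; the function names,
the <computeroutput>MAX_EVENTS</computeroutput> limit and the sample file
contents are all made up for this example, and it is not how cg_annotate
itself is implemented.</para>

<programlisting><![CDATA[
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_EVENTS 16   /* arbitrary limit for this sketch */

/* Parse up to MAX_EVENTS counts from `s` into `out`. A "." and any
   missing trailing counts are treated as zero. Returns the number of
   counts that were present. */
static int parse_counts(const char *s, unsigned long long out[MAX_EVENTS])
{
    int n = 0;
    memset(out, 0, MAX_EVENTS * sizeof out[0]);
    while (n < MAX_EVENTS) {
        while (isspace((unsigned char)*s)) s++;
        if (*s == '\0') break;
        if (*s == '.') { s++; n++; continue; }
        char *end;
        out[n] = strtoull(s, &end, 10);
        if (end == s) break;            /* not a number: stop parsing */
        s = end;
        n++;
    }
    return n;
}

/* Sum all count_lines and compare the totals with the summary_line. */
static bool totals_match_summary(const char *const *lines, int n_lines)
{
    unsigned long long totals[MAX_EVENTS] = {0};
    unsigned long long summary[MAX_EVENTS] = {0};
    unsigned long long counts[MAX_EVENTS];
    bool seen_summary = false;

    for (int i = 0; i < n_lines; i++) {
        const char *l = lines[i];
        if (strncmp(l, "summary:", 8) == 0) {
            parse_counts(l + 8, summary);
            seen_summary = true;
        } else if (isdigit((unsigned char)l[0])) {
            /* count_line: skip line_num, then accumulate the counts. */
            char *end;
            (void)strtoull(l, &end, 10);
            parse_counts(end, counts);
            for (int e = 0; e < MAX_EVENTS; e++) totals[e] += counts[e];
        }
        /* desc:, cmd:, events:, fl= and fn= lines carry no counts. */
    }
    return seen_summary && memcmp(totals, summary, sizeof totals) == 0;
}

int main(void)
{
    /* Hypothetical file contents, made up for this example. */
    const char *const lines[] = {
        "desc: example description",
        "cmd: ./myprog",
        "events: Ir Dr Dw",
        "fl=main.c",
        "fn=main",
        "10 5 2 1",
        "11 3 . 1",
        "12 4",
        "summary: 12 2 2",
    };
    int n = (int)(sizeof lines / sizeof lines[0]);
    printf("%s\n", totals_match_summary(lines, n) ? "ok" : "mismatch");
    return 0;
}
]]></programlisting>

<para>The sample data exercises both space-saving conventions: the second
count line uses "." for a zero count, and the third omits its trailing
counts altogether.</para>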
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1498
				1499	</sect2>
				1500
				1501	</sect1>
				1502	</chapter>