Blame - cachegrind/docs/cg_techdocs.html - fp2-dev/platform/external/valgrind


sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<style type="text/css">
				4	body { background-color: #ffffff;
				5	color: #000000;
				6	font-family: Times, Helvetica, Arial;
				7	font-size: 14pt}
				8	h4 { margin-bottom: 0.3em}
				9	code { color: #000000;
				10	font-family: Courier;
				11	font-size: 13pt }
				12	pre { color: #000000;
				13	font-family: Courier;
				14	font-size: 13pt }
				15	a:link { color: #0000C0;
				16	text-decoration: none; }
				17	a:visited { color: #0000C0;
				18	text-decoration: none; }
				19	a:active { color: #0000C0;
				20	text-decoration: none; }
				21	</style>
sewardj	f555ac7	2002-11-18 00:07:28 +0000	[diff] [blame]	22	<title>How Cachegrind works</title>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	23	</head>
				24
				25	<body bgcolor="#ffffff">
				26
sewardj	f555ac7	2002-11-18 00:07:28 +0000	[diff] [blame]	27	<a name="cg-techdocs"> </a>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	28	<h1 align=center>How Cachegrind works</h1>
				29
				30	<center>
				31	Detailed technical notes for hackers, maintainers and the
				32	overly-curious<br>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	33	<p>
sewardj	f555ac7	2002-11-18 00:07:28 +0000	[diff] [blame]	34	<a href="mailto:njn25@cam.ac.uk">njn25@cam.ac.uk</a><br>
				35	<a
				36	href="http://developer.kde.org/~sewardj">http://developer.kde.org/~sewardj</a><br>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	37	<p>
njn	0e1b514	2003-04-15 14:58:06 +0000	[diff] [blame]	38	Copyright © 2001-2003 Nick Nethercote
sewardj	f555ac7	2002-11-18 00:07:28 +0000	[diff] [blame]	39	<p>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	40	</center>
				41
				42	<p>
				43
				44
				45
				46
				47	<hr width="100%">
				48
				49	<h2>Cache profiling</h2>
				50	Valgrind is a very nice platform for doing cache profiling and other kinds of
				51	simulation, because it converts horrible x86 instructions into nice clean
				52	RISC-like UCode. For example, for cache profiling we are interested in
				53	instructions that read and write memory; in UCode there are only four
				54	instructions that do this: <code>LOAD</code>, <code>STORE</code>,
				55	<code>FPU_R</code> and <code>FPU_W</code>. By contrast, because of the x86
				56	addressing modes, almost every instruction can read or write memory.<p>
				57
				58	Most of the cache profiling machinery is in the file
				59	<code>vg_cachesim.c</code>.<p>
				60
				61	These notes are a somewhat haphazard guide to how Valgrind's cache profiling
				62	works.<p>
				63
				64	<h3>Cost centres</h3>
Valgrind gathers cache profiling information about every instruction executed,
individually. Each instruction has a <b>cost centre</b> associated with it.
There are two kinds of cost centre: one for instructions that don't reference
memory (<code>iCC</code>), and one for instructions that do
(<code>idCC</code>):
				70
				71	<pre>
				72	typedef struct _CC {
				73	ULong a;
				74	ULong m1;
				75	ULong m2;
				76	} CC;
				77
				78	typedef struct _iCC {
				79	/* word 1 */
				80	UChar tag;
				81	UChar instr_size;
				82
				83	/* words 2+ */
				84	Addr instr_addr;
				85	CC I;
				86	} iCC;
				87
				88	typedef struct _idCC {
				89	/* word 1 */
				90	UChar tag;
				91	UChar instr_size;
				92	UChar data_size;
				93
				94	/* words 2+ */
				95	Addr instr_addr;
				96	CC I;
				97	CC D;
				98	} idCC;
				99	</pre>
				100
Each <code>CC</code> has three fields <code>a</code>, <code>m1</code>,
<code>m2</code> for recording references, level 1 misses and level 2 misses.
Each of these is a 64-bit <code>ULong</code> -- the numbers can get very large,
i.e. greater than the roughly 4.29 billion that a 32-bit unsigned int can
hold.<p>
				105
An <code>iCC</code> has one <code>CC</code> for instruction cache accesses. An
<code>idCC</code> has two, one for instruction cache accesses, and one for data
cache accesses.<p>

The <code>iCC</code> and <code>idCC</code> structs also store unchanging
information about the instruction:
				112	<ul>
				113	<li>An instruction-type identification tag (explained below)</li><p>
				114	<li>Instruction size</li><p>
				115	<li>Data reference size (<code>idCC</code> only)</li><p>
				116	<li>Instruction address</li><p>
				117	</ul>
				118
Note that data address is not one of the fields for <code>idCC</code>. This is
because for many memory-referencing instructions the data address can change
each time it's executed (eg. if it uses register-offset addressing). We have
to give this item to the cache simulation in a different way (see
Instrumentation section below). Some memory-referencing instructions do always
reference the same address, but we don't try to treat them specially in order to
keep things simple.<p>
				126
				127	Also note that there is only room for recording info about one data cache
				128	access in an <code>idCC</code>. So what about instructions that do a read then
				129	a write, such as:
				130
<blockquote><code>incl (%esi)</code></blockquote>
				132
				133	In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
				134	since it immediately follows the read which will drag the block into the cache
				135	if it's not already there. So the write access isn't really interesting, and
				136	Valgrind doesn't record it. This means that Valgrind doesn't measure
				137	memory references, but rather memory references that could miss in the cache.
				138	This behaviour is the same as that used by the AMD Athlon hardware counters.
				139	It also has the benefit of simplifying the implementation -- instructions that
				140	read and write memory can be treated like instructions that read memory.<p>
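The argument that the write half of a read-modify-write cannot miss can be checked with a toy model. The following is a minimal sketch, not Valgrind's simulator: it assumes a hypothetical direct-mapped cache of 64 lines of 32 bytes, and the names <code>ToyCache</code>, <code>toy_access</code> and <code>toy_rmw_write_misses</code> are invented for illustration. It also assumes the access does not straddle two cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical geometry: 64 lines of 32 bytes, direct-mapped. */
enum { LINE_BITS = 5, N_LINES = 64 };

typedef struct {
    uint32_t tag[N_LINES];    /* which block each line currently holds */
    bool     valid[N_LINES];  /* whether the line holds anything at all */
} ToyCache;

/* Returns true on a miss. A miss always brings the block in, for reads
 * and writes alike; that is the write-allocate policy. */
static bool toy_access(ToyCache *c, uint32_t addr)
{
    assert(c != NULL);
    uint32_t block = addr >> LINE_BITS;
    uint32_t idx   = block % N_LINES;
    if (c->valid[idx] && c->tag[idx] == block)
        return false;
    c->valid[idx] = true;
    c->tag[idx]   = block;
    return true;
}

/* Model a read-modify-write instruction: the read may miss, but the write
 * that immediately follows finds the block already present. */
static bool toy_rmw_write_misses(ToyCache *c, uint32_t addr)
{
    (void)toy_access(c, addr);   /* read half: hit or miss */
    return toy_access(c, addr);  /* write half: always a hit */
}
```

Whatever state the cache starts in, <code>toy_rmw_write_misses</code> returns false, which is why recording only the read loses no miss information.<p>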
				141
				142	<h3>Storing cost-centres</h3>
Cost centres are stored in a way that makes them very cheap to look up, which is
important since one is looked up for every original x86 instruction
executed.<p>

Valgrind does JIT translations at the basic block level, and cost centres are
also set up and stored at the basic block level. By doing things carefully, we
store all the cost centres for a basic block in a contiguous array, and lookup
comes almost for free.<p>
				151
				152	Consider this part of a basic block (for exposition purposes, pretend it's an
				153	entire basic block):
				154
				155	<pre>
				156	movl $0x0,%eax
				157	movl $0x99, -4(%ebp)
				158	</pre>
				159
				160	The translation to UCode looks like this:
				161
				162	<pre>
				163	MOVL $0x0, t20
				164	PUTL t20, %EAX
				165	INCEIPo $5
				166
				167	LEA1L -4(t4), t14
				168	MOVL $0x99, t18
				169	STL t18, (t14)
				170	INCEIPo $7
				171	</pre>
				172
				173	The first step is to allocate the cost centres. This requires a preliminary
				174	pass to count how many x86 instructions were in the basic block, and their
				175	types (and thus sizes). UCode translations for single x86 instructions are
				176	delimited by the <code>INCEIPo</code> instruction, the argument of which gives
				177	the byte size of the instruction (note that lazy INCEIP updating is turned off
				178	to allow this).<p>
				179
				180	We can tell if an x86 instruction references memory by looking for
				181	<code>LDL</code> and <code>STL</code> UCode instructions, and thus what kind of
				182	cost centre is required. From this we can determine how many cost centres we
				183	need for the basic block, and their sizes. We can then allocate them in a
				184	single array.<p>
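The sizing step can be sketched as follows. This is an illustration under stated assumptions rather than the code in <code>vg_cachesim.c</code>: the preliminary pass is assumed to have already produced one boolean per x86 instruction saying whether it references memory, the Valgrind types are replaced by fixed-width C99 types (<code>Addr</code> is taken to be 32 bits), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for the structs shown earlier, with Addr assumed 32 bits wide. */
typedef struct { uint64_t a, m1, m2; } CC;
typedef struct { uint8_t tag, instr_size; uint32_t instr_addr; CC I; } iCC;
typedef struct { uint8_t tag, instr_size, data_size;
                 uint32_t instr_addr; CC I; CC D; } idCC;

/* refs_memory[i] says whether the i-th x86 instruction of the basic block
 * references memory, i.e. whether it needs an idCC rather than an iCC. */
static size_t cc_array_bytes(const bool *refs_memory, size_t n_instrs)
{
    size_t total = 0;
    for (size_t i = 0; i < n_instrs; i++)
        total += refs_memory[i] ? sizeof(idCC) : sizeof(iCC);
    return total;
}

/* Allocate one zeroed, contiguous array for the whole basic block.
 * Returns NULL if allocation fails. */
static void *cc_array_alloc(const bool *refs_memory, size_t n_instrs,
                            size_t *bytes_out)
{
    assert(bytes_out != NULL);
    *bytes_out = cc_array_bytes(refs_memory, n_instrs);
    return calloc(1, *bytes_out);
}
```

On common x86 ABIs these stand-in structs occupy 32 and 56 bytes, so the two-instruction example needs 88 bytes in total.<p>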
				185
Consider the example code above. After the preliminary pass, we know we need
two cost centres, one <code>iCC</code> and one <code>idCC</code>. So we
allocate an array to store these which looks like this:
				189
				190	<pre>
|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)

|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)
|(uninit)|      D.a         (8 bytes)
|(uninit)|      D.m1        (8 bytes)
|(uninit)|      D.m2        (8 bytes)
				210	</pre>
				211
				212	(We can see now why we need tags to distinguish between the two types of cost
				213	centres.)<p>
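Because the two kinds of cost centre have different sizes, anything that walks the array has to read the tag first to know how far to step. The sketch below shows one way to do that while summing totals. It is illustrative only: the tag values and the <code>READ_CC</code> name are assumptions (the notes only show <code>INSTR_CC</code> and <code>WRITE_CC</code>), the structs use fixed-width stand-ins for the Valgrind types, and <code>cc_array_walk</code> is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t a, m1, m2; } CC;
typedef struct { uint8_t tag, instr_size; uint32_t instr_addr; CC I; } iCC;
typedef struct { uint8_t tag, instr_size, data_size;
                 uint32_t instr_addr; CC I; CC D; } idCC;

/* Assumed tag values; only the distinction iCC versus idCC matters here. */
enum { INSTR_CC = 0, READ_CC = 1, WRITE_CC = 2 };

typedef struct { CC I; CC D; } Totals;

static void cc_add(CC *dst, const CC *src)
{
    dst->a  += src->a;
    dst->m1 += src->m1;
    dst->m2 += src->m2;
}

/* Walk a mixed array of iCC/idCC records, adding their counts to *t.
 * Returns the number of cost centres visited. memcpy is used so the byte
 * buffer needs no particular alignment. */
static size_t cc_array_walk(const unsigned char *array, size_t bytes, Totals *t)
{
    assert(array != NULL && t != NULL);
    size_t n = 0, off = 0;
    while (off < bytes) {
        uint8_t tag = array[off];          /* tag is the first byte of both */
        if (tag == INSTR_CC) {
            iCC cc;
            assert(off + sizeof cc <= bytes);
            memcpy(&cc, array + off, sizeof cc);
            cc_add(&t->I, &cc.I);
            off += sizeof cc;
        } else {
            idCC cc;
            assert(tag == READ_CC || tag == WRITE_CC);
            assert(off + sizeof cc <= bytes);
            memcpy(&cc, array + off, sizeof cc);
            cc_add(&t->I, &cc.I);
            cc_add(&t->D, &cc.D);
            off += sizeof cc;
        }
        n++;
    }
    return n;
}
```

The same kind of traversal is what allows total counts to be computed once at the end rather than on every simulated access.<p>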
				214
				215	We also record the size of the array. We look up the debug info of the first
				216	instruction in the basic block, and then stick the array into a table indexed
				217	by filename and function name. This makes it easy to dump the information
				218	quickly to file at the end.<p>
				219
				220	<h3>Instrumentation</h3>
				221	The instrumentation pass has two main jobs:
				222
				223	<ol>
				224	<li>Fill in the gaps in the allocated cost centres.</li><p>
				225	<li>Add UCode to call the cache simulator for each instruction.</li><p>
				226	</ol>
				227
The instrumentation pass steps through the UCode and the cost centres in
tandem. As each original x86 instruction's UCode is processed, the appropriate
gaps in the instruction's cost centre are filled in, for example:
				231
				232	<pre>
|INSTR_CC|      tag         (1 byte)
|5       |      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|i_addr1 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)

|WRITE_CC|      tag         (1 byte)
|7       |      instr_size  (1 byte)
|4       |      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|i_addr2 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)
|0       |      D.a         (8 bytes)
|0       |      D.m1        (8 bytes)
|0       |      D.m2        (8 bytes)
				252	</pre>
				253
				254	(Note that this step is not performed if a basic block is re-translated; see
				255	<a href="#retranslations">here</a> for more information.)<p>
				256
GCC inserts padding before the <code>instr_addr</code> field so that it is word
aligned.<p>
				259
				260	The instrumentation added to call the cache simulation function looks like this
				261	(instrumentation is indented to distinguish it from the original UCode):
				262
				263	<pre>
MOVL      $0x0, t20
PUTL      t20, %EAX
    PUSHL     %eax
    PUSHL     %ecx
    PUSHL     %edx
    MOVL      $0x4091F8A4, t46  # address of 1st CC
    PUSHL     t46
    CALLMo    $0x12             # first cachesim function
    CLEARo    $0x4
    POPL      %edx
    POPL      %ecx
    POPL      %eax
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
    MOVL      t14, t42
STL       t18, (t14)
    PUSHL     %eax
    PUSHL     %ecx
    PUSHL     %edx
    PUSHL     t42
    MOVL      $0x4091F8C4, t44  # address of 2nd CC
    PUSHL     t44
    CALLMo    $0x13             # second cachesim function
    CLEARo    $0x8
    POPL      %edx
    POPL      %ecx
    POPL      %eax
INCEIPo   $7
				294	</pre>
				295
				296	Consider the first instruction's UCode. Each call is surrounded by three
				297	<code>PUSHL</code> and <code>POPL</code> instructions to save and restore the
				298	caller-save registers. Then the address of the instruction's cost centre is
				299	pushed onto the stack, to be the first argument to the cache simulation
				300	function. The address is known at this point because we are doing a
				301	simultaneous pass through the cost centre array. This means the cost centre
				302	lookup for each instruction is almost free (just the cost of pushing an
				303	argument for a function call). Then the call to the cache simulation function
				304	for non-memory-reference instructions is made (note that the
				305	<code>CALLMo</code> UInstruction takes an offset into a table of predefined
				306	functions; it is not an absolute address), and the single argument is
				307	<code>CLEAR</code>ed from the stack.<p>
				308
				309	The second instruction's UCode is similar. The only difference is that, as
				310	mentioned before, we have to pass the address of the data item referenced to
				311	the cache simulation function too. This explains the <code>MOVL t14,
				312	t42</code> and <code>PUSHL t42</code> UInstructions. (Note that the seemingly
				313	redundant <code>MOV</code>ing will probably be optimised away during register
				314	allocation.)<p>
				315
				316	Note that instead of storing unchanging information about each instruction
				317	(instruction size, data size, etc) in its cost centre, we could have passed in
				318	these arguments to the simulation function. But this would slow the calls down
				319	(two or three extra arguments pushed onto the stack). Also it would bloat the
				320	UCode instrumentation by amounts similar to the space required for them in the
				321	cost centre; bloated UCode would also fill the translation cache more quickly,
				322	requiring more translations for large programs and slowing them down more.<p>
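The trade-off described above can be made concrete with a sketch of what the two called functions might look like. This is an assumption-laden illustration, not the real code: <code>log_instr</code> and <code>log_instr_data</code> are hypothetical names, the structs use fixed-width stand-ins for the Valgrind types, and the simulator is replaced by a stub that only remembers the last reference it was given and bumps the access count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t a, m1, m2; } CC;
typedef struct { uint8_t tag, instr_size; uint32_t instr_addr; CC I; } iCC;
typedef struct { uint8_t tag, instr_size, data_size;
                 uint32_t instr_addr; CC I; CC D; } idCC;

/* Stub standing in for the real cache simulator. */
typedef struct { uint32_t addr; uint8_t size; } LastRef;
static LastRef last_instr_ref, last_data_ref;

static void stub_cache_ref(LastRef *seen, uint32_t addr, uint8_t size, CC *cc)
{
    seen->addr = addr;
    seen->size = size;
    cc->a++;            /* a real simulator would also update m1 and m2 */
}

/* One argument: everything else comes out of the cost centre itself. */
static void log_instr(iCC *cc)
{
    assert(cc != NULL);
    stub_cache_ref(&last_instr_ref, cc->instr_addr, cc->instr_size, &cc->I);
}

/* Two arguments: the data address changes per execution, so it cannot be
 * stored in the cost centre and has to be passed in. */
static void log_instr_data(idCC *cc, uint32_t data_addr)
{
    assert(cc != NULL);
    stub_cache_ref(&last_instr_ref, cc->instr_addr, cc->instr_size, &cc->I);
    stub_cache_ref(&last_data_ref, data_addr, cc->data_size, &cc->D);
}
```

The alternative design would pass the instruction address and the sizes as extra arguments on every call, which is the cost the cost-centre layout avoids.<p>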
				323
				324	<a name="retranslations"></a>
				325	<h3>Handling basic block retranslations</h3>
				326	The above description ignores one complication. Valgrind has a limited size
				327	cache for basic block translations; if it fills up, old translations are
				328	discarded. If a discarded basic block is executed again, it must be
				329	re-translated.<p>
				330
				331	However, we can't use this approach for profiling -- we can't throw away cost
				332	centres for instructions in the middle of execution! So when a basic block is
				333	translated, we first look for its cost centre array in the hash table. If
				334	there is no cost centre array, it must be the first translation, so we proceed
				335	as described above. But if there is a cost centre array already, it must be a
				336	retranslation. In this case, we skip the cost centre allocation and
				337	initialisation steps, but still do the UCode instrumentation step.<p>
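The lookup-before-allocate rule can be sketched like this. The sketch assumes a small chained hash table keyed by the basic block's start address; the key choice, the table size and the names <code>BBCC</code> and <code>bbcc_get</code> are all hypothetical and chosen only to illustrate the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { N_BUCKETS = 1024 };   /* assumed table size */

typedef struct BBCC {
    uint32_t     bb_addr;    /* start address of the basic block (assumed key) */
    void        *cc_array;   /* the block's cost centre array */
    size_t       bytes;
    struct BBCC *next;
} BBCC;

static BBCC *buckets[N_BUCKETS];

/* Return the cost centre array for a basic block. On the first translation
 * a zeroed array is allocated and *first_translation is set to true, so the
 * caller knows to fill in tags, sizes and addresses. On a retranslation the
 * existing array, with its accumulated counts, is returned untouched.
 * Returns NULL if allocation fails. */
static void *bbcc_get(uint32_t bb_addr, size_t bytes, bool *first_translation)
{
    assert(first_translation != NULL && bytes > 0);
    BBCC **head = &buckets[bb_addr % N_BUCKETS];

    for (BBCC *n = *head; n != NULL; n = n->next) {
        if (n->bb_addr == bb_addr) {
            *first_translation = false;
            return n->cc_array;
        }
    }

    BBCC *n     = malloc(sizeof *n);
    void *array = calloc(1, bytes);
    if (n == NULL || array == NULL) {
        free(n);
        free(array);
        *first_translation = false;
        return NULL;
    }
    n->bb_addr  = bb_addr;
    n->cc_array = array;
    n->bytes    = bytes;
    n->next     = *head;
    *head       = n;
    *first_translation = true;
    return array;
}
```

In both cases the caller still emits the instrumentation UCode; only the allocation and initialisation are skipped when the flag comes back false.<p>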
				338
				339	<h3>The cache simulation</h3>
				340	The cache simulation is fairly straightforward. It just tracks which memory
				341	blocks are in the cache at the moment (it doesn't track the contents, since
				342	that is irrelevant).<p>
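Tracking only which blocks are present, and not their contents, needs very little state. The following is a minimal sketch of a set-associative cache with LRU replacement; the geometry (4 sets, 2 ways, 64-byte lines), the replacement policy and all names are assumptions made for illustration and are not taken from the simulator files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical geometry: 4 sets, 2 ways, 64-byte lines. */
enum { LINE_BITS = 6, N_SETS = 4, ASSOC = 2 };

typedef struct {
    /* Per set, block numbers ordered from most to least recently used. */
    uint32_t tags[N_SETS][ASSOC];
    bool     valid[N_SETS][ASSOC];
} SetCache;

/* Returns true on a miss. No data is stored: only block numbers. */
static bool set_cache_access(SetCache *c, uint32_t addr)
{
    assert(c != NULL);
    uint32_t  block = addr >> LINE_BITS;
    uint32_t *tags  = c->tags[block % N_SETS];
    bool     *valid = c->valid[block % N_SETS];

    int hit = -1;
    for (int i = 0; i < ASSOC; i++) {
        if (valid[i] && tags[i] == block) {
            hit = i;
            break;
        }
    }

    /* Move the accessed block to slot 0 (most recently used). On a miss the
     * shift starts at the last slot, which evicts the least recently used. */
    int from = (hit >= 0) ? hit : ASSOC - 1;
    for (int i = from; i > 0; i--) {
        tags[i]  = tags[i - 1];
        valid[i] = valid[i - 1];
    }
    tags[0]  = block;
    valid[0] = true;
    return hit < 0;
}
```

With this geometry, addresses 0, 256 and 512 all map to set 0, so touching the three of them in turn is enough to force an eviction.<p>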
				343
The interface to the simulation is quite clean. The functions called from the
UCode contain calls to the simulation functions in the files
<code>vg_cachesim_{I1,D1,L2}.c</code>; these calls are inlined so that only
one function call is done per simulated x86 instruction. The file
<code>vg_cachesim.c</code> simply <code>#include</code>s the three files
containing the simulation, which makes plugging in new cache simulations
very easy -- you just replace the three files and recompile.<p>
				351
				352	<h3>Output</h3>
				353	Output is fairly straightforward, basically printing the cost centre for every
				354	instruction, grouped by files and functions. Total counts (eg. total cache
				355	accesses, total L1 misses) are calculated when traversing this structure rather
				356	than during execution, to save time; the cache simulation functions are called
				357	so often that even one or two extra adds can make a sizeable difference.<p>
				358
The output file, which is the input to cg_annotate, has the following format:
				360
				361	<pre>
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."
				372	</pre>
				373
				374	Where:
				375
				376	<ul>
				377	<li><code>non_nl_string</code> is any string not containing a newline.</li><p>
				378	<li><code>cmd</code> is a command line invocation.</li><p>
				379	<li><code>filename</code> and <code>fn_name</code> can be anything.</li><p>
				380	<li><code>num</code> and <code>line_num</code> are decimal numbers.</li><p>
				381	<li><code>ws</code> is whitespace.</li><p>
				382	<li><code>nl</code> is a newline.</li><p>
				383	</ul>
				384
The contents of the "desc:" lines are printed out at the top of the summary.
This is a generic way of providing simulation-specific information, eg. for
giving the cache configuration for cache simulation.<p>
				388
				389	Counts can be "." to represent "N/A", eg. the number of write misses for an
				390	instruction that doesn't write to memory.<p>
				391
The number of counts in each <code>count_line</code> and the
<code>summary_line</code> should not exceed the number of events in the
<code>events_line</code>. If the number in a <code>count_line</code> is less,
cg_annotate treats those missing as though they were a "." entry. <p>
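The rules for "." and for missing trailing counts can be sketched as a small parser for a single <code>count_line</code>. This is an illustration in C of the rule as stated here, not a description of how cg_annotate (a Perl script) is written; the function name and the <code>present</code> array are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static bool at_field_end(char c)
{
    return c == '\0' || c == '\n' || c == ' ' || c == '\t';
}

/* Parse "line_num count count ..." against n_events events.
 * present[i] is false where the count was "." or missing altogether.
 * Returns false if the line is not a count_line or has too many counts. */
static bool parse_count_line(const char *line, size_t n_events,
                             unsigned long *line_num,
                             unsigned long long *counts, bool *present)
{
    assert(line && line_num && counts && present);
    for (size_t i = 0; i < n_events; i++) {
        counts[i]  = 0;
        present[i] = false;     /* default: treated like "." */
    }

    if (!isdigit((unsigned char)line[0]))
        return false;
    char *end;
    *line_num = strtoul(line, &end, 10);
    if (!at_field_end(*end))
        return false;

    const char *p = end;
    size_t n = 0;
    for (;;) {
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\0' || *p == '\n')
            return true;        /* fewer counts than events is allowed */
        if (n == n_events)
            return false;       /* more counts than events is not */
        if (*p == '.') {
            p++;
        } else if (isdigit((unsigned char)*p)) {
            counts[n]  = strtoull(p, &end, 10);
            present[n] = true;
            p = end;
        } else {
            return false;
        }
        if (!at_field_end(*p))
            return false;
        n++;
    }
}
```

For a file with four events, the line <code>12 3 . 7</code> would yield line number 12 with the first and third counts present and the second and fourth treated as not applicable.<p>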
				396
A <code>file_line</code> changes the current file name. A <code>fn_line</code>
changes the current function name. A <code>count_line</code> contains counts
that pertain to the current filename/fn_name. A "fl=" <code>file_line</code>
and a <code>fn_line</code> must appear before any <code>count_line</code>s to
give the context of the first <code>count_line</code>s.<p>
				402
Each <code>file_line</code> should be immediately followed by a
<code>fn_line</code>. "fi=" <code>file_line</code>s are used to switch
filenames for inlined functions; "fe=" <code>file_line</code>s are similar, but
are put at the end of a basic block in which the file name hasn't been switched
back to the original file name. (fi and fe lines behave the same; they are
only distinguished to help debugging.)<p>
				409
				410
				411	<h3>Summary of performance features</h3>
				412	Quite a lot of work has gone into making the profiling as fast as possible.
				413	This is a summary of the important features:
				414
				415	<ul>
				416	<li>The basic block-level cost centre storage allows almost free cost centre
				417	lookup.</li><p>
				418
				419	<li>Only one function call is made per instruction simulated; even this
				420	accounts for a sizeable percentage of execution time, but it seems
				421	unavoidable if we want flexibility in the cache simulator.</li><p>
				422
				423	<li>Unchanging information about an instruction is stored in its cost centre,
				424	avoiding unnecessary argument pushing, and minimising UCode
				425	instrumentation bloat.</li><p>
				426
				427	<li>Summary counts are calculated at the end, rather than during
				428	execution.</li><p>
				429
				430	<li>The <code>cachegrind.out</code> output files can contain huge amounts of
				431	information; file format was carefully chosen to minimise file
				432	sizes.</li><p>
				433	</ul>
				434
				435
				436	<h3>Annotation</h3>
				437	Annotation is done by cg_annotate. It is a fairly straightforward Perl script
				438	that slurps up all the cost centres, and then runs through all the chosen
				439	source files, printing out cost centres with them. It too has been carefully
				440	optimised.
				441
				442
				443	<h3>Similar work, extensions</h3>
				444	It would be relatively straightforward to do other simulations and obtain
				445	line-by-line information about interesting events. A good example would be
				446	branch prediction -- all branches could be instrumented to interact with a
				447	branch prediction simulator, using very similar techniques to those described
				448	above.<p>
				449
				450	In particular, cg_annotate would not need to change -- the file format is such
				451	that it is not specific to the cache simulation, but could be used for any kind
				452	of line-by-line information. The only part of cg_annotate that is specific to
				453	the cache simulation is the name of the input file
				454	(<code>cachegrind.out</code>), although it would be very simple to add an
				455	option to control this.<p>
				456
				457	</body>
				458	</html>