<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
          "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="cg-tech-docs" xreflabel="How Cachegrind works">

<title>How Cachegrind works</title>

<sect1 id="cg-tech-docs.profiling" xreflabel="Cache profiling">
<title>Cache profiling</title>

<para>[Note: this document is now very old, and much of its content is
out of date and misleading.]</para>

<para>Valgrind is a very nice platform for doing cache profiling
and other kinds of simulation, because it converts horrible x86
instructions into nice clean RISC-like UCode.  For example, for
cache profiling we are interested in instructions that read and
write memory; in UCode there are only four instructions that do
this: <computeroutput>LOAD</computeroutput>,
<computeroutput>STORE</computeroutput>,
<computeroutput>FPU_R</computeroutput> and
<computeroutput>FPU_W</computeroutput>.  By contrast, because of
the x86 addressing modes, almost every instruction can read or
write memory.</para>

<para>Most of the cache profiling machinery is in the file
<filename>vg_cachesim.c</filename>.</para>

<para>These notes are a somewhat haphazard guide to how
Valgrind's cache profiling works.</para>

</sect1>


<sect1 id="cg-tech-docs.costcentres" xreflabel="Cost centres">
<title>Cost centres</title>

<para>Valgrind gathers cache profiling information about every
instruction executed, individually.  Each instruction has a
<command>cost centre</command> associated with it.  There are two
kinds of cost centre: one for instructions that don't reference
memory (<computeroutput>iCC</computeroutput>), and one for
instructions that do
(<computeroutput>idCC</computeroutput>):</para>

<programlisting><![CDATA[
typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC; ]]></programlisting>

<para>Each <computeroutput>CC</computeroutput> has three fields
<computeroutput>a</computeroutput>,
<computeroutput>m1</computeroutput>,
<computeroutput>m2</computeroutput> for recording references,
level 1 misses and level 2 misses.  Each of these is a 64-bit
<computeroutput>ULong</computeroutput> -- the numbers can get
very large, ie. greater than the 4.2 billion allowed by a 32-bit
unsigned int.</para>

<para>An <computeroutput>iCC</computeroutput> has one
<computeroutput>CC</computeroutput> for instruction cache
accesses.  An <computeroutput>idCC</computeroutput> has two, one
for instruction cache accesses, and one for data cache
accesses.</para>
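
<para>To make the meaning of the three counters concrete, the following
sketch shows how one <computeroutput>CC</computeroutput> might be bumped
once the simulation has classified an access.  The
<computeroutput>SimResult</computeroutput> codes and the helper name are
illustrative only; the real code in <filename>vg_cachesim.c</filename>
is structured differently.</para>

<programlisting><![CDATA[
typedef enum { Hit, MissL1, MissL2 } SimResult;   /* hypothetical */

/* Sketch: update one cost-centre counter triple. */
static void update_CC ( CC* cc, SimResult res )
{
   cc->a++;                        /* every access is counted      */
   if (res != Hit)    cc->m1++;    /* missed in the level 1 cache  */
   if (res == MissL2) cc->m2++;    /* also missed in the level 2   */
}]]></programlisting>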

<para>The <computeroutput>iCC</computeroutput> and
<computeroutput>idCC</computeroutput> structs also store
unchanging information about the instruction:</para>
<itemizedlist>
  <listitem>
    <para>An instruction-type identification tag (explained
    below)</para>
  </listitem>
  <listitem>
    <para>Instruction size</para>
  </listitem>
  <listitem>
    <para>Data reference size
    (<computeroutput>idCC</computeroutput> only)</para>
  </listitem>
  <listitem>
    <para>Instruction address</para>
  </listitem>
</itemizedlist>

<para>Note that data address is not one of the fields for
<computeroutput>idCC</computeroutput>.  This is because for many
memory-referencing instructions the data address can change each
time it's executed (eg. if it uses register-offset addressing).
We have to give this item to the cache simulation in a different
way (see the Instrumentation section below).  Some memory-referencing
instructions do always reference the same address, but we don't
try to treat them specially, in order to keep things simple.</para>

<para>Also note that there is only room for recording info about
one data cache access in an
<computeroutput>idCC</computeroutput>.  So what about
instructions that do a read then a write, such as:</para>
<programlisting><![CDATA[
inc (%esi)]]></programlisting>

<para>In a write-allocate cache, as simulated by Valgrind, the
write cannot miss, since it immediately follows the read which
will drag the block into the cache if it's not already there.  So
the write access isn't really interesting, and Valgrind doesn't
record it.  This means that Valgrind doesn't measure memory
references, but rather memory references that could miss in the
cache.  This behaviour is the same as that used by the AMD Athlon
hardware counters.  It also has the benefit of simplifying the
implementation -- instructions that read and write memory can be
treated like instructions that read memory.</para>

</sect1>


<sect1 id="cg-tech-docs.ccstore" xreflabel="Storing cost-centres">
<title>Storing cost-centres</title>

<para>Cost centres are stored in a way that makes them very cheap
to look up, which is important since one is looked up for every
original x86 instruction executed.</para>

<para>Valgrind does JIT translations at the basic block level,
and cost centres are also set up and stored at the basic block
level.  By doing things carefully, we store all the cost centres
for a basic block in a contiguous array, and lookup comes almost
for free.</para>

<para>Consider this part of a basic block (for exposition
purposes, pretend it's an entire basic block):</para>
<programlisting><![CDATA[
movl $0x0,%eax
movl $0x99, -4(%ebp)]]></programlisting>

<para>The translation to UCode looks like this:</para>
<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
STL t18, (t14)
INCEIPo $7]]></programlisting>

<para>The first step is to allocate the cost centres.  This
requires a preliminary pass to count how many x86 instructions
were in the basic block, and their types (and thus sizes).  UCode
translations for single x86 instructions are delimited by the
<computeroutput>INCEIPo</computeroutput> instruction, the
argument of which gives the byte size of the instruction (note
that lazy INCEIP updating is turned off to allow this).</para>

<para>We can tell if an x86 instruction references memory by
looking for <computeroutput>LDL</computeroutput> and
<computeroutput>STL</computeroutput> UCode instructions, and thus
determine what kind of cost centre is required.  From this we can
work out how many cost centres we need for the basic block, and
their sizes.  We can then allocate them in a single array.</para>
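
<para>Since the cost centres for a basic block are packed into one
contiguous array, the size of that array is just the sum of the sizes of
the individual cost centres.  A minimal sketch (not the real code; the
instruction counts come from the preliminary pass described above):</para>

<programlisting><![CDATA[
/* Sketch only: total bytes of cost centres needed for one basic block,
   given how many of its x86 instructions do / don't reference memory. */
static Int BB_CC_array_size ( Int n_non_mem_instrs, Int n_mem_instrs )
{
   return n_non_mem_instrs * sizeof(iCC) + n_mem_instrs * sizeof(idCC);
}]]></programlisting>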

<para>Consider the example code above.  After the preliminary
pass, we know we need two cost centres, one
<computeroutput>iCC</computeroutput> and one
<computeroutput>idCC</computeroutput>.  So we allocate an array to
store these which looks like this:</para>

<programlisting><![CDATA[
|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)

|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)
|(uninit)| D.a         (8 bytes)
|(uninit)| D.m1        (8 bytes)
|(uninit)| D.m2        (8 bytes)]]></programlisting>

<para>(We can see now why we need tags to distinguish between the
two types of cost centres.)</para>

<para>We also record the size of the array.  We look up the debug
info of the first instruction in the basic block, and then stick
the array into a table indexed by filename and function name.
This makes it easy to dump the information quickly to file at the
end.</para>
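
<para>Conceptually, the per-basic-block record held in that table carries
the key and the array, along these lines.  This is a sketch with made-up
field and struct names; the real data structure in
<filename>vg_cachesim.c</filename> is not identical.</para>

<programlisting><![CDATA[
/* Sketch: one entry in the (filename, function) indexed table. */
typedef struct _BB_CC_record {
   Char* filename;       /* from the debug info of the first instr  */
   Char* fn_name;
   Int   array_size;     /* in bytes                                */
   void* array;          /* the contiguous iCC/idCC cost centres    */
   struct _BB_CC_record* next;   /* hash-chain link, hypothetical   */
} BB_CC_record;]]></programlisting>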

</sect1>


<sect1 id="cg-tech-docs.instrum" xreflabel="Instrumentation">
<title>Instrumentation</title>

<para>The instrumentation pass has two main jobs:</para>

<orderedlist>
  <listitem>
    <para>Fill in the gaps in the allocated cost centres.</para>
  </listitem>
  <listitem>
    <para>Add UCode to call the cache simulator for each
    instruction.</para>
  </listitem>
</orderedlist>

<para>The instrumentation pass steps through the UCode and the
cost centres in tandem.  As each original x86 instruction's UCode
is processed, the appropriate gaps in the instruction's cost
centre are filled in, for example:</para>

<programlisting><![CDATA[
|INSTR_CC| tag         (1 byte)
|5       | instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|i_addr1 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)

|WRITE_CC| tag         (1 byte)
|7       | instr_size  (1 byte)
|4       | data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|i_addr2 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)
|0       | D.a         (8 bytes)
|0       | D.m1        (8 bytes)
|0       | D.m2        (8 bytes)]]></programlisting>

<para>(Note that this step is not performed if a basic block is
re-translated; see <xref linkend="cg-tech-docs.retranslations"/> for
more information.)</para>

<para>GCC inserts the padding after the
<computeroutput>instr_size</computeroutput> (and
<computeroutput>data_size</computeroutput>) fields so that the
<computeroutput>instr_addr</computeroutput> field is word
aligned.</para>

<para>The instrumentation added to call the cache simulation
function looks like this (instrumentation is indented to
distinguish it from the original UCode):</para>

<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   MOVL $0x4091F8A4, t46  # address of 1st CC
   PUSHL t46
   CALLMo $0x12           # cachesim function for non-mem-ref instrs
   CLEARo $0x4
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
   MOVL t14, t42
STL t18, (t14)
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   PUSHL t42
   MOVL $0x4091F8C4, t44  # address of 2nd CC
   PUSHL t44
   CALLMo $0x13           # cachesim function for mem-ref instrs
   CLEARo $0x8
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $7]]></programlisting>

<para>Consider the first instruction's UCode.  Each call is
surrounded by three <computeroutput>PUSHL</computeroutput> and
<computeroutput>POPL</computeroutput> instructions to save and
restore the caller-save registers.  Then the address of the
instruction's cost centre is pushed onto the stack, to be the
first argument to the cache simulation function.  The address is
known at this point because we are doing a simultaneous pass
through the cost centre array.  This means the cost centre lookup
for each instruction is almost free (just the cost of pushing an
argument for a function call).  Then the call to the cache
simulation function for non-memory-reference instructions is made
(note that the <computeroutput>CALLMo</computeroutput>
UInstruction takes an offset into a table of predefined
functions; it is not an absolute address), and the single
argument is <computeroutput>CLEAR</computeroutput>ed from the
stack.</para>

<para>The second instruction's UCode is similar.  The only
difference is that, as mentioned before, we have to pass the
address of the data item referenced to the cache simulation
function too.  This explains the <computeroutput>MOVL t14,
t42</computeroutput> and <computeroutput>PUSHL
t42</computeroutput> UInstructions.  (Note that the seemingly
redundant <computeroutput>MOV</computeroutput>ing will probably
be optimised away during register allocation.)</para>
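
<para>Putting the pieces together, the two functions reached through
<computeroutput>CALLMo</computeroutput> have roughly the following
shape.  This is a sketch only -- the names, and the way the individual
cache simulations are invoked, are assumptions rather than the actual
code in <filename>vg_cachesim.c</filename>.</para>

<programlisting><![CDATA[
/* Sketch: called for an x86 instruction that doesn't reference memory.
   Its cost centre address was pushed as the only argument. */
static void log_non_mem_instr ( iCC* cc )
{
   cc->I.a++;
   /* simulate the I1/L2 access for the instruction's own bytes,
      bumping cc->I.m1 / cc->I.m2 on misses */
}

/* Sketch: called for an x86 instruction that does reference memory.
   The data address is only known at run time, so it is passed as a
   second argument rather than being stored in the cost centre. */
static void log_mem_instr ( idCC* cc, Addr data_addr )
{
   cc->I.a++;
   /* simulate the instruction fetch, as above */
   cc->D.a++;
   /* simulate the D1/L2 access for (data_addr, cc->data_size),
      bumping cc->D.m1 / cc->D.m2 on misses */
}]]></programlisting>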

<para>Note that instead of storing unchanging information about
each instruction (instruction size, data size, etc.) in its cost
centre, we could have passed these in as arguments to the
simulation function.  But this would slow the calls down (two or
three extra arguments pushed onto the stack).  Also it would
bloat the UCode instrumentation by amounts similar to the space
required for them in the cost centre; bloated UCode would also
fill the translation cache more quickly, requiring more
translations for large programs and slowing them down
more.</para>

</sect1>


<sect1 id="cg-tech-docs.retranslations"
       xreflabel="Handling basic block retranslations">
<title>Handling basic block retranslations</title>

<para>The above description ignores one complication.  Valgrind
has a limited-size cache for basic block translations; if it
fills up, old translations are discarded.  If a discarded basic
block is executed again, it must be re-translated.</para>

<para>However, we can't use this approach for profiling -- we
can't throw away cost centres for instructions in the middle of
execution!  So when a basic block is translated, we first look
for its cost centre array in the hash table.  If there is no cost
centre array, it must be the first translation, so we proceed as
described above.  But if there is a cost centre array already, it
must be a retranslation.  In this case, we skip the cost centre
allocation and initialisation steps, but still do the UCode
instrumentation step.</para>
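
<para>The control flow at translation time is therefore roughly the
following (a sketch with invented helper names, not the actual functions
in <filename>vg_cachesim.c</filename>):</para>

<programlisting><![CDATA[
/* Sketch: called whenever a basic block is about to be instrumented. */
void notice_BB_translation ( UCodeBlock* cb )
{
   void* ccs = find_BB_CCs(cb);     /* NULL if this BB was never seen  */
   if (ccs == NULL) {
      /* first translation: allocate and initialise the cost centres */
      ccs = alloc_and_init_BB_CCs(cb);
   }
   /* first translation or retranslation, always add the UCode
      instrumentation, reusing the still-accumulating cost centres */
   instrument_BB(cb, ccs);
}]]></programlisting>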

</sect1>



<sect1 id="cg-tech-docs.cachesim" xreflabel="The cache simulation">
<title>The cache simulation</title>

<para>The cache simulation is fairly straightforward.  It just
tracks which memory blocks are in the cache at the moment (it
doesn't track the contents, since that is irrelevant).</para>

<para>The interface to the simulation is quite clean.  The
functions called from the UCode contain calls to the simulation
functions in the files
<filename>vg_cachesim_{I1,D1,L2}.c</filename>; these calls are
inlined so that only one function call is done per simulated x86
instruction.  The file <filename>vg_cachesim.c</filename> simply
<computeroutput>#include</computeroutput>s the three files
containing the simulation, which makes plugging in new cache
simulations very easy -- you just replace the three files and
recompile.</para>
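
<para>For concreteness, a minimal simulation of one cache level, tracking
only which block tags are present, might look like the following.  It is
a sketch of the general idea (a tag-only, LRU, set-associative cache),
not the code actually in
<filename>vg_cachesim_{I1,D1,L2}.c</filename>.</para>

<programlisting><![CDATA[
/* Sketch of a tag-only, LRU, set-associative cache model; the struct
   and parameter names are illustrative, not Valgrind's. */
typedef struct {
   Int   assoc;       /* ways per set                                  */
   Int   line_size;   /* bytes                                         */
   Int   n_sets;
   UInt* tags;        /* n_sets * assoc tags, each set in LRU order    */
} cache_t;

/* Returns True if the access to address 'a' misses.  For simplicity
   this ignores accesses that straddle two cache lines. */
static Bool cache_ref ( cache_t* c, Addr a )
{
   UInt  block    = a / c->line_size;
   UInt  set      = block % c->n_sets;
   UInt* set_tags = &c->tags[set * c->assoc];
   Int   i, j;

   for (i = 0; i < c->assoc; i++) {
      if (set_tags[i] == block) {
         /* hit: move the tag to the front to maintain LRU order */
         for (j = i; j > 0; j--) set_tags[j] = set_tags[j-1];
         set_tags[0] = block;
         return False;
      }
   }
   /* miss: evict the LRU tag (the last slot), insert at the front */
   for (j = c->assoc - 1; j > 0; j--) set_tags[j] = set_tags[j-1];
   set_tags[0] = block;
   return True;
}]]></programlisting>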

</sect1>


<sect1 id="cg-tech-docs.output" xreflabel="Output">
<title>Output</title>

<para>Output is fairly straightforward, basically printing the
cost centre for every instruction, grouped by files and
functions.  Total counts (eg. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra adds can make a
sizeable difference.</para>
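
<para>In other words, the hot path only ever bumps the per-instruction
counters; the totals come from a simple fold over all the stored cost
centres when the output file is written, along these lines (a
sketch):</para>

<programlisting><![CDATA[
/* Sketch: accumulate one CC into a running total at dump time. */
static void add_CC_to_total ( const CC* cc, CC* total )
{
   total->a  += cc->a;
   total->m1 += cc->m1;
   total->m2 += cc->m2;
}]]></programlisting>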

<para>The output file (which is what cg_annotate reads as its input) has
the following format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
  <listitem>
    <para><computeroutput>non_nl_string</computeroutput> is any
    string not containing a newline.</para>
  </listitem>
  <listitem>
    <para><computeroutput>cmd</computeroutput> is a command line
    invocation.</para>
  </listitem>
  <listitem>
    <para><computeroutput>filename</computeroutput> and
    <computeroutput>fn_name</computeroutput> can be anything.</para>
  </listitem>
  <listitem>
    <para><computeroutput>num</computeroutput> and
    <computeroutput>line_num</computeroutput> are decimal
    numbers.</para>
  </listitem>
  <listitem>
    <para><computeroutput>ws</computeroutput> is whitespace.</para>
  </listitem>
  <listitem>
    <para><computeroutput>nl</computeroutput> is a newline.</para>
  </listitem>

</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the top
of the summary.  This is a generic way of providing
simulation-specific information, eg. giving the cache
configuration for the cache simulation.</para>

<para>Counts can be "." to represent "N/A", eg. the number of
write misses for an instruction that doesn't write to
memory.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>.  If the number in a
<computeroutput>count_line</computeroutput> is less, cg_annotate
treats the missing counts as though they were "." entries.</para>

<para>A <computeroutput>file_line</computeroutput> changes the
current file name.  A <computeroutput>fn_line</computeroutput>
changes the current function name.  A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name.  An "fl="
<computeroutput>file_line</computeroutput> and an "fn="
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s, to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> should be
immediately followed by a
<computeroutput>fn_line</computeroutput>.  "fi="
<computeroutput>file_lines</computeroutput> are used to switch
filenames for inlined functions; "fe="
<computeroutput>file_lines</computeroutput> are similar, but are
put at the end of a basic block in which the file name hasn't
been switched back to the original file name.  (fi and fe lines
behave the same; they are distinguished only to help
debugging.)</para>
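
<para>Putting the format together, a small (entirely made-up) output file
might look like this; the event names and "desc:" lines are just
examples:</para>

<programlisting><![CDATA[
desc: I1 cache: 65536 B, 64 B, 2-way associative
desc: D1 cache: 65536 B, 64 B, 2-way associative
desc: L2 cache: 262144 B, 64 B, 8-way associative
cmd: ./myprog
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=myprog.c
fn=main
10 2 1 1 . . . . . .
11 3 0 0 1 1 1 1 0 0
summary: 5 1 1 1 1 1 1 0 0]]></programlisting>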

</sect1>



<sect1 id="cg-tech-docs.summary"
       xreflabel="Summary of performance features">
<title>Summary of performance features</title>

<para>Quite a lot of work has gone into making the profiling as
fast as possible.  This is a summary of the important
features:</para>

<itemizedlist>

  <listitem>
    <para>The basic block-level cost centre storage allows almost
    free cost centre lookup.</para>
  </listitem>

  <listitem>
    <para>Only one function call is made per instruction
    simulated; even this accounts for a sizeable percentage of
    execution time, but it seems unavoidable if we want
    flexibility in the cache simulator.</para>
  </listitem>

  <listitem>
    <para>Unchanging information about an instruction is stored
    in its cost centre, avoiding unnecessary argument pushing,
    and minimising UCode instrumentation bloat.</para>
  </listitem>

  <listitem>
    <para>Summary counts are calculated at the end, rather than
    during execution.</para>
  </listitem>

  <listitem>
    <para>The <computeroutput>cachegrind.out</computeroutput>
    output files can contain huge amounts of information; the file
    format was carefully chosen to minimise file sizes.</para>
  </listitem>

</itemizedlist>

</sect1>



<sect1 id="cg-tech-docs.annotate" xreflabel="Annotation">
<title>Annotation</title>

<para>Annotation is done by cg_annotate.  It is a fairly
straightforward Perl script that slurps up all the cost centres,
and then runs through all the chosen source files, printing out
the cost centres alongside them.  It too has been carefully
optimised.</para>

</sect1>



<sect1 id="cg-tech-docs.extensions" xreflabel="Similar work, extensions">
<title>Similar work, extensions</title>

<para>It would be relatively straightforward to do other
simulations and obtain line-by-line information about interesting
events.  A good example would be branch prediction -- all
branches could be instrumented to interact with a branch
prediction simulator, using very similar techniques to those
described above.</para>

<para>In particular, cg_annotate would not need to change -- the
file format is such that it is not specific to the cache
simulation, but could be used for any kind of line-by-line
information.  The only part of cg_annotate that is specific to
the cache simulation is the name of the input file
(<computeroutput>cachegrind.out</computeroutput>), although it
would be very simple to add an option to control this.</para>

</sect1>

</chapter>