<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="cg-tech-docs" xreflabel="How Cachegrind works">

<title>How Cachegrind works</title>

<sect1 id="cg-tech-docs.profiling" xreflabel="Cache profiling">
<title>Cache profiling</title>

<para>Valgrind is a very nice platform for doing cache profiling
and other kinds of simulation, because it converts horrible x86
instructions into nice clean RISC-like UCode. For example, for
cache profiling we are interested in instructions that read and
write memory; in UCode there are only four instructions that do
this: <computeroutput>LOAD</computeroutput>,
<computeroutput>STORE</computeroutput>,
<computeroutput>FPU_R</computeroutput> and
<computeroutput>FPU_W</computeroutput>. By contrast, because of
the x86 addressing modes, almost every instruction can read or
write memory.</para>

<para>Most of the cache profiling machinery is in the file
<filename>vg_cachesim.c</filename>.</para>

<para>These notes are a somewhat haphazard guide to how
Valgrind's cache profiling works.</para>

</sect1>


<sect1 id="cg-tech-docs.costcentres" xreflabel="Cost centres">
<title>Cost centres</title>

<para>Valgrind gathers cache profiling information about every
instruction executed, individually. Each instruction has a
<command>cost centre</command> associated with it. There are two
kinds of cost centre: one for instructions that don't reference
memory (<computeroutput>iCC</computeroutput>), and one for
instructions that do
(<computeroutput>idCC</computeroutput>):</para>

<programlisting><![CDATA[
typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC; ]]></programlisting>

<para>Each <computeroutput>CC</computeroutput> has three fields,
<computeroutput>a</computeroutput>,
<computeroutput>m1</computeroutput> and
<computeroutput>m2</computeroutput>, for recording references,
level 1 misses and level 2 misses. Each of these is a 64-bit
<computeroutput>ULong</computeroutput> -- the numbers can get
very large, i.e. greater than the 4.2 billion allowed by a
32-bit unsigned int.</para>

<para>An <computeroutput>iCC</computeroutput> has one
<computeroutput>CC</computeroutput>, for instruction cache
accesses. An <computeroutput>idCC</computeroutput> has two: one
for instruction cache accesses, and one for data cache
accesses.</para>

<para>The <computeroutput>iCC</computeroutput> and
<computeroutput>idCC</computeroutput> structs also store
unchanging information about the instruction:</para>
<itemizedlist>
  <listitem>
    <para>An instruction-type identification tag (explained
    below)</para>
  </listitem>
  <listitem>
    <para>Instruction size</para>
  </listitem>
  <listitem>
    <para>Data reference size
    (<computeroutput>idCC</computeroutput> only)</para>
  </listitem>
  <listitem>
    <para>Instruction address</para>
  </listitem>
</itemizedlist>

<para>Note that the data address is not one of the fields of an
<computeroutput>idCC</computeroutput>. This is because for many
memory-referencing instructions the data address can change each
time the instruction is executed (e.g. if it uses
register-offset addressing). We have to give this item to the
cache simulation in a different way (see the Instrumentation
section below). Some memory-referencing instructions do always
reference the same address, but we don't try to treat them
specially, in order to keep things simple.</para>

<para>Also note that there is only room for recording info about
one data cache access in an
<computeroutput>idCC</computeroutput>. So what about
instructions that do a read then a write, such as:</para>
<programlisting><![CDATA[
incl (%esi)]]></programlisting>

<para>In a write-allocate cache, as simulated by Valgrind, the
write cannot miss, since it immediately follows the read, which
will drag the block into the cache if it's not already there. So
the write access isn't really interesting, and Valgrind doesn't
record it. This means that Valgrind doesn't measure memory
references, but rather memory references that could miss in the
cache. This behaviour is the same as that of the AMD Athlon's
hardware counters. It also has the benefit of simplifying the
implementation -- instructions that read and write memory can be
treated like instructions that read memory.</para>
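
<para>The read-then-write behaviour can be sketched in C as
follows. This is an illustrative model only, assuming a single
resident block and hypothetical names
(<computeroutput>doref</computeroutput>,
<computeroutput>rmw</computeroutput>); it is not the code
Valgrind uses:</para>

<programlisting><![CDATA[
#include <assert.h>

/* Hypothetical sketch: a one-block, write-allocate "cache". */
static long cached_block = -1;   /* resident block, -1 = empty */
static long accesses, misses;

/* Reference one block; on a miss, allocate it (write-allocate). */
static void doref(long block)
{
   accesses++;
   if (cached_block != block) {
      misses++;
      cached_block = block;      /* drag the block into the cache */
   }
}

/* A read-modify-write like "incl (%esi)" is recorded as ONE access:
   the read may miss, but the write that follows cannot, because the
   read has just dragged the block in.  So the write isn't counted. */
static void rmw(long block)
{
   doref(block);                 /* read: may miss */
   /* write: guaranteed hit -- not recorded */
}
]]></programlisting>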

</sect1>


<sect1 id="cg-tech-docs.ccstore" xreflabel="Storing cost-centres">
<title>Storing cost-centres</title>

<para>Cost centres are stored in a way that makes them very cheap
to look up, which is important since one is looked up for every
original x86 instruction executed.</para>

<para>Valgrind does JIT translations at the basic block level,
and cost centres are also set up and stored at the basic block
level. By doing things carefully, we store all the cost centres
for a basic block in a contiguous array, and lookup comes almost
for free.</para>

<para>Consider this part of a basic block (for exposition
purposes, pretend it's an entire basic block):</para>
<programlisting><![CDATA[
movl $0x0,%eax
movl $0x99, -4(%ebp)]]></programlisting>

<para>The translation to UCode looks like this:</para>
<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
STL t18, (t14)
INCEIPo $7]]></programlisting>

<para>The first step is to allocate the cost centres. This
requires a preliminary pass to count how many x86 instructions
are in the basic block, and what their types (and thus sizes)
are. UCode translations for single x86 instructions are
delimited by the <computeroutput>INCEIPo</computeroutput>
instruction, the argument of which gives the byte size of the
instruction (note that lazy INCEIP updating is turned off to
allow this).</para>

<para>We can tell if an x86 instruction references memory by
looking for <computeroutput>LDL</computeroutput> and
<computeroutput>STL</computeroutput> UCode instructions, and
thus what kind of cost centre is required. From this we can
determine how many cost centres we need for the basic block, and
their sizes. We can then allocate them in a single array.</para>
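
<para>The sizing arithmetic can be sketched like this; the
struct sizes follow the byte layouts shown below (with a 32-bit
<computeroutput>Addr</computeroutput>), and the function name
<computeroutput>alloc_BB_CCs</computeroutput> is a hypothetical
stand-in, not Valgrind's:</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Byte sizes matching the layouts described in the text. */
#define ICC_SIZE  32  /* tag + instr_size + pad(2) + instr_addr + CC I */
#define IDCC_SIZE 56  /* tag + sizes + pad(1) + instr_addr + CC I + CC D */

/* Allocate one contiguous array holding every cost centre for a
   basic block, given the counts from the preliminary pass. */
static void* alloc_BB_CCs(int n_iCC, int n_idCC, size_t* size_out)
{
   size_t size = (size_t)n_iCC * ICC_SIZE + (size_t)n_idCC * IDCC_SIZE;
   void* array = malloc(size);
   if (array) memset(array, 0, size);  /* all counts start at zero */
   *size_out = size;
   return array;
}
]]></programlisting>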

<para>Consider the example code above. After the preliminary
pass, we know we need two cost centres, one
<computeroutput>iCC</computeroutput> and one
<computeroutput>idCC</computeroutput>. So we allocate an array
to store these which looks like this:</para>

<programlisting><![CDATA[
|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)

|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)
|(uninit)| D.a         (8 bytes)
|(uninit)| D.m1        (8 bytes)
|(uninit)| D.m2        (8 bytes)]]></programlisting>

<para>(We can see now why we need tags to distinguish between the
two types of cost centre.)</para>

<para>We also record the size of the array. We look up the debug
info of the first instruction in the basic block, and then stick
the array into a table indexed by filename and function name.
This makes it easy to dump the information quickly to file at the
end.</para>

</sect1>


<sect1 id="cg-tech-docs.instrum" xreflabel="Instrumentation">
<title>Instrumentation</title>

<para>The instrumentation pass has two main jobs:</para>

<orderedlist>
  <listitem>
    <para>Fill in the gaps in the allocated cost centres.</para>
  </listitem>
  <listitem>
    <para>Add UCode to call the cache simulator for each
    instruction.</para>
  </listitem>
</orderedlist>

<para>The instrumentation pass steps through the UCode and the
cost centres in tandem. As each original x86 instruction's UCode
is processed, the appropriate gaps in the instruction's cost
centre are filled in, for example:</para>

<programlisting><![CDATA[
|INSTR_CC| tag         (1 byte)
|5       | instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|i_addr1 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)

|WRITE_CC| tag         (1 byte)
|7       | instr_size  (1 byte)
|4       | data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|i_addr2 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)
|0       | D.a         (8 bytes)
|0       | D.m1        (8 bytes)
|0       | D.m2        (8 bytes)]]></programlisting>

<para>(Note that this step is not performed if a basic block is
re-translated; see <xref linkend="cg-tech-docs.retranslations"/>
for more information.)</para>

<para>GCC inserts the padding before the
<computeroutput>instr_addr</computeroutput> field so that it is
word-aligned.</para>

<para>The instrumentation added to call the cache simulation
function looks like this (instrumentation is indented to
distinguish it from the original UCode):</para>

<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   MOVL $0x4091F8A4, t46  # address of 1st CC
   PUSHL t46
   CALLMo $0x12           # first cachesim function
   CLEARo $0x4
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
   MOVL t14, t42
STL t18, (t14)
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   PUSHL t42
   MOVL $0x4091F8C4, t44  # address of 2nd CC
   PUSHL t44
   CALLMo $0x13           # second cachesim function
   CLEARo $0x8
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $7]]></programlisting>

<para>Consider the first instruction's UCode. Each call is
surrounded by three <computeroutput>PUSHL</computeroutput> and
<computeroutput>POPL</computeroutput> instructions to save and
restore the caller-save registers. Then the address of the
instruction's cost centre is pushed onto the stack, to be the
first argument to the cache simulation function. The address is
known at this point because we are doing a simultaneous pass
through the cost centre array. This means the cost centre lookup
for each instruction is almost free (just the cost of pushing an
argument for a function call). Then the call to the cache
simulation function for non-memory-referencing instructions is
made (note that the <computeroutput>CALLMo</computeroutput>
UInstruction takes an offset into a table of predefined
functions; it is not an absolute address), and the single
argument is <computeroutput>CLEAR</computeroutput>ed from the
stack.</para>
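
<para>In outline, the two helpers reached via
<computeroutput>CALLMo</computeroutput> behave roughly like
this. The names (<computeroutput>log_instr</computeroutput>,
<computeroutput>log_instr_data</computeroutput>) and bodies are
illustrative assumptions, not the real functions in
<filename>vg_cachesim.c</filename>; the miss-count updates done
by the simulation are omitted:</para>

<programlisting><![CDATA[
#include <assert.h>

typedef unsigned long long ULong;
typedef unsigned long      Addr;
typedef unsigned char      UChar;

typedef struct { ULong a, m1, m2; } CC;
typedef struct { UChar tag, instr_size; Addr instr_addr; CC I; } iCC;
typedef struct { UChar tag, instr_size, data_size;
                 Addr instr_addr; CC I, D; } idCC;

/* Offset 0x12: non-memory-referencing instruction; one argument,
   the address of its cost centre. */
static void log_instr(iCC* cc)
{
   cc->I.a++;          /* I1 reference; I1/L2 miss counts would be
                          bumped by the cache simulation */
}

/* Offset 0x13: memory-referencing instruction; the data address is
   passed as a second argument, since it can change per execution. */
static void log_instr_data(idCC* cc, Addr data_addr)
{
   cc->I.a++;
   cc->D.a++;          /* D1 reference at data_addr */
   (void)data_addr;    /* the real code feeds this to the D1 simulation */
}
]]></programlisting>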
322
323<para>The second instruction's UCode is similar. The only
324difference is that, as mentioned before, we have to pass the
325address of the data item referenced to the cache simulation
326function too. This explains the <computeroutput>MOVL t14,
327t42</computeroutput> and <computeroutput>PUSHL
328t42</computeroutput> UInstructions. (Note that the seemingly
329redundant <computeroutput>MOV</computeroutput>ing will probably
330be optimised away during register allocation.)</para>
331
332<para>Note that instead of storing unchanging information about
333each instruction (instruction size, data size, etc) in its cost
334centre, we could have passed in these arguments to the simulation
335function. But this would slow the calls down (two or three extra
336arguments pushed onto the stack). Also it would bloat the UCode
337instrumentation by amounts similar to the space required for them
338in the cost centre; bloated UCode would also fill the translation
339cache more quickly, requiring more translations for large
340programs and slowing them down more.</para>
341
342</sect1>
343
344
<sect1 id="cg-tech-docs.retranslations"
       xreflabel="Handling basic block retranslations">
<title>Handling basic block retranslations</title>

<para>The above description ignores one complication. Valgrind
has a limited-size cache for basic block translations; if it
fills up, old translations are discarded. If a discarded basic
block is executed again, it must be re-translated.</para>

<para>However, we can't use this approach for profiling -- we
can't throw away cost centres for instructions in the middle of
execution! So when a basic block is translated, we first look
for its cost centre array in the hash table. If there is no cost
centre array, it must be the first translation, so we proceed as
described above. But if there is a cost centre array already, it
must be a retranslation. In this case, we skip the cost centre
allocation and initialisation steps, but still do the UCode
instrumentation step.</para>

</sect1>



<sect1 id="cg-tech-docs.cachesim" xreflabel="The cache simulation">
<title>The cache simulation</title>

<para>The cache simulation is fairly straightforward. It just
tracks which memory blocks are in the cache at the moment (it
doesn't track the contents, since that is irrelevant).</para>
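
<para>A minimal sketch of that idea: a tiny direct-mapped cache
that remembers only which block occupies each set. The real
simulation is set-associative with LRU replacement, so this is a
simplified illustration, not Valgrind's code:</para>

<programlisting><![CDATA[
#include <assert.h>

#define N_SETS    256
#define LINE_BITS 6              /* 64-byte lines */

static unsigned long tags[N_SETS];
static int           valid[N_SETS];

/* Returns 1 on a miss, 0 on a hit; residency is the only state
   tracked -- the block's contents are irrelevant. */
static int cache_ref(unsigned long addr)
{
   unsigned long block = addr >> LINE_BITS;
   unsigned      set   = block % N_SETS;
   if (valid[set] && tags[set] == block)
      return 0;                  /* hit */
   valid[set] = 1;
   tags[set]  = block;           /* fill: remember the block only */
   return 1;                     /* miss */
}
]]></programlisting>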

<para>The interface to the simulation is quite clean. The
functions called from the UCode contain calls to the simulation
functions in the files
<filename>vg_cachesim_{I1,D1,L2}.c</filename>; these calls are
inlined so that only one function call is done per simulated x86
instruction. The file <filename>vg_cachesim.c</filename> simply
<computeroutput>#include</computeroutput>s the three files
containing the simulation, which makes plugging in new cache
simulations very easy -- you just replace the three files and
recompile.</para>

</sect1>


<sect1 id="cg-tech-docs.output" xreflabel="Output">
<title>Output</title>

<para>Output is fairly straightforward, basically printing the
cost centre for every instruction, grouped by files and
functions. Total counts (e.g. total cache accesses, total L1
misses) are calculated when traversing this structure rather
than during execution, to save time; the cache simulation
functions are called so often that even one or two extra adds
can make a sizeable difference.</para>

<para>The input file has the following format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>
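
<para>As an illustration, a small input file conforming to this
grammar might look like the following (the file names, counts
and cache description are made up for the example):</para>

<programlisting><![CDATA[
desc: I1 cache: 65536 B, 64 B, 2-way associative
cmd: ./example arg1
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=example.c
fn=main
10 2 1 1 1 1 1 . . .
11 1 0 0 . . . 1 1 1
summary: 3 1 1 1 1 1 1 1 1]]></programlisting>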

<para>Where:</para>
<itemizedlist>
  <listitem>
    <para><computeroutput>non_nl_string</computeroutput> is any
    string not containing a newline.</para>
  </listitem>
  <listitem>
    <para><computeroutput>cmd</computeroutput> is a command-line
    invocation.</para>
  </listitem>
  <listitem>
    <para><computeroutput>filename</computeroutput> and
    <computeroutput>fn_name</computeroutput> can be anything.</para>
  </listitem>
  <listitem>
    <para><computeroutput>num</computeroutput> and
    <computeroutput>line_num</computeroutput> are decimal
    numbers.</para>
  </listitem>
  <listitem>
    <para><computeroutput>ws</computeroutput> is whitespace.</para>
  </listitem>
  <listitem>
    <para><computeroutput>nl</computeroutput> is a newline.</para>
  </listitem>
</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the
top of the summary. This is a generic way of providing
simulation-specific information, e.g. for giving the cache
configuration used in the cache simulation.</para>

<para>Counts can be "." to represent "N/A", e.g. the number of
write misses for an instruction that doesn't write to
memory.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and in the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If a
<computeroutput>count_line</computeroutput> has fewer,
cg_annotate treats the missing counts as though they were "."
entries.</para>

<para>A <computeroutput>file_line</computeroutput> changes the
current file name. A <computeroutput>fn_line</computeroutput>
changes the current function name. A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. An "fl="
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s, to give the
context of the first
<computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> should be
immediately followed by a
<computeroutput>fn_line</computeroutput>. "fi="
<computeroutput>file_line</computeroutput>s are used to switch
filenames for inlined functions; "fe="
<computeroutput>file_line</computeroutput>s are similar, but are
put at the end of a basic block in which the file name hasn't
been switched back to the original file name. (fi and fe lines
behave the same; they are only distinguished to help with
debugging.)</para>

</sect1>



<sect1 id="cg-tech-docs.summary"
       xreflabel="Summary of performance features">
<title>Summary of performance features</title>

<para>Quite a lot of work has gone into making the profiling as
fast as possible. This is a summary of the important
features:</para>

<itemizedlist>

  <listitem>
    <para>The basic-block-level cost centre storage allows
    almost free cost centre lookup.</para>
  </listitem>

  <listitem>
    <para>Only one function call is made per instruction
    simulated; even this accounts for a sizeable percentage of
    execution time, but it seems unavoidable if we want
    flexibility in the cache simulator.</para>
  </listitem>

  <listitem>
    <para>Unchanging information about an instruction is stored
    in its cost centre, avoiding unnecessary argument pushing,
    and minimising UCode instrumentation bloat.</para>
  </listitem>

  <listitem>
    <para>Summary counts are calculated at the end, rather than
    during execution.</para>
  </listitem>

  <listitem>
    <para>The <computeroutput>cachegrind.out</computeroutput>
    output files can contain huge amounts of information; the
    file format was carefully chosen to minimise file
    sizes.</para>
  </listitem>

</itemizedlist>

</sect1>



<sect1 id="cg-tech-docs.annotate" xreflabel="Annotation">
<title>Annotation</title>

<para>Annotation is done by cg_annotate. It is a fairly
straightforward Perl script that slurps up all the cost centres,
and then runs through all the chosen source files, printing out
cost centres with them. It too has been carefully optimised.</para>

</sect1>



<sect1 id="cg-tech-docs.extensions" xreflabel="Similar work, extensions">
<title>Similar work, extensions</title>

<para>It would be relatively straightforward to do other
simulations and obtain line-by-line information about interesting
events. A good example would be branch prediction -- all
branches could be instrumented to interact with a branch
prediction simulator, using very similar techniques to those
described above.</para>

<para>In particular, cg_annotate would not need to change -- the
file format is such that it is not specific to the cache
simulation, but could be used for any kind of line-by-line
information. The only part of cg_annotate that is specific to
the cache simulation is the name of the input file
(<computeroutput>cachegrind.out</computeroutput>), although it
would be very simple to add an option to control this.</para>

</sect1>

</chapter>