<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
          "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="cg-tech-docs" xreflabel="How Cachegrind works">

<title>How Cachegrind works</title>

<sect1 id="cg-tech-docs.profiling" xreflabel="Cache profiling">
<title>Cache profiling</title>

<para>[Note: this document is now very old, and much of its content is
out of date and misleading.]</para>

<para>Valgrind is a very nice platform for doing cache profiling
and other kinds of simulation, because it converts horrible x86
instructions into nice clean RISC-like UCode.  For example, for
cache profiling we are interested in instructions that read and
write memory; in UCode there are only four instructions that do
this: <computeroutput>LOAD</computeroutput>,
<computeroutput>STORE</computeroutput>,
<computeroutput>FPU_R</computeroutput> and
<computeroutput>FPU_W</computeroutput>.  By contrast, because of
the x86 addressing modes, almost every instruction can read or
write memory.</para>

<para>Most of the cache profiling machinery is in the file
<filename>vg_cachesim.c</filename>.</para>

<para>These notes are a somewhat haphazard guide to how
Valgrind's cache profiling works.</para>

</sect1>


<sect1 id="cg-tech-docs.costcentres" xreflabel="Cost centres">
<title>Cost centres</title>

<para>Valgrind gathers cache profiling information about every
instruction executed, individually.  Each instruction has a
<command>cost centre</command> associated with it.  There are two
kinds of cost centre: one for instructions that don't reference
memory (<computeroutput>iCC</computeroutput>), and one for
instructions that do
(<computeroutput>idCC</computeroutput>):</para>

<programlisting><![CDATA[
typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC; ]]></programlisting>

<para>Each <computeroutput>CC</computeroutput> has three fields
<computeroutput>a</computeroutput>,
<computeroutput>m1</computeroutput>,
<computeroutput>m2</computeroutput> for recording references,
level 1 misses and level 2 misses.  Each of these is a 64-bit
<computeroutput>ULong</computeroutput> -- the numbers can get
very large, ie. greater than the 4.2 billion allowed by a 32-bit
unsigned int.</para>

<para>An <computeroutput>iCC</computeroutput> has one
<computeroutput>CC</computeroutput> for instruction cache
accesses.  An <computeroutput>idCC</computeroutput> has two, one
for instruction cache accesses, and one for data cache
accesses.</para>
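
<para>To make the meaning of the three counters concrete, the following
sketch shows how one <computeroutput>CC</computeroutput> might be bumped
once the simulation has classified an access.  The
<computeroutput>SimResult</computeroutput> codes and the helper name are
illustrative only; the real code in <filename>vg_cachesim.c</filename>
is structured differently.</para>

<programlisting><![CDATA[
typedef enum { Hit, MissL1, MissL2 } SimResult;   /* hypothetical */

/* Sketch: update one cost-centre counter triple. */
static void update_CC ( CC* cc, SimResult res )
{
   cc->a++;                        /* every access is counted      */
   if (res != Hit)    cc->m1++;    /* missed in the level 1 cache  */
   if (res == MissL2) cc->m2++;    /* also missed in the level 2   */
}]]></programlisting>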

<para>The <computeroutput>iCC</computeroutput> and
<computeroutput>idCC</computeroutput> structs also store
unchanging information about the instruction:</para>
<itemizedlist>
  <listitem>
    <para>An instruction-type identification tag (explained
    below)</para>
  </listitem>
  <listitem>
    <para>Instruction size</para>
  </listitem>
  <listitem>
    <para>Data reference size
    (<computeroutput>idCC</computeroutput> only)</para>
  </listitem>
  <listitem>
    <para>Instruction address</para>
  </listitem>
</itemizedlist>

<para>Note that data address is not one of the fields for
<computeroutput>idCC</computeroutput>.  This is because for many
memory-referencing instructions the data address can change each
time it's executed (eg. if it uses register-offset addressing).
We have to give this item to the cache simulation in a different
way (see the Instrumentation section below).  Some memory-referencing
instructions do always reference the same address, but we don't
try to treat them specially, in order to keep things simple.</para>

<para>Also note that there is only room for recording info about
one data cache access in an
<computeroutput>idCC</computeroutput>.  So what about
instructions that do a read then a write, such as:</para>
<programlisting><![CDATA[
inc (%esi)]]></programlisting>

<para>In a write-allocate cache, as simulated by Valgrind, the
write cannot miss, since it immediately follows the read which
will drag the block into the cache if it's not already there.  So
the write access isn't really interesting, and Valgrind doesn't
record it.  This means that Valgrind doesn't measure memory
references, but rather memory references that could miss in the
cache.  This behaviour is the same as that used by the AMD Athlon
hardware counters.  It also has the benefit of simplifying the
implementation -- instructions that read and write memory can be
treated like instructions that read memory.</para>

</sect1>


<sect1 id="cg-tech-docs.ccstore" xreflabel="Storing cost-centres">
<title>Storing cost-centres</title>

<para>Cost centres are stored in a way that makes them very cheap
to look up, which is important since one is looked up for every
original x86 instruction executed.</para>

<para>Valgrind does JIT translations at the basic block level,
and cost centres are also set up and stored at the basic block
level.  By doing things carefully, we store all the cost centres
for a basic block in a contiguous array, and lookup comes almost
for free.</para>

<para>Consider this part of a basic block (for exposition
purposes, pretend it's an entire basic block):</para>
<programlisting><![CDATA[
movl $0x0,%eax
movl $0x99, -4(%ebp)]]></programlisting>

<para>The translation to UCode looks like this:</para>
<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
STL t18, (t14)
INCEIPo $7]]></programlisting>

<para>The first step is to allocate the cost centres.  This
requires a preliminary pass to count how many x86 instructions
were in the basic block, and their types (and thus sizes).  UCode
translations for single x86 instructions are delimited by the
<computeroutput>INCEIPo</computeroutput> instruction, the
argument of which gives the byte size of the instruction (note
that lazy INCEIP updating is turned off to allow this).</para>

<para>We can tell if an x86 instruction references memory by
looking for <computeroutput>LDL</computeroutput> and
<computeroutput>STL</computeroutput> UCode instructions, and thus
determine what kind of cost centre is required.  From this we can
work out how many cost centres we need for the basic block, and
their sizes.  We can then allocate them in a single array.</para>
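
<para>Since the cost centres for a basic block are packed into one
contiguous array, the size of that array is just the sum of the sizes of
the individual cost centres.  A minimal sketch (not the real code; the
instruction counts come from the preliminary pass described above):</para>

<programlisting><![CDATA[
/* Sketch only: total bytes of cost centres needed for one basic block,
   given how many of its x86 instructions do / don't reference memory. */
static Int BB_CC_array_size ( Int n_non_mem_instrs, Int n_mem_instrs )
{
   return n_non_mem_instrs * sizeof(iCC) + n_mem_instrs * sizeof(idCC);
}]]></programlisting>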

<para>Consider the example code above.  After the preliminary
pass, we know we need two cost centres, one
<computeroutput>iCC</computeroutput> and one
<computeroutput>idCC</computeroutput>.  So we allocate an array to
store these which looks like this:</para>

<programlisting><![CDATA[
|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)

|(uninit)| tag         (1 byte)
|(uninit)| instr_size  (1 byte)
|(uninit)| data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|(uninit)| instr_addr  (4 bytes)
|(uninit)| I.a         (8 bytes)
|(uninit)| I.m1        (8 bytes)
|(uninit)| I.m2        (8 bytes)
|(uninit)| D.a         (8 bytes)
|(uninit)| D.m1        (8 bytes)
|(uninit)| D.m2        (8 bytes)]]></programlisting>

<para>(We can see now why we need tags to distinguish between the
two types of cost centres.)</para>

<para>We also record the size of the array.  We look up the debug
info of the first instruction in the basic block, and then stick
the array into a table indexed by filename and function name.
This makes it easy to dump the information quickly to file at the
end.</para>
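
<para>Conceptually, the per-basic-block record held in that table carries
the key and the array, along these lines.  This is a sketch with made-up
field and struct names; the real data structure in
<filename>vg_cachesim.c</filename> is not identical.</para>

<programlisting><![CDATA[
/* Sketch: one entry in the (filename, function) indexed table. */
typedef struct _BB_CC_record {
   Char* filename;       /* from the debug info of the first instr  */
   Char* fn_name;
   Int   array_size;     /* in bytes                                */
   void* array;          /* the contiguous iCC/idCC cost centres    */
   struct _BB_CC_record* next;   /* hash-chain link, hypothetical   */
} BB_CC_record;]]></programlisting>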

</sect1>


<sect1 id="cg-tech-docs.instrum" xreflabel="Instrumentation">
<title>Instrumentation</title>

<para>The instrumentation pass has two main jobs:</para>

<orderedlist>
  <listitem>
    <para>Fill in the gaps in the allocated cost centres.</para>
  </listitem>
  <listitem>
    <para>Add UCode to call the cache simulator for each
    instruction.</para>
  </listitem>
</orderedlist>

<para>The instrumentation pass steps through the UCode and the
cost centres in tandem.  As each original x86 instruction's UCode
is processed, the appropriate gaps in the instruction's cost
centre are filled in, for example:</para>

<programlisting><![CDATA[
|INSTR_CC| tag         (1 byte)
|5       | instr_size  (1 byte)
|(uninit)| (padding)   (2 bytes)
|i_addr1 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)

|WRITE_CC| tag         (1 byte)
|7       | instr_size  (1 byte)
|4       | data_size   (1 byte)
|(uninit)| (padding)   (1 byte)
|i_addr2 | instr_addr  (4 bytes)
|0       | I.a         (8 bytes)
|0       | I.m1        (8 bytes)
|0       | I.m2        (8 bytes)
|0       | D.a         (8 bytes)
|0       | D.m1        (8 bytes)
|0       | D.m2        (8 bytes)]]></programlisting>

<para>(Note that this step is not performed if a basic block is
re-translated; see <xref linkend="cg-tech-docs.retranslations"/> for
more information.)</para>

<para>GCC inserts the padding after the
<computeroutput>instr_size</computeroutput> (and
<computeroutput>data_size</computeroutput>) fields so that the
<computeroutput>instr_addr</computeroutput> field is word
aligned.</para>

<para>The instrumentation added to call the cache simulation
function looks like this (instrumentation is indented to
distinguish it from the original UCode):</para>

<programlisting><![CDATA[
MOVL $0x0, t20
PUTL t20, %EAX
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   MOVL $0x4091F8A4, t46  # address of 1st CC
   PUSHL t46
   CALLMo $0x12           # cachesim function for non-mem-ref instrs
   CLEARo $0x4
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
   MOVL t14, t42
STL t18, (t14)
   PUSHL %eax
   PUSHL %ecx
   PUSHL %edx
   PUSHL t42
   MOVL $0x4091F8C4, t44  # address of 2nd CC
   PUSHL t44
   CALLMo $0x13           # cachesim function for mem-ref instrs
   CLEARo $0x8
   POPL %edx
   POPL %ecx
   POPL %eax
INCEIPo $7]]></programlisting>

<para>Consider the first instruction's UCode.  Each call is
surrounded by three <computeroutput>PUSHL</computeroutput> and
<computeroutput>POPL</computeroutput> instructions to save and
restore the caller-save registers.  Then the address of the
instruction's cost centre is pushed onto the stack, to be the
first argument to the cache simulation function.  The address is
known at this point because we are doing a simultaneous pass
through the cost centre array.  This means the cost centre lookup
for each instruction is almost free (just the cost of pushing an
argument for a function call).  Then the call to the cache
simulation function for non-memory-reference instructions is made
(note that the <computeroutput>CALLMo</computeroutput>
UInstruction takes an offset into a table of predefined
functions; it is not an absolute address), and the single
argument is <computeroutput>CLEAR</computeroutput>ed from the
stack.</para>

<para>The second instruction's UCode is similar.  The only
difference is that, as mentioned before, we have to pass the
address of the data item referenced to the cache simulation
function too.  This explains the <computeroutput>MOVL t14,
t42</computeroutput> and <computeroutput>PUSHL
t42</computeroutput> UInstructions.  (Note that the seemingly
redundant <computeroutput>MOV</computeroutput>ing will probably
be optimised away during register allocation.)</para>
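
<para>Putting the pieces together, the two functions reached through
<computeroutput>CALLMo</computeroutput> have roughly the following
shape.  This is a sketch only -- the names, and the way the individual
cache simulations are invoked, are assumptions rather than the actual
code in <filename>vg_cachesim.c</filename>.</para>

<programlisting><![CDATA[
/* Sketch: called for an x86 instruction that doesn't reference memory.
   Its cost centre address was pushed as the only argument. */
static void log_non_mem_instr ( iCC* cc )
{
   cc->I.a++;
   /* simulate the I1/L2 access for the instruction's own bytes,
      bumping cc->I.m1 / cc->I.m2 on misses */
}

/* Sketch: called for an x86 instruction that does reference memory.
   The data address is only known at run time, so it is passed as a
   second argument rather than being stored in the cost centre. */
static void log_mem_instr ( idCC* cc, Addr data_addr )
{
   cc->I.a++;
   /* simulate the instruction fetch, as above */
   cc->D.a++;
   /* simulate the D1/L2 access for (data_addr, cc->data_size),
      bumping cc->D.m1 / cc->D.m2 on misses */
}]]></programlisting>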

<para>Note that instead of storing unchanging information about
each instruction (instruction size, data size, etc.) in its cost
centre, we could have passed these in as arguments to the
simulation function.  But this would slow the calls down (two or
three extra arguments pushed onto the stack).  Also it would
bloat the UCode instrumentation by amounts similar to the space
required for them in the cost centre; bloated UCode would also
fill the translation cache more quickly, requiring more
translations for large programs and slowing them down
more.</para>

</sect1>


<sect1 id="cg-tech-docs.retranslations"
       xreflabel="Handling basic block retranslations">
<title>Handling basic block retranslations</title>

<para>The above description ignores one complication.  Valgrind
has a limited-size cache for basic block translations; if it
fills up, old translations are discarded.  If a discarded basic
block is executed again, it must be re-translated.</para>

<para>However, we can't use this approach for profiling -- we
can't throw away cost centres for instructions in the middle of
execution!  So when a basic block is translated, we first look
for its cost centre array in the hash table.  If there is no cost
centre array, it must be the first translation, so we proceed as
described above.  But if there is a cost centre array already, it
must be a retranslation.  In this case, we skip the cost centre
allocation and initialisation steps, but still do the UCode
instrumentation step.</para>
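
<para>The control flow at translation time is therefore roughly the
following (a sketch with invented helper names, not the actual functions
in <filename>vg_cachesim.c</filename>):</para>

<programlisting><![CDATA[
/* Sketch: called whenever a basic block is about to be instrumented. */
void notice_BB_translation ( UCodeBlock* cb )
{
   void* ccs = find_BB_CCs(cb);     /* NULL if this BB was never seen  */
   if (ccs == NULL) {
      /* first translation: allocate and initialise the cost centres */
      ccs = alloc_and_init_BB_CCs(cb);
   }
   /* first translation or retranslation, always add the UCode
      instrumentation, reusing the still-accumulating cost centres */
   instrument_BB(cb, ccs);
}]]></programlisting>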

</sect1>



<sect1 id="cg-tech-docs.cachesim" xreflabel="The cache simulation">
<title>The cache simulation</title>

<para>The cache simulation is fairly straightforward.  It just
tracks which memory blocks are in the cache at the moment (it
doesn't track the contents, since that is irrelevant).</para>

<para>The interface to the simulation is quite clean.  The
functions called from the UCode contain calls to the simulation
functions in the files
<filename>vg_cachesim_{I1,D1,L2}.c</filename>; these calls are
inlined so that only one function call is done per simulated x86
instruction.  The file <filename>vg_cachesim.c</filename> simply
<computeroutput>#include</computeroutput>s the three files
containing the simulation, which makes plugging in new cache
simulations very easy -- you just replace the three files and
recompile.</para>
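
<para>For concreteness, a minimal simulation of one cache level, tracking
only which block tags are present, might look like the following.  It is
a sketch of the general idea (a tag-only, LRU, set-associative cache),
not the code actually in
<filename>vg_cachesim_{I1,D1,L2}.c</filename>.</para>

<programlisting><![CDATA[
/* Sketch of a tag-only, LRU, set-associative cache model; the struct
   and parameter names are illustrative, not Valgrind's. */
typedef struct {
   Int   assoc;       /* ways per set                                  */
   Int   line_size;   /* bytes                                         */
   Int   n_sets;
   UInt* tags;        /* n_sets * assoc tags, each set in LRU order    */
} cache_t;

/* Returns True if the access to address 'a' misses.  For simplicity
   this ignores accesses that straddle two cache lines. */
static Bool cache_ref ( cache_t* c, Addr a )
{
   UInt  block    = a / c->line_size;
   UInt  set      = block % c->n_sets;
   UInt* set_tags = &c->tags[set * c->assoc];
   Int   i, j;

   for (i = 0; i < c->assoc; i++) {
      if (set_tags[i] == block) {
         /* hit: move the tag to the front to maintain LRU order */
         for (j = i; j > 0; j--) set_tags[j] = set_tags[j-1];
         set_tags[0] = block;
         return False;
      }
   }
   /* miss: evict the LRU tag (the last slot), insert at the front */
   for (j = c->assoc - 1; j > 0; j--) set_tags[j] = set_tags[j-1];
   set_tags[0] = block;
   return True;
}]]></programlisting>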

</sect1>


<sect1 id="cg-tech-docs.output" xreflabel="Output">
<title>Output</title>

<para>Output is fairly straightforward, basically printing the
cost centre for every instruction, grouped by files and
functions.  Total counts (eg. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra adds can make a
sizeable difference.</para>
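
<para>In other words, the hot path only ever bumps the per-instruction
counters; the totals come from a simple fold over all the stored cost
centres when the output file is written, along these lines (a
sketch):</para>

<programlisting><![CDATA[
/* Sketch: accumulate one CC into a running total at dump time. */
static void add_CC_to_total ( const CC* cc, CC* total )
{
   total->a  += cc->a;
   total->m1 += cc->m1;
   total->m2 += cc->m2;
}]]></programlisting>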

<para>The output file (which is what cg_annotate reads as its input) has
the following format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
  <listitem>
    <para><computeroutput>non_nl_string</computeroutput> is any
    string not containing a newline.</para>
  </listitem>
  <listitem>
    <para><computeroutput>cmd</computeroutput> is a command line
    invocation.</para>
  </listitem>
  <listitem>
    <para><computeroutput>filename</computeroutput> and
    <computeroutput>fn_name</computeroutput> can be anything.</para>
  </listitem>
  <listitem>
    <para><computeroutput>num</computeroutput> and
    <computeroutput>line_num</computeroutput> are decimal
    numbers.</para>
  </listitem>
  <listitem>
    <para><computeroutput>ws</computeroutput> is whitespace.</para>
  </listitem>
  <listitem>
    <para><computeroutput>nl</computeroutput> is a newline.</para>
  </listitem>

</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the top
of the summary.  This is a generic way of providing
simulation-specific information, eg. giving the cache
configuration for the cache simulation.</para>

<para>Counts can be "." to represent "N/A", eg. the number of
write misses for an instruction that doesn't write to
memory.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>.  If the number in a
<computeroutput>count_line</computeroutput> is less, cg_annotate
treats the missing counts as though they were "." entries.</para>

<para>A <computeroutput>file_line</computeroutput> changes the
current file name.  A <computeroutput>fn_line</computeroutput>
changes the current function name.  A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name.  An "fl="
<computeroutput>file_line</computeroutput> and an "fn="
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s, to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> should be
immediately followed by a
<computeroutput>fn_line</computeroutput>.  "fi="
<computeroutput>file_lines</computeroutput> are used to switch
filenames for inlined functions; "fe="
<computeroutput>file_lines</computeroutput> are similar, but are
put at the end of a basic block in which the file name hasn't
been switched back to the original file name.  (fi and fe lines
behave the same; they are distinguished only to help
debugging.)</para>
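
<para>Putting the format together, a small (entirely made-up) output file
might look like this; the event names and "desc:" lines are just
examples:</para>

<programlisting><![CDATA[
desc: I1 cache: 65536 B, 64 B, 2-way associative
desc: D1 cache: 65536 B, 64 B, 2-way associative
desc: L2 cache: 262144 B, 64 B, 8-way associative
cmd: ./myprog
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=myprog.c
fn=main
10 2 1 1 . . . . . .
11 3 0 0 1 1 1 1 0 0
summary: 5 1 1 1 1 1 1 0 0]]></programlisting>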

</sect1>



<sect1 id="cg-tech-docs.summary"
       xreflabel="Summary of performance features">
<title>Summary of performance features</title>

<para>Quite a lot of work has gone into making the profiling as
fast as possible.  This is a summary of the important
features:</para>

<itemizedlist>

  <listitem>
    <para>The basic block-level cost centre storage allows almost
    free cost centre lookup.</para>
  </listitem>

  <listitem>
    <para>Only one function call is made per instruction
    simulated; even this accounts for a sizeable percentage of
    execution time, but it seems unavoidable if we want
    flexibility in the cache simulator.</para>
  </listitem>

  <listitem>
    <para>Unchanging information about an instruction is stored
    in its cost centre, avoiding unnecessary argument pushing,
    and minimising UCode instrumentation bloat.</para>
  </listitem>

  <listitem>
    <para>Summary counts are calculated at the end, rather than
    during execution.</para>
  </listitem>

  <listitem>
    <para>The <computeroutput>cachegrind.out</computeroutput>
    output files can contain huge amounts of information; the file
    format was carefully chosen to minimise file sizes.</para>
  </listitem>

</itemizedlist>

</sect1>



<sect1 id="cg-tech-docs.annotate" xreflabel="Annotation">
<title>Annotation</title>

<para>Annotation is done by cg_annotate.  It is a fairly
straightforward Perl script that slurps up all the cost centres,
and then runs through all the chosen source files, printing out
the cost centres alongside them.  It too has been carefully
optimised.</para>

</sect1>



<sect1 id="cg-tech-docs.extensions" xreflabel="Similar work, extensions">
<title>Similar work, extensions</title>

<para>It would be relatively straightforward to do other
simulations and obtain line-by-line information about interesting
events.  A good example would be branch prediction -- all
branches could be instrumented to interact with a branch
prediction simulator, using very similar techniques to those
described above.</para>

<para>In particular, cg_annotate would not need to change -- the
file format is such that it is not specific to the cache
simulation, but could be used for any kind of line-by-line
information.  The only part of cg_annotate that is specific to
the cache simulation is the name of the input file
(<computeroutput>cachegrind.out</computeroutput>), although it
would be very simple to add an option to control this.</para>

</sect1>

</chapter>