<html>
 <head>
  <style type="text/css">
    body      { background-color: #ffffff;
                color: #000000;
                font-family: Times, Helvetica, Arial;
                font-size: 14pt}
    h4        { margin-bottom: 0.3em}
    code      { color: #000000;
                font-family: Courier;
                font-size: 13pt }
    pre       { color: #000000;
                font-family: Courier;
                font-size: 13pt }
    a:link    { color: #0000C0;
                text-decoration: none; }
    a:visited { color: #0000C0;
                text-decoration: none; }
    a:active  { color: #0000C0;
                text-decoration: none; }
  </style>
  <title>How Cachegrind works</title>
 </head>

<body bgcolor="#ffffff">

<a name="title">&nbsp;</a>
<h1 align=center>How Cachegrind works</h1>

<center>
Detailed technical notes for hackers, maintainers and the
overly-curious<br>
These notes pertain to snapshot 20020306<br>
<p>
<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
<a href="http://developer.kde.org/~sewardj">http://developer.kde.org/~sewardj</a><br>
Copyright &copy; 2000-2002 Julian Seward
<p>
Valgrind is licensed under the GNU General Public License,
version 2<br>
An open-source tool for finding memory-management problems in
x86 GNU/Linux executables.
</center>

<p>

<hr width="100%">

<h2>Cache profiling</h2>
Valgrind is a very nice platform for doing cache profiling and other kinds of
simulation, because it converts horrible x86 instructions into nice clean
RISC-like UCode.  For example, for cache profiling we are interested in
instructions that read and write memory; in UCode there are only four
instructions that do this: <code>LOAD</code>, <code>STORE</code>,
<code>FPU_R</code> and <code>FPU_W</code>.  By contrast, because of the x86
addressing modes, almost every x86 instruction can read or write memory.<p>

Most of the cache profiling machinery is in the file
<code>vg_cachesim.c</code>.<p>

These notes are a somewhat haphazard guide to how Valgrind's cache profiling
works.<p>

<h3>Cost centres</h3>
Valgrind gathers cache profiling information about every instruction executed,
individually.  Each instruction has a <b>cost centre</b> associated with it.
There are two kinds of cost centre: one for instructions that don't reference
memory (<code>iCC</code>), and one for instructions that do
(<code>idCC</code>):

<pre>
typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC;
</pre>

Each <code>CC</code> has three fields, <code>a</code>, <code>m1</code> and
<code>m2</code>, for recording references, level 1 misses and level 2 misses.
Each of these is a 64-bit <code>ULong</code> -- the numbers can get very large,
ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.<p>

An <code>iCC</code> has one <code>CC</code>, for instruction cache accesses.
An <code>idCC</code> has two: one for instruction cache accesses, and one for
data cache accesses.<p>
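
As an aside, a <code>CC</code>'s counts translate into miss rates at output
time.  A minimal C sketch, where the <code>l1_miss_rate</code> and
<code>l2_miss_rate</code> helpers are illustrative only and not functions from
Valgrind:

```c
typedef unsigned long long ULong;

/* Mirrors the CC struct above: accesses, L1 misses, L2 misses. */
typedef struct { ULong a; ULong m1; ULong m2; } CC;

/* Illustrative helper (not part of Valgrind): L1 miss rate as a
   percentage of all accesses. */
static double l1_miss_rate(const CC* cc)
{
   return cc->a == 0 ? 0.0 : 100.0 * (double)cc->m1 / (double)cc->a;
}

/* Likewise for the L2 miss rate. */
static double l2_miss_rate(const CC* cc)
{
   return cc->a == 0 ? 0.0 : 100.0 * (double)cc->m2 / (double)cc->a;
}
```

For example, a <code>CC</code> of {1000, 50, 5} gives an L1 miss rate of 5%
and an L2 miss rate of 0.5%.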

The <code>iCC</code> and <code>idCC</code> structs also store unchanging
information about the instruction:
<ul>
  <li>An instruction-type identification tag (explained below)</li><p>
  <li>Instruction size</li><p>
  <li>Data reference size (<code>idCC</code> only)</li><p>
  <li>Instruction address</li><p>
</ul>

Note that the data address is not one of the fields for <code>idCC</code>.
This is because for many memory-referencing instructions the data address can
change each time the instruction is executed (eg. if it uses register-offset
addressing).  We have to give this item to the cache simulation in a different
way (see the Instrumentation section below).  Some memory-referencing
instructions do always reference the same address, but we don't try to treat
them specially, in order to keep things simple.<p>

Also note that there is only room for recording info about one data cache
access in an <code>idCC</code>.  So what about instructions that do a read then
a write, such as:

<blockquote><code>incl (%esi)</code></blockquote>

In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
since it immediately follows the read, which will drag the block into the cache
if it's not already there.  So the write access isn't really interesting, and
Valgrind doesn't record it.  This means that Valgrind doesn't measure
memory references, but rather memory references that could miss in the cache.
This behaviour is the same as that used by the AMD Athlon hardware counters.
It also has the benefit of simplifying the implementation -- instructions that
read and write memory can be treated like instructions that read memory.<p>

<h3>Storing cost-centres</h3>
Cost centres are stored in a way that makes them very cheap to look up, which
is important since one is looked up for every original x86 instruction
executed.<p>

Valgrind does JIT translations at the basic block level, and cost centres are
also set up and stored at the basic block level.  By doing things carefully, we
store all the cost centres for a basic block in a contiguous array, and lookup
comes almost for free.<p>

Consider this part of a basic block (for exposition purposes, pretend it's an
entire basic block):

<pre>
movl $0x0,%eax
movl $0x99, -4(%ebp)
</pre>

The translation to UCode looks like this:

<pre>
MOVL      $0x0, t20
PUTL      t20, %EAX
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
STL       t18, (t14)
INCEIPo   $7
</pre>

The first step is to allocate the cost centres.  This requires a preliminary
pass to count how many x86 instructions were in the basic block, and their
types (and thus sizes).  UCode translations for single x86 instructions are
delimited by the <code>INCEIPo</code> instruction, the argument of which gives
the byte size of the instruction (note that lazy INCEIP updating is turned off
to allow this).<p>

We can tell if an x86 instruction references memory by looking for
<code>LDL</code> and <code>STL</code> UCode instructions, and thus what kind of
cost centre is required.  From this we can determine how many cost centres we
need for the basic block, and their sizes.  We can then allocate them in a
single array.<p>

Consider the example code above.  After the preliminary pass, we know we need
two cost centres, one <code>iCC</code> and one <code>idCC</code>.  So we
allocate an array to store these which looks like this:

<pre>
|(uninit)|   tag         (1 byte)
|(uninit)|   instr_size  (1 byte)
|(uninit)|   (padding)   (2 bytes)
|(uninit)|   instr_addr  (4 bytes)
|(uninit)|   I.a         (8 bytes)
|(uninit)|   I.m1        (8 bytes)
|(uninit)|   I.m2        (8 bytes)

|(uninit)|   tag         (1 byte)
|(uninit)|   instr_size  (1 byte)
|(uninit)|   data_size   (1 byte)
|(uninit)|   (padding)   (1 byte)
|(uninit)|   instr_addr  (4 bytes)
|(uninit)|   I.a         (8 bytes)
|(uninit)|   I.m1        (8 bytes)
|(uninit)|   I.m2        (8 bytes)
|(uninit)|   D.a         (8 bytes)
|(uninit)|   D.m1        (8 bytes)
|(uninit)|   D.m2        (8 bytes)
</pre>

(We can see now why we need tags to distinguish between the two types of cost
centres.)<p>

We also record the size of the array.  We look up the debug info of the first
instruction in the basic block, and then stick the array into a table indexed
by filename and function name.  This makes it easy to dump the information
quickly to file at the end.<p>
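
To make the packed-array scheme concrete, here is a hedged C sketch of walking
such an array, using the tag byte to decide each entry's size.  The struct
layouts follow the description above, but the tag values and the
<code>total_I_accesses</code> helper are assumptions for illustration:

```c
#include <stddef.h>

typedef unsigned long long ULong;
typedef unsigned int       Addr;
typedef unsigned char      UChar;

typedef struct { ULong a, m1, m2; } CC;

/* Layouts as described above; the compiler's own padding provides the
   word alignment the text mentions. */
typedef struct { UChar tag; UChar instr_size;
                 Addr instr_addr; CC I; } iCC;
typedef struct { UChar tag; UChar instr_size; UChar data_size;
                 Addr instr_addr; CC I; CC D; } idCC;

/* Tag values are assumptions for this sketch. */
enum { INSTR_CC = 1, READ_CC = 2, WRITE_CC = 3 };

/* Walk a packed cost-centre array of n entries, summing the I-cache
   access counts.  The tag byte at the front of each entry says which
   struct it is, and therefore how far to advance. */
static ULong total_I_accesses(const void* arr, int n)
{
   const UChar* p = arr;
   ULong total = 0;
   for (int i = 0; i < n; i++) {
      if (*p == INSTR_CC) {
         total += ((const iCC*)p)->I.a;
         p += sizeof(iCC);
      } else {                        /* READ_CC or WRITE_CC */
         total += ((const idCC*)p)->I.a;
         p += sizeof(idCC);
      }
   }
   return total;
}
```

Since both struct sizes are multiples of their 8-byte alignment, entries can
sit back-to-back in one allocation without misaligning the counters.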

<h3>Instrumentation</h3>
The instrumentation pass has two main jobs:

<ol>
  <li>Fill in the gaps in the allocated cost centres.</li><p>
  <li>Add UCode to call the cache simulator for each instruction.</li><p>
</ol>

The instrumentation pass steps through the UCode and the cost centres in
tandem.  As each original x86 instruction's UCode is processed, the appropriate
gaps in the instruction's cost centre are filled in, for example:

<pre>
|INSTR_CC|   tag         (1 byte)
|5       |   instr_size  (1 byte)
|(uninit)|   (padding)   (2 bytes)
|i_addr1 |   instr_addr  (4 bytes)
|0       |   I.a         (8 bytes)
|0       |   I.m1        (8 bytes)
|0       |   I.m2        (8 bytes)

|WRITE_CC|   tag         (1 byte)
|7       |   instr_size  (1 byte)
|4       |   data_size   (1 byte)
|(uninit)|   (padding)   (1 byte)
|i_addr2 |   instr_addr  (4 bytes)
|0       |   I.a         (8 bytes)
|0       |   I.m1        (8 bytes)
|0       |   I.m2        (8 bytes)
|0       |   D.a         (8 bytes)
|0       |   D.m1        (8 bytes)
|0       |   D.m2        (8 bytes)
</pre>

(Note that this step is not performed if a basic block is re-translated; see
<a href="#retranslations">here</a> for more information.)<p>

GCC inserts the padding after the size fields so that the
<code>instr_addr</code> field is word-aligned.<p>

The instrumentation added to call the cache simulation function looks like this
(instrumentation is indented to distinguish it from the original UCode):

<pre>
MOVL $0x0, t20
PUTL t20, %EAX
  PUSHL %eax
  PUSHL %ecx
  PUSHL %edx
  MOVL $0x4091F8A4, t46  # address of 1st CC
  PUSHL t46
  CALLMo $0x12           # first cachesim function
  CLEARo $0x4
  POPL %edx
  POPL %ecx
  POPL %eax
INCEIPo $5

LEA1L -4(t4), t14
MOVL $0x99, t18
  MOVL t14, t42
STL t18, (t14)
  PUSHL %eax
  PUSHL %ecx
  PUSHL %edx
  PUSHL t42
  MOVL $0x4091F8C4, t44  # address of 2nd CC
  PUSHL t44
  CALLMo $0x13           # second cachesim function
  CLEARo $0x8
  POPL %edx
  POPL %ecx
  POPL %eax
INCEIPo $7
</pre>

Consider the first instruction's UCode.  Each call is surrounded by three
<code>PUSHL</code> and <code>POPL</code> instructions to save and restore the
caller-save registers.  Then the address of the instruction's cost centre is
pushed onto the stack, to be the first argument to the cache simulation
function.  The address is known at this point because we are doing a
simultaneous pass through the cost centre array.  This means the cost centre
lookup for each instruction is almost free (just the cost of pushing an
argument for a function call).  Then the call to the cache simulation function
for non-memory-reference instructions is made (note that the
<code>CALLMo</code> UInstruction takes an offset into a table of predefined
functions; it is not an absolute address), and the single argument is
<code>CLEAR</code>ed from the stack.<p>

The second instruction's UCode is similar.  The only difference is that, as
mentioned before, we have to pass the address of the data item referenced to
the cache simulation function too.  This explains the <code>MOVL t14,
t42</code> and <code>PUSHL t42</code> UInstructions.  (Note that the seemingly
redundant <code>MOV</code>ing will probably be optimised away during register
allocation.)<p>
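
In C terms, the two simulation entry points reached through the
<code>CALLMo</code> table look roughly like this.  The function names here are
invented for illustration, and the actual miss-simulation logic is elided:

```c
typedef unsigned long long ULong;
typedef unsigned int       Addr;
typedef unsigned char      UChar;

typedef struct { ULong a, m1, m2; } CC;
typedef struct { UChar tag; UChar instr_size;
                 Addr instr_addr; CC I; } iCC;
typedef struct { UChar tag; UChar instr_size; UChar data_size;
                 Addr instr_addr; CC I; CC D; } idCC;

/* Called (via CALLMo) for instructions that don't touch memory:
   one argument, the cost centre itself. */
static void cachesim_log_non_mem_instr(iCC* cc)
{
   cc->I.a++;   /* I-cache access for the instruction fetch */
   /* ... the I1/L2 simulation would bump cc->I.m1 / cc->I.m2 here ... */
}

/* Called for memory-referencing instructions: the data address is the
   extra argument, since it can change on every execution. */
static void cachesim_log_mem_instr(idCC* cc, Addr data_addr)
{
   cc->I.a++;   /* instruction fetch */
   cc->D.a++;   /* data access at data_addr */
   (void)data_addr;
   /* ... the D1/L2 simulation would bump cc->D.m1 / cc->D.m2 here ... */
}
```

This also shows why the memory-referencing call needs <code>CLEARo $0x8</code>
above: it has two arguments on the stack where the other call has one.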

Note that instead of storing unchanging information about each instruction
(instruction size, data size, etc.) in its cost centre, we could have passed
these arguments to the simulation function.  But this would slow the calls down
(two or three extra arguments pushed onto the stack).  It would also bloat the
UCode instrumentation by amounts similar to the space required for them in the
cost centre; bloated UCode would fill the translation cache more quickly,
requiring more translations for large programs and slowing them down more.<p>

<a name="retranslations"></a>
<h3>Handling basic block retranslations</h3>
The above description ignores one complication.  Valgrind has a limited-size
cache for basic block translations; if it fills up, old translations are
discarded.  If a discarded basic block is executed again, it must be
re-translated.<p>

However, we can't use this approach for profiling -- we can't throw away cost
centres for instructions in the middle of execution!  So when a basic block is
translated, we first look for its cost centre array in the hash table.  If
there is no cost centre array, it must be the first translation, so we proceed
as described above.  But if there is a cost centre array already, it must be a
retranslation.  In this case, we skip the cost centre allocation and
initialisation steps, but still do the UCode instrumentation step.<p>

<h3>The cache simulation</h3>
The cache simulation is fairly straightforward.  It just tracks which memory
blocks are in the cache at the moment (it doesn't track the contents, since
that is irrelevant).<p>
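
As a sketch of the idea, here is a minimal block-tracking simulation for a
hypothetical direct-mapped cache with 64-byte lines.  Valgrind's real simulator
models the configured sizes and associativity, but likewise records only which
blocks are resident, never their contents:

```c
#define LINE_SIZE 64        /* bytes per cache line (assumed) */
#define NUM_LINES 256       /* a 16KB direct-mapped cache (assumed) */

typedef unsigned int Addr;

static Addr tags[NUM_LINES];     /* tag of the resident block per line */
static int  valid[NUM_LINES];    /* whether the line holds anything yet */

/* Simulate one access: return 1 on a miss, 0 on a hit.  Only the tag of
   the resident block is tracked; data contents are irrelevant. */
static int cache_access(Addr a)
{
   unsigned int line = (a / LINE_SIZE) % NUM_LINES;
   Addr tag = a / (LINE_SIZE * NUM_LINES);
   if (valid[line] && tags[line] == tag)
      return 0;                     /* hit */
   valid[line] = 1;                 /* miss: fill the line */
   tags[line]  = tag;
   return 1;
}
```

A second access to the same line then hits, which is exactly the property
exploited by the read-then-write argument earlier.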

The interface to the simulation is quite clean.  The functions called from the
UCode contain calls to the simulation functions in the files
<code>vg_cachesim_{I1,D1,L2}.c</code>; these calls are inlined so that only
one function call is done per simulated x86 instruction.  The file
<code>vg_cachesim.c</code> simply <code>#include</code>s the three files
containing the simulation, which makes plugging in new cache simulations
very easy -- you just replace the three files and recompile.<p>

<h3>Output</h3>
Output is fairly straightforward, basically printing the cost centre for every
instruction, grouped by files and functions.  Total counts (eg. total cache
accesses, total L1 misses) are calculated when traversing this structure rather
than during execution, to save time; the cache simulation functions are called
so often that even one or two extra adds can make a sizeable difference.<p>

The output file (which cg_annotate reads as its input) has the following
format:

<pre>
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."
</pre>

Where:

<ul>
  <li><code>non_nl_string</code> is any string not containing a newline.</li><p>
  <li><code>cmd</code> is a command line invocation.</li><p>
  <li><code>filename</code> and <code>fn_name</code> can be anything.</li><p>
  <li><code>num</code> and <code>line_num</code> are decimal numbers.</li><p>
  <li><code>ws</code> is whitespace.</li><p>
  <li><code>nl</code> is a newline.</li><p>
</ul>
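
Putting the grammar together, a small hand-made example file (the event names
are Cachegrind's usual ones; the counts are invented, not from a real run)
might look like:

```
desc: I1 cache:  65536 B, 64 B lines, 2-way associative
desc: D1 cache:  65536 B, 64 B lines, 2-way associative
desc: L2 cache: 262144 B, 64 B lines, 8-way associative
cmd: ./myprog
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=foo.c
fn=main
15 90 0 0 30 1 0 10 0 0
16 20 0 0 . . . 10 1 1
summary: 110 0 0 30 1 0 20 1 1
```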

The contents of the "desc:" lines are printed out at the top of the summary.
This is a generic way of providing simulation-specific information, eg. for
giving the cache configuration for the cache simulation.<p>

Counts can be "." to represent "N/A", eg. the number of write misses for an
instruction that doesn't write to memory.<p>

The number of counts in each <code>count_line</code> and the
<code>summary_line</code> should not exceed the number of events in the
<code>events_line</code>.  If the number in a <code>count_line</code> is less,
cg_annotate treats the missing counts as though they were "." entries.<p>

A <code>file_line</code> changes the current file name.  A <code>fn_line</code>
changes the current function name.  A <code>count_line</code> contains counts
that pertain to the current filename/fn_name.  An "fl=" <code>file_line</code>
and a <code>fn_line</code> must appear before any <code>count_line</code>s, to
give the context of the first <code>count_line</code>s.<p>

Each <code>file_line</code> should be immediately followed by a
<code>fn_line</code>.  "fi=" <code>file_line</code>s are used to switch
filenames for inlined functions; "fe=" <code>file_line</code>s are similar, but
are put at the end of a basic block in which the file name hasn't been switched
back to the original file name.  (fi and fe lines behave the same; they are
only distinguished to help debugging.)<p>


<h3>Summary of performance features</h3>
Quite a lot of work has gone into making the profiling as fast as possible.
This is a summary of the important features:

<ul>
  <li>The basic block-level cost centre storage allows almost free cost centre
      lookup.</li><p>

  <li>Only one function call is made per instruction simulated; even this
      accounts for a sizeable percentage of execution time, but it seems
      unavoidable if we want flexibility in the cache simulator.</li><p>

  <li>Unchanging information about an instruction is stored in its cost centre,
      avoiding unnecessary argument pushing, and minimising UCode
      instrumentation bloat.</li><p>

  <li>Summary counts are calculated at the end, rather than during
      execution.</li><p>

  <li>The <code>cachegrind.out</code> output files can contain huge amounts of
      information; the file format was carefully chosen to minimise file
      sizes.</li><p>
</ul>


<h3>Annotation</h3>
Annotation is done by cg_annotate.  It is a fairly straightforward Perl script
that slurps up all the cost centres, and then runs through all the chosen
source files, printing out cost centres with them.  It too has been carefully
optimised.<p>


<h3>Similar work, extensions</h3>
It would be relatively straightforward to do other simulations and obtain
line-by-line information about interesting events.  A good example would be
branch prediction -- all branches could be instrumented to interact with a
branch prediction simulator, using very similar techniques to those described
above.<p>

In particular, cg_annotate would not need to change -- the file format is such
that it is not specific to the cache simulation, but could be used for any kind
of line-by-line information.  The only part of cg_annotate that is specific to
the cache simulation is the name of the input file
(<code>cachegrind.out</code>), although it would be very simple to add an
option to control this.<p>

</body>
</html>