<!-- sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 -->
<html>
<head>
<style type="text/css">
  body      { background-color: #ffffff;
              color: #000000;
              font-family: Times, Helvetica, Arial;
              font-size: 14pt}
  h4        { margin-bottom: 0.3em}
  code      { color: #000000;
              font-family: Courier;
              font-size: 13pt }
  pre       { color: #000000;
              font-family: Courier;
              font-size: 13pt }
  a:link    { color: #0000C0;
              text-decoration: none; }
  a:visited { color: #0000C0;
              text-decoration: none; }
  a:active  { color: #0000C0;
              text-decoration: none; }
</style>
<title>How Cachegrind works</title>
</head>

<body bgcolor="#ffffff">

<a name="title"> </a>
<h1 align=center>How Cachegrind works</h1>

<center>
Detailed technical notes for hackers, maintainers and the
overly-curious<br>
These notes pertain to snapshot 20020306<br>
<p>
<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
<a href="http://developer.kde.org/~sewardj">http://developer.kde.org/~sewardj</a><br>
Copyright © 2000-2002 Julian Seward
<p>
Valgrind is licensed under the GNU General Public License,
version 2<br>
An open-source tool for finding memory-management problems in
x86 GNU/Linux executables.
</center>

<p>

<hr width="100%">

<h2>Cache profiling</h2>
Valgrind is a very nice platform for doing cache profiling and other kinds of
simulation, because it converts horrible x86 instructions into nice clean
RISC-like UCode.  For example, for cache profiling we are interested in
instructions that read and write memory;  in UCode there are only four
instructions that do this: <code>LOAD</code>, <code>STORE</code>,
<code>FPU_R</code> and <code>FPU_W</code>.  By contrast, because of the x86
addressing modes, almost every x86 instruction can read or write memory.<p>

Most of the cache profiling machinery is in the file
<code>vg_cachesim.c</code>.<p>

These notes are a somewhat haphazard guide to how Valgrind's cache profiling
works.<p>

<h3>Cost centres</h3>
Valgrind gathers cache profiling information about every instruction
executed, individually.  Each instruction has a <b>cost centre</b> associated
with it.  There are two kinds of cost centre: one for instructions that don't
reference memory (<code>iCC</code>), and one for instructions that do
(<code>idCC</code>):

<pre>
typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC;
</pre>

Each <code>CC</code> has three fields <code>a</code>, <code>m1</code>,
<code>m2</code> for recording references, level 1 misses and level 2 misses.
Each of these is a 64-bit <code>ULong</code> -- the numbers can get very
large, i.e. greater than the ~4.2 billion allowed by a 32-bit unsigned
int.<p>

An <code>iCC</code> has one <code>CC</code> for instruction cache accesses.
An <code>idCC</code> has two: one for instruction cache accesses, and one for
data cache accesses.<p>

The <code>iCC</code> and <code>idCC</code> structs also store unchanging
information about the instruction:
<ul>
  <li>An instruction-type identification tag (explained below)</li><p>
  <li>Instruction size</li><p>
  <li>Data reference size (<code>idCC</code> only)</li><p>
  <li>Instruction address</li><p>
</ul>

Note that the data address is not one of the fields for <code>idCC</code>.
This is because for many memory-referencing instructions the data address can
change each time the instruction is executed (e.g. if it uses register-offset
addressing).  We have to give this item to the cache simulation in a
different way (see the Instrumentation section below).  Some
memory-referencing instructions do always reference the same address, but we
don't try to treat them specially, in order to keep things simple.<p>
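
To make the role of the three <code>CC</code> fields concrete, here is a
minimal sketch of how a simulator might bump them once an access has been
classified.  This is not Valgrind's actual code: the result codes and the
function name are invented for illustration.

```c
#include <stdint.h>

typedef uint64_t ULong;
typedef struct { ULong a, m1, m2; } CC;  /* accesses, L1 misses, L2 misses */

/* Hypothetical outcome of a two-level cache lookup. */
typedef enum { HIT_L1, MISS_L1_HIT_L2, MISS_L2 } SimResult;

/* Every access bumps `a`; any L1 miss also bumps `m1`; an access that
   misses in L2 as well additionally bumps `m2`. */
static void cc_update(CC *cc, SimResult r)
{
    cc->a++;
    if (r != HIT_L1)  cc->m1++;
    if (r == MISS_L2) cc->m2++;
}
```

An <code>idCC</code> simply holds two such <code>CC</code>s, updated
independently for the instruction fetch and for the data reference.<p>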

Also note that there is only room for recording info about one data cache
access in an <code>idCC</code>.  So what about instructions that do a read
then a write, such as:

<blockquote><code>inc (%esi)</code></blockquote>

In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
since it immediately follows the read, which will drag the block into the
cache if it's not already there.  So the write access isn't really
interesting, and Valgrind doesn't record it.  This means that Valgrind
doesn't measure memory references, but rather memory references that could
miss in the cache.  This behaviour is the same as that used by the AMD Athlon
hardware counters.  It also has the benefit of simplifying the implementation
-- instructions that read and write memory can be treated like instructions
that read memory.<p>

<h3>Storing cost centres</h3>
Cost centres are stored in a way that makes them very cheap to look up, which
is important since one is looked up for every original x86 instruction
executed.<p>

Valgrind does JIT translations at the basic block level, and cost centres are
also set up and stored at the basic block level.  By doing things carefully,
we store all the cost centres for a basic block in a contiguous array, and
lookup comes almost for free.<p>

Consider this part of a basic block (for exposition purposes, pretend it's an
entire basic block):

<pre>
  movl $0x0,%eax
  movl $0x99, -4(%ebp)
</pre>

The translation to UCode looks like this:

<pre>
  MOVL      $0x0, t20
  PUTL      t20, %EAX
  INCEIPo   $5

  LEA1L     -4(t4), t14
  MOVL      $0x99, t18
  STL       t18, (t14)
  INCEIPo   $7
</pre>

The first step is to allocate the cost centres.  This requires a preliminary
pass to count how many x86 instructions were in the basic block, and their
types (and thus sizes).  UCode translations for single x86 instructions are
delimited by the <code>INCEIPo</code> instruction, the argument of which
gives the byte size of the instruction (note that lazy INCEIP updating is
turned off to allow this).<p>

We can tell if an x86 instruction references memory by looking for
<code>LDL</code> and <code>STL</code> UCode instructions, and thus what kind
of cost centre is required.  From this we can determine how many cost centres
we need for the basic block, and their sizes.  We can then allocate them in a
single array.<p>

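
As a sketch of this sizing arithmetic (the struct sizes and helper names here
are illustrative, assuming the 32-bit layouts pictured in the next section,
not taken from Valgrind's source):

```c
#include <stddef.h>

/* One entry per x86 instruction found by the preliminary pass: all the
   allocator needs to know is whether the instruction references memory. */
typedef struct { int is_mem_ref; } InstrInfo;

/* Illustrative sizes for the two cost-centre layouts on a 32-bit target:
   an iCC is an 8-byte header plus one 24-byte CC; an idCC is an 8-byte
   header plus two CCs. */
enum { ICC_BYTES = 8 + 24, IDCC_BYTES = 8 + 2*24 };

/* Total bytes needed for the basic block's contiguous cost-centre array. */
static size_t cc_array_bytes(const InstrInfo *instrs, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++)
        bytes += instrs[i].is_mem_ref ? IDCC_BYTES : ICC_BYTES;
    return bytes;
}
```

For the two-instruction example above (one non-memory, one
memory-referencing), this gives 32 + 56 = 88 bytes.<p>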
Consider the example code above.  After the preliminary pass, we know we need
two cost centres, one <code>iCC</code> and one <code>idCC</code>.  So we
allocate an array to store these, which looks like this:

<pre>
|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)

|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)
|(uninit)|      D.a         (8 bytes)
|(uninit)|      D.m1        (8 bytes)
|(uninit)|      D.m2        (8 bytes)
</pre>

(We can see now why we need tags to distinguish between the two types of cost
centres.)<p>

We also record the size of the array.  We look up the debug info of the first
instruction in the basic block, and then stick the array into a table indexed
by filename and function name.  This makes it easy to dump the information
quickly to file at the end.<p>

<h3>Instrumentation</h3>
The instrumentation pass has two main jobs:

<ol>
  <li>Fill in the gaps in the allocated cost centres.</li><p>
  <li>Add UCode to call the cache simulator for each instruction.</li><p>
</ol>

The instrumentation pass steps through the UCode and the cost centres in
tandem.  As each original x86 instruction's UCode is processed, the
appropriate gaps in the instruction's cost centre are filled in, for example:

<pre>
|INSTR_CC|      tag         (1 byte)
|5       |      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|i_addr1 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)

|WRITE_CC|      tag         (1 byte)
|7       |      instr_size  (1 byte)
|4       |      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|i_addr2 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)
|0       |      D.a         (8 bytes)
|0       |      D.m1        (8 bytes)
|0       |      D.m2        (8 bytes)
</pre>

(Note that this step is not performed if a basic block is re-translated;  see
<a href="#retranslations">here</a> for more information.)<p>

GCC inserts padding after the <code>instr_size</code> field (and after
<code>data_size</code> in an <code>idCC</code>) so that the
<code>instr_addr</code> field is word-aligned.<p>
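
The placement of that padding can be checked with <code>offsetof</code>.  The
following sketch uses stand-in typedefs (assuming a 4-byte <code>Addr</code>,
as on x86) rather than Valgrind's own headers:

```c
#include <stddef.h>

typedef unsigned char      UChar;
typedef unsigned int       Addr;   /* assumed 4 bytes, as on x86 */
typedef unsigned long long ULong;

typedef struct { ULong a, m1, m2; } CC;

typedef struct {
    UChar tag;          /* offset 0 */
    UChar instr_size;   /* offset 1 */
    /* 2 bytes of compiler-inserted padding here */
    Addr  instr_addr;   /* offset 4: word-aligned thanks to the padding */
    CC    I;            /* offset 8 */
} iCC;
```

So <code>offsetof(iCC, instr_addr)</code> is 4, matching the "(padding)
(2 bytes)" slot in the diagrams above.<p>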

The instrumentation added to call the cache simulation function looks like
this (instrumentation is indented to distinguish it from the original UCode):

<pre>
  MOVL      $0x0, t20
  PUTL      t20, %EAX
    PUSHL     %eax
    PUSHL     %ecx
    PUSHL     %edx
    MOVL      $0x4091F8A4, t46  # address of 1st CC
    PUSHL     t46
    CALLMo    $0x12             # first cachesim function
    CLEARo    $0x4
    POPL      %edx
    POPL      %ecx
    POPL      %eax
  INCEIPo   $5

  LEA1L     -4(t4), t14
  MOVL      $0x99, t18
    MOVL      t14, t42
  STL       t18, (t14)
    PUSHL     %eax
    PUSHL     %ecx
    PUSHL     %edx
    PUSHL     t42
    MOVL      $0x4091F8C4, t44  # address of 2nd CC
    PUSHL     t44
    CALLMo    $0x13             # second cachesim function
    CLEARo    $0x8
    POPL      %edx
    POPL      %ecx
    POPL      %eax
  INCEIPo   $7
</pre>

Consider the first instruction's UCode.  Each call is surrounded by three
<code>PUSHL</code> and <code>POPL</code> instructions to save and restore the
caller-save registers.  Then the address of the instruction's cost centre is
pushed onto the stack, to be the first argument to the cache simulation
function.  The address is known at this point because we are doing a
simultaneous pass through the cost centre array.  This means the cost centre
lookup for each instruction is almost free (just the cost of pushing an
argument for a function call).  Then the call to the cache simulation
function for non-memory-reference instructions is made (note that the
<code>CALLMo</code> UInstruction takes an offset into a table of predefined
functions;  it is not an absolute address), and the single argument is
<code>CLEAR</code>ed from the stack.<p>

The second instruction's UCode is similar.  The only difference is that, as
mentioned before, we have to pass the address of the data item referenced to
the cache simulation function too.  This explains the <code>MOVL t14,
t42</code> and <code>PUSHL t42</code> UInstructions.  (Note that the
seemingly redundant <code>MOV</code>ing will probably be optimised away
during register allocation.)<p>
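
In C terms, the two call targets look roughly like this.  This is a sketch
under assumed names -- the real functions live in <code>vg_cachesim.c</code>
and perform real cache lookups; here a stub lookup that always hits keeps the
example self-contained:

```c
typedef unsigned int       Addr;
typedef unsigned long long ULong;
typedef struct { ULong a, m1, m2; } CC;
typedef struct { unsigned char tag, instr_size; Addr instr_addr; CC I; } iCC;
typedef struct { unsigned char tag, instr_size, data_size;
                 Addr instr_addr; CC I, D; } idCC;

/* Stub cache lookup: returns nonzero on a miss.  Always hits here. */
static int cachesim_ref(Addr a, int size) { (void)a; (void)size; return 0; }

/* Target of the first CALLMo: instructions with no memory reference.
   The pushed cost-centre address arrives as the single argument. */
static void log_non_mem_instr(iCC *cc)
{
    cc->I.a++;
    if (cachesim_ref(cc->instr_addr, cc->instr_size)) cc->I.m1++;
}

/* Target of the second CALLMo: the run-time data address (PUSHL t42
   above) arrives as an extra argument. */
static void log_mem_instr(idCC *cc, Addr data_addr)
{
    cc->I.a++;
    if (cachesim_ref(cc->instr_addr, cc->instr_size)) cc->I.m1++;
    cc->D.a++;
    if (cachesim_ref(data_addr, cc->data_size)) cc->D.m1++;
}
```

Note how all the unchanging operands come from the cost centre itself; only
the data address has to be supplied at run time.<p>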

Note that instead of storing unchanging information about each instruction
(instruction size, data size, etc.) in its cost centre, we could have passed
these arguments to the simulation function instead.  But this would slow the
calls down (two or three extra arguments pushed onto the stack).  Also it
would bloat the UCode instrumentation by amounts similar to the space
required for them in the cost centre;  bloated UCode would also fill the
translation cache more quickly, requiring more translations for large
programs and slowing them down more.<p>

<a name="retranslations"></a>
<h3>Handling basic block retranslations</h3>
The above description ignores one complication.  Valgrind has a limited-size
cache for basic block translations;  if it fills up, old translations are
discarded.  If a discarded basic block is executed again, it must be
re-translated.<p>

However, we can't do the same for profiling -- we can't throw away cost
centres for instructions in the middle of execution!  So when a basic block
is translated, we first look for its cost centre array in the hash table.
If there is no cost centre array, it must be the first translation, so we
proceed as described above.  But if there is a cost centre array already, it
must be a retranslation.  In this case, we skip the cost centre allocation
and initialisation steps, but still do the UCode instrumentation step.<p>

<h3>The cache simulation</h3>
The cache simulation is fairly straightforward.  It just tracks which memory
blocks are in the cache at the moment (it doesn't track the contents, since
that is irrelevant).<p>
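
A toy version of such a simulation, tracking only which block occupies each
line, might look like this.  It is a direct-mapped sketch with invented
sizes; Cachegrind's real simulation is set-associative with LRU replacement,
configured separately for each level:

```c
#include <string.h>

enum { LINES = 256, LINE_BITS = 6 };   /* 256 lines of 64 bytes: 16KB */

typedef struct {
    unsigned int  block[LINES];   /* which memory block is resident */
    unsigned char valid[LINES];
} Cache;

static void cache_init(Cache *c) { memset(c, 0, sizeof *c); }

/* Returns 1 on a miss (installing the block), 0 on a hit.  Only block
   identities are tracked -- never the cached data itself. */
static int cache_access(Cache *c, unsigned int addr)
{
    unsigned int blk  = addr >> LINE_BITS;
    unsigned int line = blk % LINES;
    if (c->valid[line] && c->block[line] == blk) return 0;
    c->valid[line] = 1;
    c->block[line] = blk;
    return 1;
}
```

This also shows why the read-then-write case discussed earlier needs no
separate write probe: the read installs the block, so an immediately
following access to the same address is a guaranteed hit.<p>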

The interface to the simulation is quite clean.  The functions called from
the UCode contain calls to the simulation functions in the files
<code>vg_cachesim_{I1,D1,L2}.c</code>;  these calls are inlined so that only
one function call is done per simulated x86 instruction.  The file
<code>vg_cachesim.c</code> simply <code>#include</code>s the three files
containing the simulation, which makes plugging in new cache simulations
very easy -- you just replace the three files and recompile.<p>

<h3>Output</h3>
Output is fairly straightforward, basically printing the cost centre for
every instruction, grouped by files and functions.  Total counts (e.g. total
cache accesses, total L1 misses) are calculated when traversing this
structure rather than during execution, to save time;  the cache simulation
functions are called so often that even one or two extra adds can make a
sizeable difference.<p>

The output file (which cg_annotate reads as its input) has the following
format:

<pre>
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."
</pre>

Where:

<ul>
  <li><code>non_nl_string</code> is any string not containing a newline.</li><p>
  <li><code>cmd</code> is a command line invocation.</li><p>
  <li><code>filename</code> and <code>fn_name</code> can be anything.</li><p>
  <li><code>num</code> and <code>line_num</code> are decimal numbers.</li><p>
  <li><code>ws</code> is whitespace.</li><p>
  <li><code>nl</code> is a newline.</li><p>
</ul>

The contents of the "desc:" lines are printed out at the top of the summary.
This is a generic way of providing simulation-specific information, e.g. for
giving the cache configuration used by the cache simulation.<p>

Counts can be "." to represent "N/A", e.g. the number of write misses for an
instruction that doesn't write to memory.<p>

The number of counts in each <code>count_line</code> and in the
<code>summary_line</code> should not exceed the number of events in the
<code>events_line</code>.  If a <code>count_line</code> has fewer,
cg_annotate treats the missing ones as though they were "." entries.<p>

A <code>file_line</code> changes the current file name.  A
<code>fn_line</code> changes the current function name.  A
<code>count_line</code> contains counts that pertain to the current
filename/fn_name.  A "fl=" <code>file_line</code> and a <code>fn_line</code>
must appear before any <code>count_line</code>s, to give the context of the
first <code>count_line</code>s.<p>

Each <code>file_line</code> should be immediately followed by a
<code>fn_line</code>.  "fi=" <code>file_line</code>s are used to switch
filenames for inlined functions;  "fe=" <code>file_line</code>s are similar,
but are put at the end of a basic block in which the file name hasn't been
switched back to the original file name.  ("fi=" and "fe=" lines behave the
same;  they are distinguished only to help debugging.)<p>
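
For concreteness, a tiny hand-made file conforming to this grammar might look
as follows.  The program name, line numbers and counts are all invented; the
event names are those used by the cache simulation:

```
desc: I1 cache: 65536 B, 64 B, 2-way associative
cmd: ./hello
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=hello.c
fn=main
5 1 0 0 . . . . . .
6 2 0 0 1 1 0 1 0 0
summary: 3 0 0 1 1 0 1 0 0
```

Line 5 of <code>hello.c</code> makes no data accesses, so its data counts are
all ".";  the summary line is the column-wise total.<p>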

<h3>Summary of performance features</h3>
Quite a lot of work has gone into making the profiling as fast as possible.
This is a summary of the important features:

<ul>
  <li>The basic-block-level cost centre storage allows almost free cost
      centre lookup.</li><p>

  <li>Only one function call is made per instruction simulated;  even this
      accounts for a sizeable percentage of execution time, but it seems
      unavoidable if we want flexibility in the cache simulator.</li><p>

  <li>Unchanging information about an instruction is stored in its cost
      centre, avoiding unnecessary argument pushing, and minimising UCode
      instrumentation bloat.</li><p>

  <li>Summary counts are calculated at the end, rather than during
      execution.</li><p>

  <li>The <code>cachegrind.out</code> output files can contain huge amounts
      of information;  the file format was carefully chosen to minimise file
      sizes.</li><p>
</ul>

<h3>Annotation</h3>
Annotation is done by cg_annotate.  It is a fairly straightforward Perl
script that slurps up all the cost centres, and then runs through all the
chosen source files, printing out cost centres alongside them.  It too has
been carefully optimised.<p>

<h3>Similar work, extensions</h3>
It would be relatively straightforward to do other simulations and obtain
line-by-line information about interesting events.  A good example would be
branch prediction -- all branches could be instrumented to interact with a
branch prediction simulator, using very similar techniques to those described
above.<p>

In particular, cg_annotate would not need to change -- the file format is
such that it is not specific to the cache simulation, but could be used for
any kind of line-by-line information.  The only part of cg_annotate that is
specific to the cache simulation is the name of the input file
(<code>cachegrind.out</code>), although it would be very simple to add an
option to control this.<p>

</body>
</html>