<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="cg-tech-docs" xreflabel="How Cachegrind works">

<title>How Cachegrind works</title>

<sect1 id="cg-tech-docs.profiling" xreflabel="Cache profiling">
<title>Cache profiling</title>

<para>[Note: this document is now very old, and a lot of its contents are out
of date, and misleading.]</para>

<para>Valgrind is a very nice platform for doing cache profiling
and other kinds of simulation, because it converts horrible x86
instructions into nice clean RISC-like UCode. For example, for
cache profiling we are interested in instructions that read and
write memory; in UCode there are only four instructions that do
this: <computeroutput>LOAD</computeroutput>,
<computeroutput>STORE</computeroutput>,
<computeroutput>FPU_R</computeroutput> and
<computeroutput>FPU_W</computeroutput>. By contrast, because of
the x86 addressing modes, almost every instruction can read or
write memory.</para>

<para>Most of the cache profiling machinery is in the file
<filename>vg_cachesim.c</filename>.</para>

<para>These notes are a somewhat haphazard guide to how
Valgrind's cache profiling works.</para>

</sect1>


<sect1 id="cg-tech-docs.costcentres" xreflabel="Cost centres">
<title>Cost centres</title>

<para>Valgrind gathers cache profiling information about every
instruction executed, individually. Each instruction has a
<command>cost centre</command> associated with it. There are two
kinds of cost centre: one for instructions that don't reference
memory (<computeroutput>iCC</computeroutput>), and one for
instructions that do (<computeroutput>idCC</computeroutput>):</para>

<programlisting><![CDATA[
typedef struct _CC {
   ULong a;              /* accesses (references) */
   ULong m1;             /* level 1 misses */
   ULong m2;             /* level 2 misses */
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;                 /* instruction cache counts */
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;                 /* instruction cache counts */
   CC D;                 /* data cache counts */
} idCC; ]]></programlisting>

<para>Each <computeroutput>CC</computeroutput> has three fields
<computeroutput>a</computeroutput>,
<computeroutput>m1</computeroutput>,
<computeroutput>m2</computeroutput> for recording references,
level 1 misses and level 2 misses. Each of these is a 64-bit
<computeroutput>ULong</computeroutput> -- the numbers can get
very large, i.e. greater than the roughly 4.2 billion that fits
in a 32-bit unsigned int.</para>
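
<para>As a rough illustration of the overflow concern (this is a
sketch, not Cachegrind code; the helper names
<computeroutput>bump32</computeroutput> and
<computeroutput>bump64</computeroutput> are made up for this
example), a 32-bit counter silently wraps to zero once it passes
4,294,967,295, whereas a 64-bit counter simply keeps counting:</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helpers, not part of Cachegrind: bump a counter of each width. */

/* Unsigned 32-bit arithmetic wraps modulo 2^32. */
static uint32_t bump32(uint32_t n) { return n + 1u; }

/* The 64-bit version has room for about 1.8e19 events. */
static uint64_t bump64(uint64_t n) { return n + 1u; }
```
]]></programlisting>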

<para>An <computeroutput>iCC</computeroutput> has one
<computeroutput>CC</computeroutput> for instruction cache
accesses. An <computeroutput>idCC</computeroutput> has two, one
for instruction cache accesses, and one for data cache
accesses.</para>

<para>The <computeroutput>iCC</computeroutput> and
<computeroutput>idCC</computeroutput> structs also store
unchanging information about the instruction:</para>
<itemizedlist>
<listitem>
<para>An instruction-type identification tag (explained
below)</para>
</listitem>
<listitem>
<para>Instruction size</para>
</listitem>
<listitem>
<para>Data reference size
(<computeroutput>idCC</computeroutput> only)</para>
</listitem>
<listitem>
<para>Instruction address</para>
</listitem>
</itemizedlist>

<para>Note that data address is not one of the fields for
<computeroutput>idCC</computeroutput>. This is because for many
memory-referencing instructions the data address can change each
time it's executed (eg. if it uses register-offset addressing).
We have to give this item to the cache simulation in a different
way (see Instrumentation section below). Some memory-referencing
instructions do always reference the same address, but we don't
try to treat them specially in order to keep things simple.</para>

<para>Also note that there is only room for recording info about
one data cache access in an
<computeroutput>idCC</computeroutput>. So what about
instructions that do a read then a write, such as:</para>
<programlisting><![CDATA[
incl (%esi)]]></programlisting>

<para>In a write-allocate cache, as simulated by Valgrind, the
write cannot miss, since it immediately follows the read which
will drag the block into the cache if it's not already there. So
the write access isn't really interesting, and Valgrind doesn't
record it. This means that Valgrind doesn't measure memory
references, but rather memory references that could miss in the
cache. This behaviour is the same as that used by the AMD Athlon
hardware counters. It also has the benefit of simplifying the
implementation -- instructions that read and write memory can be
treated like instructions that read memory.</para>

</sect1>


<sect1 id="cg-tech-docs.ccstore" xreflabel="Storing cost-centres">
<title>Storing cost-centres</title>

<para>Cost centres are stored in a way that makes them very cheap
to look up, which is important since one is looked up for every
original x86 instruction executed.</para>

<para>Valgrind does JIT translations at the basic block level,
and cost centres are also set up and stored at the basic block
level. By doing things carefully, we store all the cost centres
for a basic block in a contiguous array, and lookup comes almost
for free.</para>

<para>Consider this part of a basic block (for exposition
purposes, pretend it's an entire basic block):</para>
<programlisting><![CDATA[
movl $0x0,%eax
movl $0x99, -4(%ebp)]]></programlisting>

<para>The translation to UCode looks like this:</para>
<programlisting><![CDATA[
MOVL      $0x0, t20
PUTL      t20, %EAX
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
STL       t18, (t14)
INCEIPo   $7]]></programlisting>

<para>The first step is to allocate the cost centres. This
requires a preliminary pass to count how many x86 instructions
are in the basic block, and their types (and thus sizes). UCode
translations for single x86 instructions are delimited by the
<computeroutput>INCEIPo</computeroutput> instruction, the
argument of which gives the byte size of the instruction (note
that lazy INCEIP updating is turned off to allow this).</para>

<para>We can tell if an x86 instruction references memory by
looking for <computeroutput>LDL</computeroutput> and
<computeroutput>STL</computeroutput> UCode instructions, and thus
what kind of cost centre is required. From this we can determine
how many cost centres we need for the basic block, and their
sizes. We can then allocate them in a single array.</para>
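
<para>The preliminary pass can be sketched roughly as follows.
This is an illustration under stated assumptions, not the code in
<filename>vg_cachesim.c</filename>: the
<computeroutput>Opcode</computeroutput> enum, the
<computeroutput>CCCounts</computeroutput> struct and
<computeroutput>count_cost_centres</computeroutput> are
hypothetical names, and a basic block is reduced to a flat array
of opcodes.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

/* Hypothetical, simplified UCode opcodes; only the ones this sketch needs. */
typedef enum {
    OP_OTHER, OP_LOAD, OP_STORE, OP_FPU_R, OP_FPU_W, OP_INCEIP
} Opcode;

typedef struct {
    size_t n_icc;    /* x86 instructions that do not reference memory */
    size_t n_idcc;   /* x86 instructions that do */
} CCCounts;

/* Walk one basic block's UCode. Each INCEIP closes one x86 instruction;
   any memory opcode seen since the previous INCEIP means that instruction
   needs an idCC rather than an iCC. */
static CCCounts count_cost_centres(const Opcode *ucode, size_t n)
{
    CCCounts counts = { 0, 0 };
    int touches_memory = 0;

    for (size_t i = 0; i < n; i++) {
        switch (ucode[i]) {
        case OP_LOAD: case OP_STORE: case OP_FPU_R: case OP_FPU_W:
            touches_memory = 1;
            break;
        case OP_INCEIP:
            if (touches_memory) counts.n_idcc++; else counts.n_icc++;
            touches_memory = 0;
            break;
        default:
            break;
        }
    }
    return counts;
}
```
]]></programlisting>

<para>Run over the two-instruction example above, such a pass
would report one <computeroutput>iCC</computeroutput> and one
<computeroutput>idCC</computeroutput>, which is enough to work
out the size of the array to allocate.</para>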

<para>Consider the example code above. After the preliminary
pass, we know we need two cost centres, one
<computeroutput>iCC</computeroutput> and one
<computeroutput>idCC</computeroutput>. So we allocate an array to
store these which looks like this:</para>

<programlisting><![CDATA[
|(uninit)|   tag         (1 byte)
|(uninit)|   instr_size  (1 byte)
|(uninit)|   (padding)   (2 bytes)
|(uninit)|   instr_addr  (4 bytes)
|(uninit)|   I.a         (8 bytes)
|(uninit)|   I.m1        (8 bytes)
|(uninit)|   I.m2        (8 bytes)

|(uninit)|   tag         (1 byte)
|(uninit)|   instr_size  (1 byte)
|(uninit)|   data_size   (1 byte)
|(uninit)|   (padding)   (1 byte)
|(uninit)|   instr_addr  (4 bytes)
|(uninit)|   I.a         (8 bytes)
|(uninit)|   I.m1        (8 bytes)
|(uninit)|   I.m2        (8 bytes)
|(uninit)|   D.a         (8 bytes)
|(uninit)|   D.m1        (8 bytes)
|(uninit)|   D.m2        (8 bytes)]]></programlisting>

<para>(We can see now why we need tags to distinguish between the
two types of cost centres.)</para>
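
<para>To make the role of the tag concrete, here is a minimal
sketch of walking such a mixed-size array. It is not the actual
Cachegrind code: fixed-width types stand in for
<computeroutput>UChar</computeroutput>,
<computeroutput>Addr</computeroutput> and
<computeroutput>ULong</computeroutput>, the tag values are
assumed, and <computeroutput>make_example_array</computeroutput>
and <computeroutput>count_entries</computeroutput> are
hypothetical helpers.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>
#include <stdlib.h>

enum { INSTR_CC = 0, WRITE_CC = 1 };   /* tag values are assumed */

typedef struct { uint64_t a, m1, m2; } CC;
typedef struct {
    uint8_t tag, instr_size;
    uint32_t instr_addr;
    CC I;
} iCC;
typedef struct {
    uint8_t tag, instr_size, data_size;
    uint32_t instr_addr;
    CC I;
    CC D;
} idCC;

/* Allocate one zeroed array holding an iCC followed by an idCC, as for the
   example basic block. The caller frees it. */
static unsigned char *make_example_array(size_t *size_out)
{
    size_t size = sizeof(iCC) + sizeof(idCC);
    unsigned char *array = calloc(1, size);
    if (array == NULL) return NULL;
    ((iCC *)array)->tag = INSTR_CC;
    ((idCC *)(array + sizeof(iCC)))->tag = WRITE_CC;
    *size_out = size;
    return array;
}

/* Count the cost centres in the array. The tag is always the first byte of
   an entry, so reading it tells us how far to step to reach the next one. */
static size_t count_entries(const unsigned char *array, size_t size)
{
    size_t n = 0, off = 0;
    while (off < size) {
        off += (array[off] == INSTR_CC) ? sizeof(iCC) : sizeof(idCC);
        n++;
    }
    return n;
}
```
]]></programlisting>

<para>Without the leading tag there would be no way to tell, from
the raw bytes alone, where one cost centre ends and the next
begins.</para>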

<para>We also record the size of the array. We look up the debug
info of the first instruction in the basic block, and then stick
the array into a table indexed by filename and function name.
This makes it easy to dump the information quickly to file at the
end.</para>

</sect1>


<sect1 id="cg-tech-docs.instrum" xreflabel="Instrumentation">
<title>Instrumentation</title>

<para>The instrumentation pass has two main jobs:</para>

<orderedlist>
<listitem>
<para>Fill in the gaps in the allocated cost centres.</para>
</listitem>
<listitem>
<para>Add UCode to call the cache simulator for each
instruction.</para>
</listitem>
</orderedlist>

<para>The instrumentation pass steps through the UCode and the
cost centres in tandem. As each original x86 instruction's UCode
is processed, the appropriate gaps in the instruction's cost
centre are filled in, for example:</para>

<programlisting><![CDATA[
|INSTR_CC|   tag         (1 byte)
|5       |   instr_size  (1 byte)
|(uninit)|   (padding)   (2 bytes)
|i_addr1 |   instr_addr  (4 bytes)
|0       |   I.a         (8 bytes)
|0       |   I.m1        (8 bytes)
|0       |   I.m2        (8 bytes)

|WRITE_CC|   tag         (1 byte)
|7       |   instr_size  (1 byte)
|4       |   data_size   (1 byte)
|(uninit)|   (padding)   (1 byte)
|i_addr2 |   instr_addr  (4 bytes)
|0       |   I.a         (8 bytes)
|0       |   I.m1        (8 bytes)
|0       |   I.m2        (8 bytes)
|0       |   D.a         (8 bytes)
|0       |   D.m1        (8 bytes)
|0       |   D.m2        (8 bytes)]]></programlisting>

<para>(Note that this step is not performed if a basic block is
re-translated; see <xref linkend="cg-tech-docs.retranslations"/> for
more information.)</para>

<para>GCC inserts padding before the
<computeroutput>instr_addr</computeroutput> field so that it is
word aligned.</para>

<para>The instrumentation added to call the cache simulation
function looks like this (instrumentation is indented to
distinguish it from the original UCode):</para>

<programlisting><![CDATA[
MOVL      $0x0, t20
PUTL      t20, %EAX
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  MOVL      $0x4091F8A4, t46  # address of 1st CC
  PUSHL     t46
  CALLMo    $0x12             # cachesim function for non-memory instrs
  CLEARo    $0x4
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
  MOVL      t14, t42
STL       t18, (t14)
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  PUSHL     t42
  MOVL      $0x4091F8C4, t44  # address of 2nd CC
  PUSHL     t44
  CALLMo    $0x13             # cachesim function for memory instrs
  CLEARo    $0x8
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $7]]></programlisting>

<para>Consider the first instruction's UCode. Each call is
surrounded by three <computeroutput>PUSHL</computeroutput> and
<computeroutput>POPL</computeroutput> instructions to save and
restore the caller-save registers. Then the address of the
instruction's cost centre is pushed onto the stack, to be the
first argument to the cache simulation function. The address is
known at this point because we are doing a simultaneous pass
through the cost centre array. This means the cost centre lookup
for each instruction is almost free (just the cost of pushing an
argument for a function call). Then the call to the cache
simulation function for non-memory-reference instructions is made
(note that the <computeroutput>CALLMo</computeroutput>
UInstruction takes an offset into a table of predefined
functions; it is not an absolute address), and the single
argument is <computeroutput>CLEAR</computeroutput>ed from the
stack.</para>

<para>The second instruction's UCode is similar. The only
difference is that, as mentioned before, we have to pass the
address of the data item referenced to the cache simulation
function too. This explains the <computeroutput>MOVL t14,
t42</computeroutput> and <computeroutput>PUSHL
t42</computeroutput> UInstructions. (Note that the seemingly
redundant <computeroutput>MOV</computeroutput>ing will probably
be optimised away during register allocation.)</para>

<para>Note that instead of storing unchanging information about
each instruction (instruction size, data size, etc) in its cost
centre, we could have passed in these arguments to the simulation
function. But this would slow the calls down (two or three extra
arguments pushed onto the stack). Also it would bloat the UCode
instrumentation by amounts similar to the space required for them
in the cost centre; bloated UCode would also fill the translation
cache more quickly, requiring more translations for large
programs and slowing them down more.</para>
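
<para>The shape of the function on the receiving end of such a
call might look roughly like the sketch below. It is an
illustration only: <computeroutput>log_instr</computeroutput> and
<computeroutput>fake_I1_lookup</computeroutput> are hypothetical
names, and the lookup is a stand-in that misses on a fixed
pattern rather than a real cache model. The point is that the
function receives nothing but the cost centre pointer and reads
the unchanging instruction details from it.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { uint64_t a, m1, m2; } CC;
typedef struct {
    uint8_t tag, instr_size;
    uint32_t instr_addr;
    CC I;
} iCC;

/* Stand-in for the real cache model: reports which levels missed.
   Here any address that is a multiple of 64 misses L1 and never L2. */
static void fake_I1_lookup(uint32_t addr, uint8_t size, int *miss1, int *miss2)
{
    (void)size;
    *miss1 = (addr % 64 == 0);
    *miss2 = 0;
}

/* Called once per executed x86 instruction that does not reference memory.
   The address and size come from the cost centre, not from extra arguments. */
static void log_instr(iCC *cc)
{
    int miss1, miss2;
    fake_I1_lookup(cc->instr_addr, cc->instr_size, &miss1, &miss2);
    cc->I.a++;
    if (miss1) cc->I.m1++;
    if (miss2) cc->I.m2++;
}
```
]]></programlisting>

<para>A version for memory-referencing instructions would take
the data address as a second argument and update the
<computeroutput>D</computeroutput> counts in the same way.</para>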

</sect1>


<sect1 id="cg-tech-docs.retranslations"
       xreflabel="Handling basic block retranslations">
<title>Handling basic block retranslations</title>

<para>The above description ignores one complication. Valgrind
has a limited size cache for basic block translations; if it
fills up, old translations are discarded. If a discarded basic
block is executed again, it must be re-translated.</para>

<para>However, we can't use this approach for profiling -- we
can't throw away cost centres for instructions in the middle of
execution! So when a basic block is translated, we first look
for its cost centre array in the hash table. If there is no cost
centre array, it must be the first translation, so we proceed as
described above. But if there is a cost centre array already, it
must be a retranslation. In this case, we skip the cost centre
allocation and initialisation steps, but still do the UCode
instrumentation step.</para>
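
<para>The look-up-or-allocate logic can be sketched as below.
This is a simplified illustration, not the real data structure:
it keys a small chained hash table on the basic block's address,
which is an assumption made for brevity, and
<computeroutput>BBEntry</computeroutput> and
<computeroutput>get_cc_array</computeroutput> are hypothetical
names.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>
#include <stdlib.h>

#define N_BUCKETS 64   /* arbitrary size for this sketch */

typedef struct BBEntry {
    uint32_t bb_addr;          /* address of the block's first instruction */
    void *cc_array;            /* the block's cost centre array */
    struct BBEntry *next;      /* next entry in the same bucket */
} BBEntry;

static BBEntry *table[N_BUCKETS];

/* Return the cost centre array for a basic block. *is_new is set to 1 when
   the array was just allocated (first translation, so it still needs to be
   initialised) and to 0 when an existing array was found (retranslation,
   so the accumulated counts must be left alone). */
static void *get_cc_array(uint32_t bb_addr, size_t size, int *is_new)
{
    BBEntry **bucket = &table[bb_addr % N_BUCKETS];

    for (BBEntry *e = *bucket; e != NULL; e = e->next) {
        if (e->bb_addr == bb_addr) {
            *is_new = 0;
            return e->cc_array;
        }
    }

    BBEntry *e = malloc(sizeof *e);
    if (e == NULL) return NULL;
    e->cc_array = calloc(1, size);
    if (e->cc_array == NULL) { free(e); return NULL; }
    e->bb_addr = bb_addr;
    e->next = *bucket;
    *bucket = e;
    *is_new = 1;
    return e->cc_array;
}
```
]]></programlisting>

<para>Either way the caller ends up with the address of the
array, which is all the instrumentation step needs in order to
emit the calls again.</para>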

</sect1>



<sect1 id="cg-tech-docs.cachesim" xreflabel="The cache simulation">
<title>The cache simulation</title>

<para>The cache simulation is fairly straightforward. It just
tracks which memory blocks are in the cache at the moment (it
doesn't track the contents, since that is irrelevant).</para>
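
<para>A minimal sketch of a simulator that tracks only which
blocks are present, assuming a set-associative cache with LRU
replacement, is shown below. The parameters are deliberately tiny
and arbitrary, <computeroutput>Cache</computeroutput> and
<computeroutput>cache_access</computeroutput> are hypothetical
names, and accesses that straddle two lines are ignored; it is
not the code in the simulation files.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumed) */
#define N_SETS     4u   /* deliberately tiny */
#define ASSOC      2u   /* lines per set */

/* tags[s][0] is the most recently used line in set s. A slot holds the
   block number plus one, so that zero can mean "empty". No data is stored. */
typedef struct { uint64_t tags[N_SETS][ASSOC]; } Cache;

/* Look up one address; returns 1 on a miss and 0 on a hit. Either way the
   block ends up in the most-recently-used slot of its set. */
static int cache_access(Cache *c, uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;
    uint64_t *set = c->tags[block % N_SETS];
    uint64_t key = block + 1;
    unsigned i, pos = ASSOC - 1;   /* on a miss, the LRU slot is evicted */
    int miss = 1;

    for (i = 0; i < ASSOC; i++) {
        if (set[i] == key) { pos = i; miss = 0; break; }
    }
    /* Shift the more recently used entries down one place. */
    for (i = pos; i > 0; i--) set[i] = set[i - 1];
    set[0] = key;
    return miss;
}
```
]]></programlisting>

<para>In this model a second access to a line that was just
touched always hits, which is the reason given earlier for not
recording the write half of a read-then-write instruction.</para>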

<para>The interface to the simulation is quite clean. The
functions called from the UCode contain calls to the simulation
functions in the files
<filename>vg_cachesim_{I1,D1,L2}.c</filename>; these calls are
inlined so that only one function call is done per simulated x86
instruction. The file <filename>vg_cachesim.c</filename> simply
<computeroutput>#include</computeroutput>s the three files
containing the simulation, which makes plugging in new cache
simulations very easy -- you just replace the three files and
recompile.</para>

</sect1>


<sect1 id="cg-tech-docs.output" xreflabel="Output">
<title>Output</title>

<para>Output is fairly straightforward, basically printing the
cost centre for every instruction, grouped by files and
functions. Total counts (eg. total cache accesses, total L1
misses) are calculated when traversing this structure rather than
during execution, to save time; the cache simulation functions
are called so often that even one or two extra adds can make a
sizeable difference.</para>

<para>The output file, which cg_annotate reads as its input, has
the following format:</para>
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."]]></programlisting>

<para>Where:</para>
<itemizedlist>
<listitem>
<para><computeroutput>non_nl_string</computeroutput> is any
string not containing a newline.</para>
</listitem>
<listitem>
<para><computeroutput>cmd</computeroutput> is a command line
invocation.</para>
</listitem>
<listitem>
<para><computeroutput>filename</computeroutput> and
<computeroutput>fn_name</computeroutput> can be anything.</para>
</listitem>
<listitem>
<para><computeroutput>num</computeroutput> and
<computeroutput>line_num</computeroutput> are decimal
numbers.</para>
</listitem>
<listitem>
<para><computeroutput>ws</computeroutput> is whitespace.</para>
</listitem>
<listitem>
<para><computeroutput>nl</computeroutput> is a newline.</para>
</listitem>

</itemizedlist>

<para>The contents of the "desc:" lines are printed out at the top
of the summary. This is a generic way of providing
simulation-specific information, eg. for giving the cache
configuration for cache simulation.</para>

<para>Counts can be "." to represent "N/A", eg. the number of
write misses for an instruction that doesn't write to
memory.</para>

<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If the number in
a <computeroutput>count_line</computeroutput> is less, cg_annotate
treats the missing counts as though they were "." entries.</para>
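
<para>The handling of "." and of missing trailing counts can be
sketched as follows. cg_annotate itself is a Perl script, so this
C fragment is only an illustration of the rule;
<computeroutput>parse_count_line</computeroutput> and the
<computeroutput>NA</computeroutput> sentinel are hypothetical
names chosen for this example.</para>

<programlisting><![CDATA[
```c
#include <stdlib.h>
#include <string.h>

#define NA (-1LL)   /* sentinel this sketch uses for "." or a missing count */

/* Parse "line_num count count ..." into counts[0..n_events-1].
   Returns the line number, or -1 if the line has no leading decimal number.
   The line is modified in place, because strtok writes into it. */
static long parse_count_line(char *line, long long *counts, int n_events)
{
    char *end;
    char *tok = strtok(line, " \t\n");
    if (tok == NULL) return -1;

    long line_num = strtol(tok, &end, 10);
    if (end == tok || *end != '\0') return -1;

    for (int i = 0; i < n_events; i++) {
        /* Once the tokens run out, every remaining event is N/A. */
        if (tok != NULL) tok = strtok(NULL, " \t\n");
        if (tok == NULL || strcmp(tok, ".") == 0)
            counts[i] = NA;
        else
            counts[i] = strtoll(tok, NULL, 10);
    }
    return line_num;
}
```
]]></programlisting>

<para>With four events declared, a line carrying only three
counts leaves the fourth marked as N/A, exactly as if it had been
written as ".".</para>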

<para>A <computeroutput>file_line</computeroutput> changes the
current file name. A <computeroutput>fn_line</computeroutput>
changes the current function name. A
<computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name. An "fl="
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput>s to give the context
of the first <computeroutput>count_line</computeroutput>s.</para>

<para>Each <computeroutput>file_line</computeroutput> should be
immediately followed by a
<computeroutput>fn_line</computeroutput>. "fi="
<computeroutput>file_line</computeroutput>s are used to switch
filenames for inlined functions; "fe="
<computeroutput>file_line</computeroutput>s are similar, but are
put at the end of a basic block in which the file name hasn't
been switched back to the original file name. (fi and fe lines
behave the same; they are only distinguished to help
debugging.)</para>

</sect1>



<sect1 id="cg-tech-docs.summary"
       xreflabel="Summary of performance features">
<title>Summary of performance features</title>

<para>Quite a lot of work has gone into making the profiling as
fast as possible. This is a summary of the important
features:</para>

<itemizedlist>

<listitem>
<para>The basic block-level cost centre storage allows almost
free cost centre lookup.</para>
</listitem>

<listitem>
<para>Only one function call is made per instruction
simulated; even this accounts for a sizeable percentage of
execution time, but it seems unavoidable if we want
flexibility in the cache simulator.</para>
</listitem>

<listitem>
<para>Unchanging information about an instruction is stored
in its cost centre, avoiding unnecessary argument pushing,
and minimising UCode instrumentation bloat.</para>
</listitem>

<listitem>
<para>Summary counts are calculated at the end, rather than
during execution.</para>
</listitem>

<listitem>
<para>The <computeroutput>cachegrind.out</computeroutput>
output files can contain huge amounts of information; the file
format was carefully chosen to minimise file sizes.</para>
</listitem>

</itemizedlist>

</sect1>



<sect1 id="cg-tech-docs.annotate" xreflabel="Annotation">
<title>Annotation</title>

<para>Annotation is done by cg_annotate. It is a fairly
straightforward Perl script that slurps up all the cost centres,
and then runs through all the chosen source files, printing out
cost centres with them. It too has been carefully optimised.</para>

</sect1>



<sect1 id="cg-tech-docs.extensions" xreflabel="Similar work, extensions">
<title>Similar work, extensions</title>

<para>It would be relatively straightforward to do other
simulations and obtain line-by-line information about interesting
events. A good example would be branch prediction -- all
branches could be instrumented to interact with a branch
prediction simulator, using very similar techniques to those
described above.</para>

<para>In particular, cg_annotate would not need to change -- the
file format is such that it is not specific to the cache
simulation, but could be used for any kind of line-by-line
information. The only part of cg_annotate that is specific to
the cache simulation is the name of the input file
(<computeroutput>cachegrind.out</computeroutput>), although it
would be very simple to add an option to control this.</para>

</sect1>

</chapter>