Added section to tech docs on how cachegrind works, including the
cachegrind.out file format.
Tiny change in user manual.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@198 a5019735-40e9-0310-863c-91ae7b9d1cf9
diff --git a/memcheck/docs/manual.html b/memcheck/docs/manual.html
index 5644872..4b6b773 100644
--- a/memcheck/docs/manual.html
+++ b/memcheck/docs/manual.html
@@ -1929,7 +1929,11 @@
</ul>
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
-very useful for improving the performance of your program.
+very useful for improving the performance of your program.<p>
+
+Also, since one instruction cache read is performed per instruction executed,
+you can find out how many instructions are executed per line, which can be
+useful for optimisation and test coverage.<p>
Please note that this is an experimental feature. Any feedback, bug-fixes,
suggestions, etc, welcome.
diff --git a/memcheck/docs/techdocs.html b/memcheck/docs/techdocs.html
index aea95c9..5bfda47 100644
--- a/memcheck/docs/techdocs.html
+++ b/memcheck/docs/techdocs.html
@@ -2108,5 +2108,415 @@
<p>
+
+
+<hr width="100%">
+
+<h2>Cache profiling</h2>
+Valgrind is a very nice platform for doing cache profiling and other kinds of
+simulation, because it converts horrible x86 instructions into nice clean
+RISC-like UCode. For example, for cache profiling we are interested in
+instructions that read and write memory; in UCode there are only four
+instructions that do this: <code>LOAD</code>, <code>STORE</code>,
+<code>FPU_R</code> and <code>FPU_W</code>. By contrast, because of the x86
+addressing modes, almost every instruction can read or write memory.<p>
+
+Most of the cache profiling machinery is in the file
+<code>vg_cachesim.c</code>.<p>
+
+These notes are a somewhat haphazard guide to how Valgrind's cache profiling
+works.<p>
+
+<h3>Cost centres</h3>
+Valgrind gathers cache profiling information about every instruction executed,
+individually.  Each instruction has a <b>cost centre</b> associated with it.
+There are two kinds of cost centre: one for instructions that don't reference
+memory (<code>iCC</code>), and one for instructions that do
+(<code>idCC</code>):
+
+<pre>
+typedef struct _CC {
+ ULong a;
+ ULong m1;
+ ULong m2;
+} CC;
+
+typedef struct _iCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+} iCC;
+
+typedef struct _idCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+ UChar data_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+ CC D;
+} idCC;
+</pre>
+
+Each <code>CC</code> has three fields, <code>a</code>, <code>m1</code> and
+<code>m2</code>, for recording references, level 1 misses and level 2 misses.
+Each is a 64-bit <code>ULong</code>, because the numbers can get very large --
+greater than the 4.2 billion that a 32-bit unsigned int can hold.<p>
+
+An <code>iCC</code> has one <code>CC</code>, for instruction cache accesses.
+An <code>idCC</code> has two: one for instruction cache accesses, and one for
+data cache accesses.<p>
+
+The <code>iCC</code> and <code>idCC</code> structs also store unchanging
+information about the instruction:
+<ul>
+ <li>An instruction-type identification tag (explained below)</li><p>
+ <li>Instruction size</li><p>
+ <li>Data reference size (<code>idCC</code> only)</li><p>
+ <li>Instruction address</li><p>
+</ul>
+
+Note that the data address is not one of the fields of <code>idCC</code>. This
+is because for many memory-referencing instructions the data address can change
+each time the instruction is executed (eg. if it uses register-offset
+addressing). We have to pass this item to the cache simulation in a different
+way (see the Instrumentation section below). Some memory-referencing
+instructions do always reference the same address, but we don't try to treat
+them specially, in order to keep things simple.<p>
+
+Also note that there is only room for recording info about one data cache
+access in an <code>idCC</code>. So what about instructions that do a read then
+a write, such as:
+
+<blockquote><code>incl (%esi)</code></blockquote>
+
+In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
+since it immediately follows the read which will drag the block into the cache
+if it's not already there. So the write access isn't really interesting, and
+Valgrind doesn't record it. This means that Valgrind doesn't measure
+memory references, but rather memory references that could miss in the cache.
+This behaviour is the same as that used by the AMD Athlon hardware counters.
+It also has the benefit of simplifying the implementation -- instructions that
+read and write memory can be treated like instructions that read memory.<p>
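+
+To illustrate, here is a minimal sketch -- invented for this document, not
+taken from the real simulator -- of why the write of a read-modify-write
+instruction can never miss in a write-allocate cache. The toy direct-mapped
+cache, its sizes and all the names are assumptions:
+
+<pre>
+#include &lt;stdint.h&gt;
+
+typedef uint64_t ULong;
+typedef uint32_t Addr;
+typedef struct { ULong a, m1, m2; } CC;
+
+/* Toy direct-mapped cache: 256 lines of 64 bytes (sizes are arbitrary). */
+static Addr d1_tags[256];
+
+static int d1_ref(Addr a)        /* returns 1 on miss, 0 on hit */
+{
+   Addr tag  = a >> 6;           /* strip the 6 line-offset bits */
+   int  line = tag & 0xFF;
+   if (d1_tags[line] == tag) return 0;   /* hit */
+   d1_tags[line] = tag;                  /* miss: allocate the block */
+   return 1;
+}
+
+void sim_read_modify_write(CC *D, Addr a)
+{
+   D->a++;                       /* one access counted, not two */
+   if (d1_ref(a))                /* the read may miss...        */
+      D->m1++;
+   /* ...but the write that follows cannot: the read has just
+      allocated the block, so the write isn't simulated at all. */
+}
+</pre>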
+
+<h3>Storing cost-centres</h3>
+Cost centres are stored in a way that makes them very cheap to look up, which
+is important since one is looked up for every original x86 instruction
+executed.<p>
+
+Valgrind does JIT translations at the basic block level, and cost centres are
+also set up and stored at the basic block level. By doing things carefully, we
+store all the cost centres for a basic block in a contiguous array, and lookup
+comes almost for free.<p>
+
+Consider this part of a basic block (for exposition purposes, pretend it's an
+entire basic block):
+
+<pre>
+movl $0x0,%eax
+movl $0x99, -4(%ebp)
+</pre>
+
+The translation to UCode looks like this:
+
+<pre>
+MOVL $0x0, t20
+PUTL t20, %EAX
+INCEIPo $5
+
+LEA1L -4(t4), t14
+MOVL $0x99, t18
+STL t18, (t14)
+INCEIPo $7
+</pre>
+
+The first step is to allocate the cost centres. This requires a preliminary
+pass to count how many x86 instructions are in the basic block, and their
+types (and thus the sizes of their cost centres). UCode translations for
+single x86 instructions are
+delimited by the <code>INCEIPo</code> instruction, the argument of which gives
+the byte size of the instruction (note that lazy INCEIP updating is turned off
+to allow this).<p>
+
+We can tell whether an x86 instruction references memory by looking for
+<code>LDL</code> and <code>STL</code> UCode instructions in its translation,
+and thus which kind of cost centre it requires. From this we can determine how
+many cost centres we
+need for the basic block, and their sizes. We can then allocate them in a
+single array.<p>
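+
+A sketch of what this counting pass might look like; the opcode names and
+<code>UInstr</code> layout here are simplified assumptions, not the real data
+structures:
+
+<pre>
+/* Simplified UInstr: just an opcode and an integer operand. */
+enum { LOAD, STORE, FPU_R, FPU_W, INCEIP, OTHER };
+typedef struct { int opcode; int val; } UInstr;
+
+/* Count the x86 instructions in a basic block's UCode, and how many of
+   them reference memory (and so need an idCC rather than an iCC). */
+void count_BB_instrs(UInstr *u, int n_uinstrs,
+                     int *n_instrs, int *n_mem_instrs)
+{
+   int i, refs_mem = 0;
+   *n_instrs = *n_mem_instrs = 0;
+   for (i = 0; i &lt; n_uinstrs; i++) {
+      switch (u[i].opcode) {
+         case LOAD: case STORE: case FPU_R: case FPU_W:
+            refs_mem = 1;     /* current x86 instr references memory */
+            break;
+         case INCEIP:         /* delimits one x86 instruction */
+            (*n_instrs)++;
+            if (refs_mem) (*n_mem_instrs)++;
+            refs_mem = 0;
+            break;
+      }
+   }
+   /* The array is then allocated in one go:
+      (*n_instrs - *n_mem_instrs) * sizeof(iCC)
+       + *n_mem_instrs * sizeof(idCC)  bytes. */
+}
+</pre>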
+
+Consider the example code above. After the preliminary pass, we know we need
+two cost centres: one <code>iCC</code> and one <code>idCC</code>. So we
+allocate an array to store these, which looks like this:
+
+<pre>
+|(uninit)| tag         (1 byte)
+|(uninit)| instr_size  (1 byte)
+|(uninit)| (padding)   (2 bytes)
+|(uninit)| instr_addr  (4 bytes)
+|(uninit)| I.a         (8 bytes)
+|(uninit)| I.m1        (8 bytes)
+|(uninit)| I.m2        (8 bytes)
+
+|(uninit)| tag         (1 byte)
+|(uninit)| instr_size  (1 byte)
+|(uninit)| data_size   (1 byte)
+|(uninit)| (padding)   (1 byte)
+|(uninit)| instr_addr  (4 bytes)
+|(uninit)| I.a         (8 bytes)
+|(uninit)| I.m1        (8 bytes)
+|(uninit)| I.m2        (8 bytes)
+|(uninit)| D.a         (8 bytes)
+|(uninit)| D.m1        (8 bytes)
+|(uninit)| D.m2        (8 bytes)
+</pre>
+
+(We can see now why we need tags to distinguish between the two types of cost
+centres.)<p>
+
+We also record the size of the array. We look up the debug info of the first
+instruction in the basic block, and then stick the array into a table indexed
+by filename and function name. This makes it easy to dump the information
+quickly to file at the end.<p>
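+
+A sketch of the sort of table this might be; the node layout, names and
+hashing scheme are all invented for illustration:
+
+<pre>
+#define N_BUCKETS 4999
+
+/* One chain of cost-centre arrays per (filename, function) pair. */
+typedef struct _BBArray {
+   struct _BBArray *next;
+   int   size;               /* bytes of iCC/idCC entries */
+   void *ccs;                /* the contiguous array      */
+} BBArray;
+
+typedef struct _FnNode {
+   struct _FnNode *next;     /* hash-bucket chain */
+   const char *filename, *fn_name;
+   BBArray    *arrays;
+} FnNode;
+
+static FnNode *table[N_BUCKETS];
+
+static unsigned hash(const char *file, const char *fn)
+{
+   unsigned h = 0;
+   while (*file) h = 31*h + (unsigned char)*file++;
+   while (*fn)   h = 31*h + (unsigned char)*fn++;
+   return h % N_BUCKETS;
+}
+</pre>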
+
+<h3>Instrumentation</h3>
+The instrumentation pass has two main jobs:
+
+<ol>
+ <li>Fill in the gaps in the allocated cost centres.</li><p>
+ <li>Add UCode to call the cache simulator for each instruction.</li><p>
+</ol>
+
+The instrumentation pass steps through the UCode and the cost centres in
+tandem. As each original x86 instruction's UCode is processed, the appropriate
+gaps in the instruction's cost centre are filled in, for example:
+
+<pre>
+|INSTR_CC| tag         (1 byte)
+|5       | instr_size  (1 byte)
+|(uninit)| (padding)   (2 bytes)
+|i_addr1 | instr_addr  (4 bytes)
+|0       | I.a         (8 bytes)
+|0       | I.m1        (8 bytes)
+|0       | I.m2        (8 bytes)
+
+|READ_CC | tag         (1 byte)
+|7       | instr_size  (1 byte)
+|4       | data_size   (1 byte)
+|(uninit)| (padding)   (1 byte)
+|i_addr2 | instr_addr  (4 bytes)
+|0       | I.a         (8 bytes)
+|0       | I.m1        (8 bytes)
+|0       | I.m2        (8 bytes)
+|0       | D.a         (8 bytes)
+|0       | D.m1        (8 bytes)
+|0       | D.m2        (8 bytes)
+</pre>
+
+(Note that this step is not performed if a basic block is re-translated; see
+<a href="#retranslations">here</a> for more information.)<p>
+
+GCC inserts padding after the size fields so that the <code>instr_addr</code>
+field is word-aligned.<p>
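+
+This layout can be checked mechanically. A small sketch for the
+<code>iCC</code> case, assuming a 32-bit <code>Addr</code> as Valgrind uses:
+
+<pre>
+#include &lt;stddef.h&gt;
+#include &lt;assert.h&gt;
+
+typedef unsigned long long ULong;
+typedef unsigned int       Addr;
+typedef unsigned char      UChar;
+
+typedef struct { ULong a, m1, m2; } CC;
+typedef struct { UChar tag; UChar instr_size; Addr instr_addr; CC I; } iCC;
+
+int main(void)
+{
+   /* tag and instr_size fill bytes 0 and 1; the compiler pads bytes
+      2-3 so that instr_addr starts on a word boundary at byte 4. */
+   assert(offsetof(iCC, instr_addr) == 4);
+   assert(sizeof(iCC) == 4 + 4 + 3*8);
+   return 0;
+}
+</pre>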
+
+The instrumentation added to call the cache simulation function looks like this
+(instrumentation is indented to distinguish it from the original UCode):
+
+<pre>
+MOVL $0x0, t20
+PUTL t20, %EAX
+ PUSHL %eax
+ PUSHL %ecx
+ PUSHL %edx
+ MOVL $0x4091F8A4, t46 # address of 1st CC
+ PUSHL t46
+ CALLMo $0x12 # first cachesim function
+ CLEARo $0x4
+ POPL %edx
+ POPL %ecx
+ POPL %eax
+INCEIPo $5
+
+LEA1L -4(t4), t14
+MOVL $0x99, t18
+ MOVL t14, t42
+STL t18, (t14)
+ PUSHL %eax
+ PUSHL %ecx
+ PUSHL %edx
+ PUSHL t42
+ MOVL $0x4091F8C4, t44 # address of 2nd CC
+ PUSHL t44
+ CALLMo $0x13 # second cachesim function
+ CLEARo $0x8
+ POPL %edx
+ POPL %ecx
+ POPL %eax
+INCEIPo $7
+</pre>
+
+Consider the first instruction's UCode. Each call is surrounded by three
+<code>PUSHL</code> and <code>POPL</code> instructions to save and restore the
+caller-save registers. Then the address of the instruction's cost centre is
+pushed onto the stack, to be the first argument to the cache simulation
+function. The address is known at this point because we are doing a
+simultaneous pass through the cost centre array. This means the cost centre
+lookup for each instruction is almost free (just the cost of pushing an
+argument for a function call). Then the call to the cache simulation function
+for non-memory-reference instructions is made (note that the
+<code>CALLMo</code> UInstruction takes an offset into a table of predefined
+functions; it is not an absolute address), and the single argument is
+<code>CLEAR</code>ed from the stack.<p>
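+
+In C terms, <code>CALLMo</code>'s operand indexes a table of helper
+functions, along these lines (the slot numbers 0x12 and 0x13 come from the
+example above; the helper names and empty bodies are invented):
+
+<pre>
+typedef void (*Helper)(void);
+
+/* In reality the helpers fetch their arguments (the CC address, plus
+   the data address for the memory version) from the simulated stack. */
+static void cachesim_log_instr(void)     { /* update an iCC  */ }
+static void cachesim_log_mem_instr(void) { /* update an idCC */ }
+
+static Helper helper_table[] = {
+   /* ... slots 0x00 to 0x11: Valgrind's other helpers ... */
+   [0x12] = cachesim_log_instr,
+   [0x13] = cachesim_log_mem_instr,
+};
+
+/* So "CALLMo $0x12" behaves like: helper_table[0x12](); */
+</pre>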
+
+The second instruction's UCode is similar. The only difference is that, as
+mentioned before, we have to pass the address of the data item referenced to
+the cache simulation function too. This explains the <code>MOVL t14,
+t42</code> and <code>PUSHL t42</code> UInstructions. (Note that the seemingly
+redundant <code>MOV</code>ing will probably be optimised away during register
+allocation.)<p>
+
+Note that instead of storing unchanging information about each instruction
+(instruction size, data size, etc) in its cost centre, we could have passed
+these arguments to the simulation function. But this would slow the calls down
+(two or three extra arguments pushed onto the stack). It would also bloat the
+UCode instrumentation by amounts similar to the space required for them in the
+cost centre; bloated UCode would also fill the translation cache more quickly,
+requiring more translations for large programs and slowing them down more.<p>
+
+<a name="retranslations"></a>
+<h3>Handling basic block retranslations</h3>
+The above description ignores one complication. Valgrind has a limited-size
+cache for basic block translations; if it fills up, old translations are
+discarded. If a discarded basic block is executed again, it must be
+re-translated.<p>
+
+However, we can't use this approach for profiling -- we can't throw away cost
+centres for instructions in the middle of execution! So when a basic block is
+translated, we first look for its cost centre array in the hash table. If
+there is no cost centre array, it must be the first translation, so we proceed
+as described above. But if there is a cost centre array already, it must be a
+retranslation. In this case, we skip the cost centre allocation and
+initialisation steps, but still do the UCode instrumentation step.<p>
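+
+A sketch of this translate-time decision; for brevity the table here is
+keyed by the basic block's original address, and all the names are invented:
+
+<pre>
+#include &lt;stdlib.h&gt;
+
+typedef unsigned int Addr;
+
+typedef struct _BBCC {
+   struct _BBCC *next;       /* hash-bucket chain             */
+   Addr          orig_addr;  /* address of the basic block    */
+   void         *array;      /* the contiguous iCC/idCC array */
+} BBCC;
+
+#define N_BUCKETS 4999
+static BBCC *table[N_BUCKETS];
+
+BBCC *get_BBCC(Addr orig_addr)
+{
+   BBCC **head = &amp;table[orig_addr % N_BUCKETS];
+   BBCC  *b;
+   for (b = *head; b != NULL; b = b->next)
+      if (b->orig_addr == orig_addr)
+         return b;    /* retranslation: reuse, don't reinitialise */
+
+   /* First translation: allocate now; the counting and initialisation
+      passes described above then fill in b->array. */
+   b = calloc(1, sizeof(BBCC));
+   b->orig_addr = orig_addr;
+   b->next = *head;
+   *head = b;
+   return b;
+}
+</pre>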
+
+<h3>The cache simulation</h3>
+The cache simulation is fairly straightforward. It just tracks which memory
+blocks are in the cache at the moment (it doesn't track the contents, since
+that is irrelevant).<p>
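+
+For example, one level of a set-associative cache with LRU replacement can
+be simulated with nothing but an array of tags. This sketch is a miniature
+of what one of the simulation files might contain; the sizes are invented,
+and accesses that straddle two lines are ignored:
+
+<pre>
+typedef unsigned int Addr;
+
+#define LINE_BITS 6      /* 64-byte lines         */
+#define SETS      128
+#define WAYS      2      /* 2-way set-associative */
+
+/* tags[set][0] is the most recently used way in that set. */
+static Addr tags[SETS][WAYS];
+
+/* Returns 1 on miss, 0 on hit; a miss would then be handed on to the
+   next level's simulation (omitted here). */
+static int cache_ref(Addr a)
+{
+   Addr tag = a >> LINE_BITS;
+   int  set = tag & (SETS-1);
+   int  i, j;
+   for (i = 0; i &lt; WAYS; i++) {
+      if (tags[set][i] == tag) {
+         for (j = i; j > 0; j--)       /* hit: move to front */
+            tags[set][j] = tags[set][j-1];
+         tags[set][0] = tag;
+         return 0;
+      }
+   }
+   for (j = WAYS-1; j > 0; j--)        /* miss: evict the LRU way */
+      tags[set][j] = tags[set][j-1];
+   tags[set][0] = tag;
+   return 1;
+}
+</pre>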
+
+The interface to the simulation is quite clean. The functions called from the
+UCode contain calls to the simulation functions in the files
+<code>vg_cachesim_{I1,D1,L2}.c</code>; these calls are inlined so that only
+one function call is done per simulated x86 instruction. The file
+<code>vg_cachesim.c</code> simply <code>#include</code>s the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.<p>
+
+<h3>Output</h3>
+Output is fairly straightforward, basically printing the cost centre for every
+instruction, grouped by files and functions. Total counts (eg. total cache
+accesses, total L1 misses) are calculated when traversing this structure rather
+than during execution, to save time; the cache simulation functions are called
+so often that even one or two extra adds can make a sizeable difference.<p>
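+
+Summing might look like the following sketch, which also shows what the tag
+byte is for at dump time (the tag values mirror the diagrams above; the rest
+is invented):
+
+<pre>
+#include &lt;stdint.h&gt;
+
+typedef uint64_t ULong;
+typedef uint32_t Addr;
+typedef uint8_t  UChar;
+
+typedef struct { ULong a, m1, m2; } CC;
+typedef struct { UChar tag; UChar instr_size; Addr instr_addr;
+                 CC I; } iCC;
+typedef struct { UChar tag; UChar instr_size; UChar data_size;
+                 Addr instr_addr; CC I; CC D; } idCC;
+
+enum { INSTR_CC, READ_CC };            /* assumed tag values */
+
+static void add_CC(CC *to, const CC *from)
+{
+   to->a += from->a;  to->m1 += from->m1;  to->m2 += from->m2;
+}
+
+/* Walk one basic block's contiguous array; the tag byte says whether
+   the next entry is an iCC or an idCC, and so how far to step. */
+static void sum_BB_array(void *array, int size, CC *Itot, CC *Dtot)
+{
+   char *p = (char*)array, *end = p + size;
+   while (p &lt; end) {
+      if (((iCC*)p)->tag == INSTR_CC) {
+         add_CC(Itot, &amp;((iCC*)p)->I);
+         p += sizeof(iCC);
+      } else {                 /* a memory-referencing instruction */
+         add_CC(Itot, &amp;((idCC*)p)->I);
+         add_CC(Dtot, &amp;((idCC*)p)->D);
+         p += sizeof(idCC);
+      }
+   }
+}
+</pre>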
+
+The output file, <code>cachegrind.out</code>, has the following format:
+
+<pre>
+file ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line ::= "desc:" ws? non_nl_string
+cmd_line ::= "cmd:" ws? cmd
+events_line ::= "events:" ws? (event ws)+
+data_line ::= file_line | fn_line | count_line
+file_line ::= ("fl=" | "fi=" | "fe=") filename
+fn_line ::= "fn=" fn_name
+count_line ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count ::= num | "."
+</pre>
+
+Where:
+
+<ul>
+ <li><code>non_nl_string</code> is any string not containing a newline.</li><p>
+ <li><code>cmd</code> is a command line invocation.</li><p>
+ <li><code>filename</code> and <code>fn_name</code> can be anything.</li><p>
+ <li><code>event</code> is a string containing no whitespace.</li><p>
+ <li><code>num</code> and <code>line_num</code> are decimal numbers.</li><p>
+ <li><code>ws</code> is whitespace.</li><p>
+ <li><code>nl</code> is a newline.</li><p>
+</ul>
+
+The contents of the "desc:" lines are printed out at the top of the summary.
+This is a generic way of providing simulation-specific information, eg. the
+cache configuration for the cache simulation.<p>
+
+Counts can be "." to represent "N/A", eg. the number of write misses for an
+instruction that doesn't write to memory.<p>
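+
+Putting this together, a small (entirely invented) <code>cachegrind.out</code>
+might look like this:
+
+<pre>
+desc: I1 cache: 16384 B, 64 B, 2-way associative
+desc: D1 cache: 16384 B, 64 B, 2-way associative
+desc: L2 cache: 262144 B, 64 B, 8-way associative
+cmd: ./myprog
+events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+fl=myprog.c
+fn=main
+10 2 1 1 . . . . . .
+11 500 0 0 100 2 1 100 0 0
+fn=helper
+20 1500 1 1 500 10 2 . . .
+summary: 2002 2 2 600 12 3 100 0 0
+</pre>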
+
+The number of counts in each <code>count_line</code> and in the
+<code>summary_line</code> should not exceed the number of events in the
+<code>events_line</code>. If a <code>count_line</code> has fewer, vg_annotate
+treats the missing counts as "." entries.<p>
+
+A <code>file_line</code> changes the current file name. A <code>fn_line</code>
+changes the current function name. A <code>count_line</code> contains counts
+that pertain to the current filename/fn_name. A "fl=" <code>file_line</code>
+and a <code>fn_line</code> must appear before any <code>count_line</code>s to
+give the context of the first <code>count_line</code>s.<p>
+
+Each "fl=" <code>file_line</code> should be immediately followed by a
+<code>fn_line</code>. "fi=" <code>file_line</code>s are used to switch
+filenames for inlined functions; "fe=" <code>file_line</code>s are similar, but
+are put at the end of a basic block in which the file name hasn't been switched
+back to the original file name. ("fi=" and "fe=" lines behave the same; they
+are only distinguished to help debugging.)<p>
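+
+For example, a fragment for a function <code>f()</code> in <code>bar.c</code>
+containing code inlined from <code>foo.h</code> might look like this
+(hypothetical, assuming a single-event <code>events</code> line for brevity):
+
+<pre>
+fl=bar.c
+fn=f
+7 100
+fi=foo.h
+12 200
+13 200
+fe=bar.c
+8 100
+</pre>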
+
+
+<h3>Summary of performance features</h3>
+Quite a lot of work has gone into making the profiling as fast as possible.
+This is a summary of the important features:
+
+<ul>
+ <li>The basic block-level cost centre storage allows almost free cost centre
+ lookup.</li><p>
+
+ <li>Only one function call is made per instruction simulated; even this
+ accounts for a sizeable percentage of execution time, but it seems
+ unavoidable if we want flexibility in the cache simulator.</li><p>
+
+ <li>Unchanging information about an instruction is stored in its cost centre,
+ avoiding unnecessary argument pushing, and minimising UCode
+ instrumentation bloat.</li><p>
+
+ <li>Summary counts are calculated at the end, rather than during
+ execution.</li><p>
+
+ <li>The <code>cachegrind.out</code> output files can contain huge amounts of
+ information; the file format was carefully chosen to minimise file
+ sizes.</li><p>
+</ul>
+
+
+<h3>Annotation</h3>
+Annotation is done by vg_annotate. It is a fairly straightforward Perl script
+that slurps up all the cost centres, and then runs through all the chosen
+source files, printing out the cost centres alongside the source lines. It
+too has been carefully optimised.
+
+
+<h3>Similar work, extensions</h3>
+It would be relatively straightforward to do other simulations and obtain
+line-by-line information about interesting events. A good example would be
+branch prediction -- all branches could be instrumented to interact with a
+branch prediction simulator, using very similar techniques to those described
+above.<p>
+
+In particular, vg_annotate would not need to change -- the file format is such
+that it is not specific to the cache simulation, but could be used for any kind
+of line-by-line information. The only part of vg_annotate that is specific to
+the cache simulation is the name of the input file
+(<code>cachegrind.out</code>), although it would be very simple to add an
+option to control this.<p>
+
</body>
</html>