New files:
  - vg_cachesim.c
  - vg_cachesim_{I1,D1,L2}.c
  - vg_annotate.in
  - vg_cachegen.in

Changes to existing files:

  - valgrind/valgrind.in, added option:

        --cachesim=no|yes       [no]

  - Makefile/Makefile.am:
        * added vg_cachesim.c to valgrind_so_SOURCES var
        * added vg_cachesim_I1.c, vg_cachesim_D1.c, vg_cachesim_L2.c to
          noinst_HEADERS var
        * added vg_annotate, vg_cachegen to 'bin_SCRIPTS' var, and added empty
          targets for them

  - vg_main.c:
        * added two offsets for cache sim functions (put in positions 17a,17b)
        * added option handling (detection of --cachesim=yes, which turns off
          --instrument);
        * added calls to cachesim initialisation/finalisation functions

  - vg_mylibc.c: added some system call wrappers (for chmod, open_write, etc) for
    file writing

  - vg_symtab2.c:
        * allow it to read symbols if either of --instrument or --cachesim is
          used
        * made vg_symtab2.c:vg_what_{line,fn}_is_this extern, renaming it as
          VG_(what_line_is_this) (and added to vg_include.h)
        * completely rewrote the read loop in vg_read_lib_symbols, fixing
          several bugs.  Much better now, although probably not perfect.  It's
          also relatively fragile -- I'm using the "die immediately if anything
          unexpected happens" approach.

  - vg_to_ucode.c:
        * in VG_(disBB), patching in x86 instruction size into extra4b field of
          JMP instructions at the end of basic blocks if --cachesim=yes.
          Shifted things around to do this;  also had to fiddle around with
          single-step stuff to get this to work, by not sticking extra JMPs on
          the end of the single-instruction block if there was already one
          there (to avoid breaking an assertion in vg_cachesim.c).  Did a
          similar thing to avoid an extra JMP on huge basic blocks that are
          split.

  - vg_translate.c:
        * if --cachesim=yes call the cachesim instrumentation phase
        * made some functions extern and renamed:
                allocCodeBlock() --> VG_(allocCodeBlock)()
                freeCodeBlock()  --> VG_(freeCodeBlock)()
                copyUInstr()     --> VG_(copyUInstr)()
          (added to vg_include.h too)

  - vg_include.h:
        * declared cachesim offsets
        * declared exports of vg_cachesim.c
        * added four new profiling events (increasing VGP_M_CCS to 24 -- I kept
          the spare ones)
        * added comment about UInstr.extra4b field being used for instr size in
          JMPs for cache simulation

  - docs/manual.html:
        * Added --cachesim option to section 2.5.
        * Added cache profiling stuff as section 7.


git-svn-id: svn://svn.valgrind.org/valgrind/trunk@168 a5019735-40e9-0310-863c-91ae7b9d1cf9
diff --git a/memcheck/docs/manual.html b/memcheck/docs/manual.html
index c441db1..a97c2f9 100644
--- a/memcheck/docs/manual.html
+++ b/memcheck/docs/manual.html
@@ -78,7 +78,9 @@
 
 <h4>6&nbsp; <a href="#example">An example</a></h4>
 
-<h4>7&nbsp; <a href="techdocs.html">The design and implementation of Valgrind</a></h4>
+<h4>7&nbsp; <a href="#cache">Cache profiling</a></h4>
+
+<h4>8&nbsp; <a href="techdocs.html">The design and implementation of Valgrind</a></h4>
 
 <hr width="100%">
 
@@ -515,6 +517,11 @@
       buggy, so you may need to issue this flag if you use 3.0.4.
       </li><br><p>
 
+  <li><code>--cachesim=no</code> [default]<br>
+      <code>--cachesim=yes</code>
+      <p>When enabled, turns off memory checking, and turns on cache profiling.
+      Cache profiling is described in detail in <a href="#cache">Section 7</a>.
+      </li><p>
 </ul>
 
 There are also some options for debugging Valgrind itself.  You
@@ -1763,5 +1770,632 @@
 <p>The GCC folks fixed this about a week before gcc-3.0 shipped.
 <hr width="100%">
 <p>
+
+
+
+<a name="cache"></a>
+<h2>7&nbsp; Cache profiling</h2>
+As well as memory debugging, Valgrind also allows you to do cache simulations
+and annotate your source line-by-line with the number of cache misses.  In
+particular, it records:
+<ul>
+  <li>L1 instruction cache reads and misses;
+  <li>L1 data cache reads and read misses, writes and write misses;
+  <li>L2 unified cache reads and read misses, writes and write misses.
+</ul>
+On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
+and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
+very useful for improving the performance of your program.
+
+Please note that this is an experimental feature.  Any feedback, bug-fixes,
+suggestions, etc, are welcome.
+
+
+<h3>7.1&nbsp; Overview</h3>
+First off, as for normal Valgrind use, you probably want to turn on debugging
+info (the <code>-g</code> flag).  But by contrast with normal Valgrind use, you
+probably <b>do</b> want to turn optimisation on, since you should profile your
+program as it will be normally run.
+
+The three steps are:
+<ol>
+  <li>Generate a cache simulator for your machine's cache configuration with
+      <code>vg_cachegen</code> and recompile Valgrind with <code>make install</code>.
+      Valgrind comes with a default simulator, but it is unlikely to be correct
+      for your system, so you should generate a simulator yourself.</li>
+  <li>Run your program with <code>valgrind --cachesim=yes</code> in front of 
+      the normal command line invocation.  When the program finishes, Valgrind
+      will print summary cache statistics. It also collects line-by-line
+      information in a file <code>cachegrind.out</code>.</li>
+  <li>Generate a function-by-function summary, and possibly annotate source
+      files, with <code>vg_annotate</code>. Source files to annotate can be
+      specified manually on the command line, or "interesting" source files
+      can be annotated automatically with the <code>--auto=yes</code> option.
+      You can annotate C/C++ files or assembly language files equally
+      easily.</li>
+</ol>
+
+<a href="#generate">Step 1</a> only needs to be done once, unless you are
+interested in simulating different cache configurations (eg. first
+concentrating on instruction cache misses, then on data cache misses).<p>
+
+<a href="#profile">Step 2</a> should be done every time you want to collect
+information about a new program, a changed program, or about the same program
+with different input.<p>
+
+<a href="#annotate">Step 3</a> can be performed as many times as you like for
+each Step 2; you may want to do multiple annotations showing different
+information each time.<p>
+
+The steps are described in detail in the following sections.<p>
+
+
+<a name="generate"></a>
+<h3>7.3&nbsp; Generating a cache simulator</h3>
+Although Valgrind comes with a pre-generated cache simulator, it most likely
+won't match the cache configuration of your machine, so you should generate
+a new simulator.<p>
+
+You need to generate three files, one for each of the I1, D1 and L2 caches.
+For each cache, you need to know the:
+<ul>
+  <li>Cache size (bytes);
+  <li>Line size (bytes);
+  <li>Associativity.
+</ul>
+
+vg_cachegen takes three options:
+<ul>
+  <li><code>--I1=size,line_size,associativity</code>
+  <li><code>--D1=size,line_size,associativity</code>
+  <li><code>--L2=size,line_size,associativity</code>
+</ul>
+
+You can specify one, two or all three caches per invocation of vg_cachegen.  It
+checks that the configuration is sensible before generating the simulators;  to
+see the allowed values, run <code>vg_cachegen -h</code>.<p>
+
+An example invocation would be:
+
+<blockquote><code>
+  vg_cachegen --I1=65536,64,2 --D1=65536,64,2 --L2=262144,64,8
+</code></blockquote>
+
+This simulates a machine with a 128KB split L1 cache (a 64KB I1 and a 64KB D1,
+each 2-way associative) and a 256KB unified 8-way associative L2 cache.  All
+three caches have 64B lines.<p>
+
+If you don't know your cache configuration, you'll have to find it out.
+(Ideally vg_cachegen could auto-identify your cache configuration using the
+CPUID instruction, which could be done automatically during installation, and
+this whole step could be skipped...)<p>
+
+
+<h3>7.4&nbsp; Cache simulation specifics</h3>
+vg_cachegen only generates simulations for a machine with a split L1 cache and
+a unified L2 cache.  This configuration is used for all x86-based machines we
+are aware of.<p>
+
+The more specific characteristics of the simulation are as follows.
+
+<ul>
+  <li>Write-allocate: when a write miss occurs, the block written to is brought
+      into the D1 cache.  Most modern caches have this property.</li><p>
+
+  <li>Bit-selection hash function:  the line(s) in the cache to which a memory
+      block maps are chosen by the middle bits M--(M+N-1) of the byte address,
+      where:
+      <ul>
+        <li>&nbsp;line size = 2^M bytes&nbsp;</li>
+        <li>(cache size / line size / associativity) = 2^N sets</li>
+      </ul> </li><p>
+
+  <li>Inclusive L2 cache:  the L2 cache replicates all the entries of the L1
+      cache.  This is standard on Pentium chips, but AMD Athlons use an
+      exclusive L2 cache that only holds blocks evicted from L1.</li><p>
+</ul>
+
+Other noteworthy behaviour:
+
+<ul>
+  <li>References that straddle two cache lines are treated as follows:</li>
+  <ul>
+    <li>If both blocks hit --&gt; counted as one hit</li>
+    <li>If one block hits, the other misses --&gt; counted as one miss</li>
+    <li>If both blocks miss --&gt; counted as one miss (not two)</li>
+  </ul><p>
+
+  <li>Instructions that modify a memory location (eg. <code>inc</code> and
+      <code>dec</code>) are counted as doing just a read, ie. a single data
+      reference.  This may seem strange, but since the write can never cause a
+      miss (the read guarantees the block is in the cache) it's not very
+      interesting.<p>
+
+      Thus it measures not the number of times the data cache is accessed, but
+      the number of times a data cache miss could occur.<p>
+      </li>
+</ul>
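+
+As a worked illustration of the bit selection described above (a sketch only;
+the cache parameters and the address are hypothetical choices, not output from
+vg_cachegen), consider a direct-mapped cache of 65536 bytes with 64-byte lines:
+
+<pre>
+line size              = 64   = 2^6    so M = 6
+cache size / line size = 1024 = 2^10   so N = 10
+index bits             = 6..15 of the byte address
+
+address 0x8048f2b:
+  offset in line = 0x8048f2b &amp; 0x3f            = 0x2b  (43)
+  line index     = (0x8048f2b &gt;&gt; 6) &amp; 0x3ff   = 0x23c (572)
+</pre>
+
+That line covers addresses 0x8048f00--0x8048f3f, so a four-byte access at
+0x8048f3e would straddle lines 572 and 573; under the rules above it would be
+counted as a single reference with at most one miss.<p>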
+
+If you are interested in simulating a cache with different properties, it is
+not particularly hard to write your own cache simulator, or to modify existing
+ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code> and
+<code>vg_cachesim_L2.c</code>.  We'd be interested to hear from anyone who
+does.
+
+
+<a name="profile"></a>
+<h3>7.5&nbsp; Profiling programs</h3>
+Cache profiling is enabled by using the <code>--cachesim=yes</code> option to
+Valgrind.  This automatically turns off Valgrind's memory checking functions,
+since the cache simulation is slow enough already, and you probably don't want
+to do both at once.<p>
+
+To gather cache profiling information about the program <code>ls -l</code>, type:
+
+<blockquote><code>valgrind --cachesim=yes ls -l</code></blockquote>
+
+The program will execute (slowly).  Upon completion, summary statistics
+that look like this will be printed:
+
+<pre>
+==31751== I   refs:      27,742,716
+==31751== I1  misses:           276
+==31751== L2  misses:           275
+==31751== I1  miss rate:        0.0%
+==31751== L2i miss rate:        0.0%
+==31751== 
+==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
+==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
+==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
+==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
+==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
+==31751== 
+==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
+==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
+</pre>
+
+Cache accesses for instruction fetches are summarised first, giving the
+number of fetches made (this is the number of instructions executed, which
+can be useful to know in its own right), the number of I1 misses, and the
+number of L2 instruction (<code>L2i</code>) misses.<p>
+
+Cache accesses for data follow. The information is similar to that of the
+instruction fetches, except that the values are also shown split between reads
+and writes (note each row's <code>rd</code> and <code>wr</code> values add up
+to the row's total).<p>
+
+Combined instruction and data figures for the L2 cache follow that.<p>
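+
+As a check on how these figures relate, the totals in the example above can be
+reproduced by hand.  The rates appear to be computed relative to the total
+number of references and truncated to one decimal place; that is an inference
+from these particular numbers, not a documented guarantee:
+
+<pre>
+D  refs      = 10,955,517 + 4,474,773 = 15,430,290
+D1 misses    =     21,905 +    19,280 =     41,185
+L2 misses    =        275 +    23,085 =     23,360   (instruction + data)
+
+D1 miss rate = 41,185 / 15,430,290                = 0.27%  (shown as 0.2%)
+L2 miss rate = 23,360 / (27,742,716 + 15,430,290) = 0.05%  (shown as 0.0%)
+</pre>
+
+Using the rough costs quoted at the start of this section (about 10 cycles per
+L1 miss, up to 200 per L2 miss), a back-of-envelope estimate of the cycles lost
+to misses in this run is (276 + 41,185) x 10 + 23,360 x 200 = 5,086,610.  Treat
+this as an order-of-magnitude guide only.<p>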
+
+
+<h3>7.6&nbsp; Output file</h3>
+As well as printing summary information, Valgrind also writes line-by-line
+cache profiling information to a file named <code>cachegrind.out</code>.  This
+file is human-readable, but is best interpreted by the accompanying program
+vg_annotate, described in the next section.<p>
+
+Things to note about the <code>cachegrind.out</code> file:
+<ul>
+  <li>It is written every time <code>valgrind --cachesim=yes</code> is run; it
+      will automatically overwrite any existing <code>cachegrind.out</code> in
+      the current directory.</li>
+  <li>It can be quite large: <code>ls -l</code> generates a file of about
+      350KB; browsing a few files and web pages with Konqueror generates a file
+      of around 10MB.</li>
+</ul>
+
+
+<a name="annotate"></a>
+<h3>7.7&nbsp; Annotating C/C++ programs</h3>
+Before using vg_annotate, it is worth widening your window to be at least
+120-characters wide if possible, as the output lines can be quite long.<p>
+
+To get a function-by-function summary, run <code>vg_annotate</code> in a
+directory containing a <code>cachegrind.out</code> file.  The output looks like
+this:
+
+<pre>
+--------------------------------------------------------------------------------
+I1 cache:              65536 B, 64 B, 2-way associative
+D1 cache:              65536 B, 64 B, 2-way associative
+L2 cache:              262144 B, 64 B, 8-way associative
+Command:               concord vg_to_ucode.c
+Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Threshold:             99%
+Chosen for annotation:
+Auto-annotation:       on
+
+--------------------------------------------------------------------------------
+Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
+--------------------------------------------------------------------------------
+27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS
+
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr        D1mr  D2mr  Dw        D1mw   D2mw    file:function
+--------------------------------------------------------------------------------
+8,821,482    5    5 2,242,702 1,621    73 1,794,230      0      0  getc.c:_IO_getc
+5,222,023    4    4 2,276,334    16    12   875,959      1      1  concord.c:get_word
+2,649,248    2    2 1,344,810 7,326 1,385         .      .      .  vg_main.c:strcmp
+2,521,927    2    2   591,215     0     0   179,398      0      0  concord.c:hash
+2,242,740    2    2 1,046,612   568    22   448,548      0      0  ctype.c:tolower
+1,496,937    4    4   630,874 9,000 1,400   279,388      0      0  concord.c:insert
+  897,991   51   51   897,831    95    30        62      1      1  ???:???
+  598,068    1    1   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
+  598,068    0    0   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
+  598,024    4    4   213,580    35    16   149,506      0      0  vg_clientmalloc.c:malloc
+  446,587    1    1   215,973 2,167   430   129,948 14,057 13,957  concord.c:add_existing
+  341,760    2    2   128,160     0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
+  320,782    4    4   150,711   276     0    56,027     53     53  concord.c:init_hash_table
+  298,998    1    1   106,785     0     0    64,071      1      1  concord.c:create
+  149,518    0    0   149,516     0     0         1      0      0  ???:tolower@@GLIBC_2.0
+  149,518    0    0   149,516     0     0         1      0      0  ???:fgetc@@GLIBC_2.0
+   95,983    4    4    38,031     0     0    34,409  3,152  3,150  concord.c:new_word_node
+   85,440    0    0    42,720     0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
+</pre>
+
+First up is a summary of the annotation options:
+                    
+<ul>
+  <li>I1 cache, D1 cache, L2 cache: cache configuration.  So you know the
+      configuration with which these results were obtained.</li><p>
+
+  <li>Command: the command line invocation of the program under
+      examination.</li><p>
+
+  <li>Events recorded: event abbreviations are:<p>
+  <ul>
+    <li><code>Ir  </code>:  I cache reads (ie. instructions executed)</li>
+    <li><code>I1mr</code>: I1 cache read misses</li>
+    <li><code>I2mr</code>: L2 cache instruction read misses</li>
+    <li><code>Dr  </code>:  D cache reads (ie. memory reads)</li>
+    <li><code>D1mr</code>: D1 cache read misses</li>
+    <li><code>D2mr</code>: L2 cache data read misses</li>
+    <li><code>Dw  </code>:  D cache writes (ie. memory writes)</li>
+    <li><code>D1mw</code>: D1 cache write misses</li>
+    <li><code>D2mw</code>: L2 cache data write misses</li>
+  </ul><p>
+      Note that D1 total misses is given by <code>D1mr</code> +
+      <code>D1mw</code>, and that L2 total misses is given by
+      <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>
+
+  <li>Events shown: the events shown (a subset of events gathered).  This can
+      be adjusted with the <code>--show</code> option.</li><p>
+
+  <li>Event sort order: the sort order in which functions are shown.  For
+      example, in this case the functions are sorted from highest
+      <code>Ir</code> counts to lowest.  If two functions have identical
+      <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
+      counts, and so on.  This order can be adjusted with the
+      <code>--sort</code> option.<p>
+
+      Note that this dictates the order the functions appear.  It is <b>not</b>
+      the order in which the columns appear;  that is dictated by the "events
+      shown" line (and can be changed with the <code>--show</code> option).
+      </li><p>
+
+  <li>Threshold: vg_annotate by default omits functions that cause very low
+      numbers of misses to avoid drowning you in information.  In this case,
+      vg_annotate shows summaries of the functions that account for 99% of the
+      <code>Ir</code> counts; <code>Ir</code> is chosen as the threshold event
+      since it is the primary sort event.  The threshold can be adjusted with
+      the <code>--threshold</code> option.</li><p>
+
+  <li>Chosen for annotation: names of files specified manually for annotation; 
+      in this case none.</li><p>
+
+  <li>Auto-annotation: whether auto-annotation was requested via the 
+      <code>--auto=yes</code> option. In this case no.</li><p>
+</ul>
+
+Then follow summary statistics for the whole program. These are similar
+to the summary provided when running <code>valgrind --cachesim=yes</code>.<p>
+  
+Then follow function-by-function statistics. Each function is identified by a
+<code>file_name:function_name</code> pair. If a column contains only a
+`.' it means the function never performs that event (eg. the third row shows
+that <code>strcmp()</code> contains no instructions that write to memory). The
+name <code>???</code> is used if the file name and/or function name could
+not be determined from debugging information. (If most of the entries have the
+form <code>???:???</code> the program probably wasn't compiled with
+<code>-g</code>.)<p> 
+
+It is worth noting that functions will come from three types of source files:
+<ol>
+  <li>From the profiled program (<code>concord.c</code> in this example).</li>
+  <li>From libraries (eg. <code>getc.c</code>).</li>
+  <li>From Valgrind's implementation of some libc functions (eg.
+      <code>vg_clientmalloc.c:malloc</code>).  These are recognisable because
+      the filename begins with <code>vg_</code>, and is probably one of
+      <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
+      <code>vg_mylibc.c</code>.
+  </li>
+</ol>
+
+There are two ways to annotate source files -- by choosing them manually, or
+with the <code>--auto=yes</code> option. To do it manually, just
+specify the filenames as arguments to vg_annotate. For example, running
+<code>vg_annotate concord.c</code> for our example produces the same
+output as above followed by an annotated version of <code>concord.c</code>, a
+section of which looks like:
+
+<pre>
+--------------------------------------------------------------------------------
+-- User-annotated source: concord.c
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr      D1mr  D2mr  Dw      D1mw   D2mw
+
+[snip]
+
+        .    .    .       .     .     .       .      .      .  void init_hash_table(char *file_name, Word_Node *table[])
+        3    1    1       .     .     .       1      0      0  {
+        .    .    .       .     .     .       .      .      .      FILE *file_ptr;
+        .    .    .       .     .     .       .      .      .      Word_Info *data;
+        1    0    0       .     .     .       1      1      1      int line = 1, i;
+        .    .    .       .     .     .       .      .      .
+        5    0    0       .     .     .       3      0      0      data = (Word_Info *) create(sizeof(Word_Info));
+        .    .    .       .     .     .       .      .      .
+    4,991    0    0   1,995     0     0     998      0      0      for (i = 0; i < TABLE_SIZE; i++)
+    3,988    1    1   1,994     0     0     997     53     52          table[i] = NULL;
+        .    .    .       .     .     .       .      .      .
+        .    .    .       .     .     .       .      .      .      /* Open file, check it. */
+        6    0    0       1     0     0       4      0      0      file_ptr = fopen(file_name, "r");
+        2    0    0       1     0     0       .      .      .      if (!(file_ptr)) {
+        .    .    .       .     .     .       .      .      .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
+        1    1    1       .     .     .       .      .      .          exit(EXIT_FAILURE);
+        .    .    .       .     .     .       .      .      .      }
+        .    .    .       .     .     .       .      .      .
+  165,062    1    1  73,360     0     0  91,700      0      0      while ((line = get_word(data, line, file_ptr)) != EOF)
+  146,712    0    0  73,356     0     0  73,356      0      0          insert(data->word, data->line, table);
+        .    .    .       .     .     .       .      .      .
+        4    0    0       1     0     0       2      0      0      free(data);
+        4    0    0       1     0     0       2      0      0      fclose(file_ptr);
+        3    0    0       2     0     0       .      .      .  }
+</pre>
+
+(Although column widths are automatically minimised, a wide terminal is clearly
+useful.)<p>
+  
+Each source file is clearly marked (<code>User-annotated source</code>) as
+having been chosen manually for annotation.  If the file was found in one of
+the directories specified with the <code>-I</code>/<code>--include</code>
+option, the directory and file are both given.<p>
+
+Each line is annotated with its event counts.  Events not applicable for a line
+are represented by a `.';  this is useful for distinguishing between an event
+which cannot happen, and one which can but did not.<p> 
+
+Sometimes only a small section of a source file is executed.  To minimise
+uninteresting output, Valgrind only shows annotated lines and lines within a
+small distance of annotated lines.  Gaps are marked with the line numbers so
+you know which part of a file the shown code comes from, eg:
+
+<pre>
+(figures and code for line 704)
+-- line 704 ----------------------------------------
+-- line 878 ----------------------------------------
+(figures and code for line 878)
+</pre>
+
+The amount of context to show around annotated lines is controlled by the
+<code>--context</code> option.<p>
+
+To get automatic annotation, run <code>vg_annotate --auto=yes</code>.
+vg_annotate will automatically annotate every source file it can find that is
+mentioned in the function-by-function summary.  Therefore, the files chosen for
+auto-annotation  are affected by the <code>--sort</code> and
+<code>--threshold</code> options.  Each source file is clearly marked
+(<code>Auto-annotated source</code>) as being chosen automatically.  Any files
+that could not be found are mentioned at the end of the output, eg:    
+
+<pre>
+--------------------------------------------------------------------------------
+The following files chosen for auto-annotation could not be found:
+--------------------------------------------------------------------------------
+  getc.c
+  ctype.c
+  ../sysdeps/generic/lockfile.c
+</pre>
+
+This is quite common for library files, since libraries are usually compiled
+with debugging information, but the source files are often not present on a
+system.  If a file is chosen for annotation <b>both</b> manually and
+automatically, it is marked as <code>User-annotated source</code>.
+
+Use the <code>-I/--include</code> option to tell Valgrind where to look for
+source files if the filenames found from the debugging information aren't
+specific enough.
+
+Beware that vg_annotate can take some time to digest large
+<code>cachegrind.out</code> files, eg. 30 seconds or more.  Also beware that
+auto-annotation can produce a lot of output if your program is large!
+
+
+<h3>7.8&nbsp; Annotating assembler programs</h3>
+Valgrind can annotate assembler programs too, or annotate the assembler
+generated for your C program.  Sometimes this is useful for understanding what
+is really happening when an interesting line of C code is translated into
+multiple instructions.<p>
+
+To do this, you just need to assemble your <code>.s</code> files with
+assembler-level debug information.  gcc doesn't do this, but you can use GNU as
+with the <code>--gstabs</code> option to generate object files with this
+information, eg:
+
+<blockquote><code>as --gstabs foo.s</code></blockquote>
+
+You can then profile and annotate source files in the same way as for C/C++
+programs.
+
+
+<h3>7.9&nbsp; vg_annotate options</h3>
+<ul>
+  <li><code>-h, --help</code></li><p>
+  <li><code>-v, --version</code><p>
+
+      Help and version, as usual.</li>
+
+  <li><code>--sort=A,B,C</code> [default: order in 
+      <code>cachegrind.out</code>]<p>
+      Specifies the events upon which the sorting of the function-by-function
+      entries will be based.  Useful if you want to concentrate on eg. I cache
+      misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
+      (<code>--sort=D1mr,D2mr</code>), or L2 misses
+      (<code>--sort=D2mr,I2mr</code>).</li><p>
+
+  <li><code>--show=A,B,C</code> [default: all, using order in
+      <code>cachegrind.out</code>]<p>
+      Specifies which events to show (and the column order). Default is to use
+      all present in the <code>cachegrind.out</code> file (and use the order in
+      the file).</li><p>
+
+  <li><code>--threshold=X</code> [default: 99%] <p>
+      Sets the threshold for the function-by-function summary.  Functions are
+      shown that account for more than X% of all the primary sort events.  If
+      auto-annotating, also affects which files are annotated.</li><p>
+
+  <li><code>--auto=no</code> [default]<br>
+      <code>--auto=yes</code> <p>
+      When enabled, automatically annotates every file that is mentioned in the
+      function-by-function summary that can be found.  Also gives a list of
+      those that couldn't be found.</li><p>
+
+  <li><code>--context=N</code> [default: 8]<p>
+      Print N lines of context before and after each annotated line.  Avoids
+      printing large sections of source files that were not executed.  Use a 
+      large number (eg. 10,000) to show all source lines.
+      </li><p>
+
+  <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code> 
+      [default: empty string]<p>
+      Adds a directory to the list in which to search for files.  Multiple
+      -I/--include options can be given to add multiple directories.</li><p>
+</ul>
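+
+For example, to concentrate on data cache read misses, the options above might
+be combined as follows (an illustrative invocation only; the directory
+<code>src</code> is hypothetical):
+
+<blockquote><code>
+  vg_annotate --sort=D1mr,D2mr --show=Dr,D1mr,D2mr --auto=yes -I=src
+</code></blockquote>
+
+This would sort functions by <code>D1mr</code> (then <code>D2mr</code>), show
+only the three data-read columns, and auto-annotate the source files it can
+find, also searching <code>src</code> for them.<p>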
+  
+
+<h3>7.10&nbsp; Warnings</h3>
+There are a couple of situations in which vg_annotate issues warnings.
+
+<ul>
+  <li>If a source file is more recent than the <code>cachegrind.out</code>
+      file.  This is because the information in <code>cachegrind.out</code> is
+      only recorded with line numbers, so if the line numbers change at all in
+      the source (eg. lines added, deleted, swapped), any annotations will be 
+      incorrect.</li><p>
+
+  <li>If information is recorded about line numbers past the end of a file.
+      This can be caused by the above problem, ie. shortening the source file
+      while using an old <code>cachegrind.out</code> file.  If this happens,
+      the figures for the bogus lines are printed anyway (clearly marked as
+      bogus) in case they are important.</li><p>
+</ul>
+
+
+<h3>7.10&nbsp; Things to watch out for</h3>
+Some odd things that can occur during annotation:
+
+<ul>
+  <li>If annotating at the assembler level, you might see something like this:
+
+      <pre>
+      1    0    0  .    .    .  .    .    .          leal -12(%ebp),%eax
+      1    0    0  .    .    .  1    0    0          movl %eax,84(%ebx)
+      2    0    0  0    0    0  1    0    0          movl $1,-20(%ebp)
+      .    .    .  .    .    .  .    .    .          .align 4,0x90
+      1    0    0  .    .    .  .    .    .          movl $.LnrB,%eax
+      1    0    0  .    .    .  1    0    0          movl %eax,-16(%ebp)
+      </pre>
+
+      How can the third instruction be executed twice when the others are
+      executed only once?  As it turns out, it isn't.  Here's a dump of the
+      executable, from objdump:
+
+      <pre>
+      8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
+      8048f28:       89 43 54                mov    %eax,0x54(%ebx)
+      8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
+      8048f32:       89 f6                   mov    %esi,%esi
+      8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
+      8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)
+      </pre>
+
+      Notice the extra <code>mov %esi,%esi</code> instruction.  Where did this
+      come from?  The GNU assembler inserted it to serve as the two bytes of
+      padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
+      a four-byte boundary, but pretended it didn't exist when adding debug
+      information.  Thus when Valgrind reads the debug info it thinks that the
+      <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
+      range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
+      <code>mov %esi,%esi</code> to it.<p>
+  </li>
+
+  <li>
+      Inlined functions can cause strange results in the function-by-function
+      summary.  If a function <code>inline_me()</code> is defined in
+      <code>foo.h</code> and inlined in the functions <code>f1()</code>,
+      <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
+      not be a <code>foo.h:inline_me()</code> function entry.  Instead, there
+      will be separate function entries for each inlining site, ie.
+      <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
+      <code>foo.h:f3()</code>.  To find the total counts for
+      <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>
+
+      The reason for this is that although the debug info output by gcc
+      indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
+      doesn't indicate the name of the function in <code>foo.h</code>, so
+      Valgrind keeps using the old one.</li><p>
+
+  <li>
+      Sometimes, the same filename might be represented with a relative name
+      and with an absolute name in different parts of the debug info, eg:
+      <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>.  In this
+      case, if you use auto-annotation, the file will be annotated twice with
+      the counts split between the two.<p>
+  </li>
+</ul>
+
+Note: stabs is not an easy format to read.  If you come across bizarre
+annotations that look like they might be caused by a bug in the stabs reader,
+please let us know.
+
+
+<h3>7.11&nbsp; Accuracy</h3>
+Valgrind's cache profiling has a number of shortcomings:
+
+<ul>
+  <li>It doesn't account for kernel activity -- the effect of system calls on
+      the cache contents is ignored.</li><p>
+
+  <li>It doesn't account for other process activity (although this is probably
+      desirable when considering a single program).</li><p>
+
+  <li>It doesn't account for virtual-to-physical address mappings;  hence the
+      entire simulation is not a true representation of what's happening in the
+      cache.</li><p>
+
+  <li>It doesn't account for cache misses not visible at the instruction level,
+      eg. those arising from TLB misses, or speculative execution.</li><p>
+</ul>
+
+Another thing worth noting is that results are very sensitive.  Changing the
+size of the <code>valgrind.so</code> file, the size of the program being
+profiled, or even the length of its name can perturb the results.  Variations
+will be small, but don't expect perfectly repeatable results if your program
+changes at all.<p>
+
+While these factors mean you shouldn't trust the results to be super-accurate,
+hopefully they should be close enough to be useful.<p>
+
+
+<h3>7.12&nbsp; Todo</h3>
+<ul>
+  <li>Use CPUID instruction to auto-identify cache configuration during 
+      installation.  This would save the user from having to know their cache
+      configuration and use vg_cachegen.</li><p>
+  <li>Program start-up/shut-down calls a lot of functions that aren't
+      interesting and just complicate the output.  Would be nice to exclude
+      these somehow.</li><p>
+</ul> 
+<hr width="100%">
 </body>
 </html>
+