| <?xml version="1.0"?> <!-- -*- sgml -*- --> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
| "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"> |
| |
| <chapter id="bbv-manual" xreflabel="BBV"> |
| <title>BBV: an experimental basic block vector generation tool</title> |
| |
| <para>To use this tool, you must specify |
| <option>--tool=exp-bbv</option> on the Valgrind |
| command line.</para> |
| |
| <sect1 id="bbv-manual.overview" xreflabel="Overview"> |
| <title>Overview</title> |
| |
| <para> |
| A basic block is a linear section of code with one entry point and one exit |
| point. A <emphasis>basic block vector</emphasis> (BBV) is a list of all |
| basic blocks entered during program execution, and a count of how many |
| times each basic block was run. |
| </para> |
| |
| <para> |
| BBV is a tool that generates basic block vectors for use with the |
| <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint</ulink> |
| analysis tool. |
| The SimPoint methodology enables speeding up architectural |
| simulations by only running a small portion of a program |
| and then extrapolating total behavior from this |
| small portion. Most programs exhibit phase-based behavior, which |
| means that at various times during execution a program will encounter |
| intervals of time where the code behaves similarly to a previous |
| interval. If you can detect these intervals and group them together, |
| an approximation of the total program behavior can be obtained |
| by only simulating a bare minimum number of intervals, and then scaling |
| the results. |
| </para> |
| |
| <para> |
| In computer architecture research, running a |
| benchmark on a cycle-accurate simulator can cause slowdowns on the order |
| of 1000 times, making it take days, weeks, or even longer to run full |
| benchmarks. Using SimPoint, this time can be reduced significantly, |
| usually by 90-95%, while still retaining reasonable accuracy. |
| </para> |
| |
| <para> |
| A more complete introduction to how SimPoint works can be |
| found in the paper "Automatically Characterizing Large Scale |
| Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and |
| B. Calder. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.quickstart" xreflabel="Quick Start"> |
| <title>Using Basic Block Vectors to create SimPoints</title> |
| |
| <para> |
| To quickly create a basic block vector file, run Valgrind |
| like this: |
| |
| <programlisting>valgrind --tool=exp-bbv /bin/ls</programlisting> |
| |
| In this case we are running on <filename>/bin/ls</filename>, |
| but this can be any program. By default a file called |
| <computeroutput>bb.out.PID</computeroutput> will be created, |
| where PID is replaced by the process ID of the running process. |
| This file contains the basic block vector. For long-running programs |
| this file can be quite large, so it might be wise to compress |
| it with gzip or some other compression program. |
| </para> |
| |
| <para> |
| To create actual SimPoint results, you will need the SimPoint utility, |
| available from the |
| <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint webpage</ulink>. |
| Assuming you have downloaded SimPoint 3.2 and compiled it, |
| create SimPoint results with a command like the following: |
| |
| <programlisting><![CDATA[ |
| ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \ |
| -loadFVFile bb.out.1234.gz \ |
| -k 5 -saveSimpoints results.simpts \ |
| -saveSimpointWeights results.weights]]></programlisting> |
| |
| where bb.out.1234.gz is your compressed basic block vector file |
| generated by BBV. |
| </para> |
| |
| <para> |
| The SimPoint utility performs a random linear projection down to 15 dimensions, |
| then uses k-means clustering to calculate which intervals are |
| of interest. In this example we specify 5 intervals with the |
| <option>-k 5</option> option. |
| </para> |
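| |
| <para> |
| As a rough illustration of the projection step only, the Python sketch |
| below reduces one interval's basic block vector to 15 dimensions. It is |
| not SimPoint's code: the function names, the fixed seed, the uniform |
| entries in [-1, 1], and the use of block IDs as 1-based row indices are |
| all assumptions made for this example, and the clustering step is omitted. |
| </para> |
| |
| <programlisting><![CDATA[ |
```python
import random


def random_projection_matrix(n_blocks, n_dims=15, seed=0):
    """Build an n_blocks x n_dims matrix of values drawn uniformly from [-1, 1].

    The distribution and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
            for _ in range(n_blocks)]


def project(interval, matrix):
    """Project one interval onto the columns of `matrix`.

    `interval` maps a block ID to its weighted count for that interval.
    Block IDs are assumed to be 1-based indices into the rows of `matrix`.
    """
    n_dims = len(matrix[0])
    out = [0.0] * n_dims
    total = sum(interval.values())
    if total == 0:
        return out  # empty interval: nothing to project
    for block_id, count in interval.items():
        share = count / total  # normalise so intervals are comparable
        row = matrix[block_id - 1]
        for d in range(n_dims):
            out[d] += share * row[d]
    return out
```
| ]]></programlisting> |
| |
| <para> |
| For example, <computeroutput>project({45: 1024, 189: 99343}, |
| random_projection_matrix(200))</computeroutput> returns a list of 15 |
| floating-point values that a clustering routine could then consume. |
| </para> |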
| |
| <para> |
| The outputs from the SimPoint run are the |
| <computeroutput>results.simpts</computeroutput> |
| and <computeroutput>results.weights</computeroutput> files. |
| The first holds the 5 most relevant intervals of the program. |
| The second holds the weight by which to scale each interval when |
| extrapolating full-program behavior. The intervals and the weights |
| can be used in conjunction with a simulator that supports |
| fast-forwarding: you fast-forward to the interval of interest, |
| collect statistics for the desired interval length, then combine the |
| gathered statistics with the weights to |
| calculate your results. |
| </para> |
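| |
| <para> |
| The final scaling step can be sketched in a few lines of Python. This is |
| a minimal example, assuming the statistic is a rate such as cycles per |
| instruction, for which a weighted average is meaningful. The function |
| name and the dictionary-based inputs are hypothetical, and the sketch |
| does not read the SimPoint output files itself. |
| </para> |
| |
| <programlisting><![CDATA[ |
```python
def extrapolate(per_interval_stat, weights):
    """Estimate a whole-program statistic from the simulated intervals.

    per_interval_stat: maps a label for each simulated interval to the
                       statistic measured on it (for example CPI).
    weights:           maps the same labels to the weights reported by
                       SimPoint.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(weights[label] * per_interval_stat[label]
                       for label in weights)
    # Dividing by the total keeps the result correct even if the
    # weights do not sum exactly to one.
    return weighted_sum / total_weight
```
| ]]></programlisting> |
| |
| <para> |
| With two intervals measured at a CPI of 1.0 and 2.0 and weights of 0.75 |
| and 0.25, the estimate is 0.75 * 1.0 + 0.25 * 2.0 = 1.25. |
| </para> |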
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.usage" xreflabel="BBV Command-line Options"> |
| <title>BBV Command-line Options</title> |
| |
| <para> BBV-specific command-line options are:</para> |
| |
| <!-- start of xi:include in the manpage --> |
| <variablelist id="bbv.opts.list"> |
| |
| <varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file"> |
| <term> |
| <option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option> |
| </term> |
| <listitem> |
| <para> |
| This option selects the name of the basic block vector file. The |
| <option>%p</option> and <option>%q</option> format specifiers can be |
| used to embed the process ID and/or the contents of an environment |
| variable in the name, as is the case for the core option |
| <option><xref linkend="opt.log-file"/></option>. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file"> |
| <term> |
| <option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option> |
| </term> |
| <listitem> |
| <para> |
| This option selects the name of the PC file. |
| This file holds program counter addresses |
| and function name info for the various basic blocks. |
| This can be used in conjunction |
| with the basic block vector file to fast-forward via function names |
| instead of just instruction counts. The |
| <option>%p</option> and <option>%q</option> format specifiers can be |
| used to embed the process ID and/or the contents of an environment |
| variable in the name, as is the case for the core option |
| <option><xref linkend="opt.log-file"/></option>. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.interval-size" xreflabel="--interval-size"> |
| <term> |
| <option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option> |
| </term> |
| <listitem> |
| <para> |
| This option selects the size of the interval to use. |
| The default is 100 |
| million instructions, which is a commonly used value. |
| Other sizes can be used; smaller intervals can help programs |
| with finer-grained phases. However, smaller interval sizes |
| can lead to accuracy issues due to warm-up effects. |
| When fast-forwarding, the various architectural features |
| are uninitialized, and it takes some number |
| of instructions before they "warm up" to the state a |
| full simulation would have reached without the fast-forwarding. |
| Large interval sizes tend to mitigate this. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only"> |
| <term> |
| <option><![CDATA[--instr-count-only [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para> |
| This option tells the tool to only display instruction count |
| totals, and to not generate the actual basic block vector file. |
| This is useful for debugging, and for gathering instruction count |
| info without generating the large basic block vector files. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| |
| </variablelist> |
| <!-- end of xi:include in the manpage --> |
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format"> |
| <title>Basic Block Vector File Format</title> |
| |
| <para> |
| The Basic Block Vector is dumped at fixed intervals. This |
| is commonly done every 100 million instructions; the |
| <option>--interval-size</option> option can be |
| used to change this. |
| </para> |
| |
| <para> |
| The output file looks like this: |
| </para> |
| |
| <programlisting><![CDATA[ |
| T:45:1024 :189:99343 |
| T:11:78573 :15:1353 :56:1 |
| T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting> |
| |
| <para> |
| Each new interval starts with a T. This is followed on the same line |
| by a series of basic block and frequency pairs, one for each |
| basic block that was entered during the interval. The format for |
| each block/frequency pair is a colon, followed by a number that |
| uniquely identifies the basic block, another colon, and then |
| the frequency (which is the number of times the block was entered, |
| multiplied by the number of instructions in the block). The |
| pairs are separated from each other by a space. |
| </para> |
| |
| <para> |
| The frequency count is multiplied by the number of instructions |
| in the basic block in order to weight the count, so that instructions in |
| small basic blocks aren't counted as more important than instructions |
| in large basic blocks. |
| </para> |
| |
| <para> |
| The SimPoint program only processes lines that start with a "T". All |
| other lines are ignored. Traditionally comments are indicated by |
| starting a line with a "#" character. Some other BBV generation tools, |
| such as PinPoints, generate lines beginning with letters other than "T" |
| to indicate more information about the program being run. We do |
| not generate these, as the SimPoint utility ignores them. |
| </para> |
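| |
| <para> |
| The record layout described in this section can be read with a short |
| Python sketch such as the one below. The function names are hypothetical. |
| The parser keeps only lines that start with "T" and, as a leniency of |
| this example, also accepts a pair whose leading colon is missing. |
| </para> |
| |
| <programlisting><![CDATA[ |
```python
def parse_bbv_line(line):
    """Parse one interval record into {block_id: frequency}.

    Returns None for any line that does not start with "T", such as a
    "#" comment or a record type written by another tool.
    """
    if not line.startswith("T"):
        return None
    counts = {}
    # After the "T", pairs are separated by whitespace and each pair
    # looks like ":<block id>:<frequency>".
    for field in line[1:].split():
        block_id, freq = field.lstrip(":").split(":")
        counts[int(block_id)] = counts.get(int(block_id), 0) + int(freq)
    return counts


def read_bbv(lines):
    """Return the parsed intervals from an iterable of lines, in order."""
    parsed = (parse_bbv_line(line) for line in lines)
    return [record for record in parsed if record is not None]
```
| ]]></programlisting> |
| |
| <para> |
| Passing an open file object to <computeroutput>read_bbv</computeroutput> |
| yields one dictionary per interval; a gzip-compressed file can be opened |
| in text mode with Python's <computeroutput>gzip.open(name, "rt")</computeroutput>. |
| </para> |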
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.implementation" xreflabel="Implementation"> |
| <title>Implementation</title> |
| |
| <para> |
| Valgrind provides all of the information necessary to create |
| BBV files. In the current implementation, all instructions |
| are instrumented. This is slower (by approximately a factor |
| of two) than a method that instruments at the basic block level, |
| but there are some complications (especially with rep prefix |
| detection) that make that method more difficult. |
| </para> |
| |
| <para> |
| Valgrind actually provides instrumentation at the superblock level. |
| A superblock has one entry point but, unlike a basic block, can |
| have multiple exit points. Once a branch occurs into the middle |
| of a block, it is split into a new basic block. Because |
| Valgrind cannot produce "true" basic blocks, the generated |
| basic block vectors will differ from those generated by other tools. |
| In practice this does not seem to affect the accuracy of the |
| SimPoint results. We do internally force the |
| <option>--vex-guest-chase-thresh=0</option> |
| option to Valgrind, which forces more basic-block-like |
| behavior. |
| </para> |
| |
| <para> |
| When a superblock is run for the first time, it is instrumented |
| with our BBV routine. A block info (bbInfo) structure is allocated |
| which holds the various information and statistics for the block. |
| A unique block ID is assigned to the block, and then the |
| structure is placed into an ordered set. |
| Then each native instruction in the block is instrumented to |
| call an instruction counting routine with a pointer to the block |
| info structure as an argument. |
| </para> |
| |
| <para> |
| At run-time, our instruction counting routines are called once |
| per native instruction. The relevant block info structure is accessed |
| and the block count and total instruction count are updated. |
| If the total instruction count exceeds the interval size, |
| we walk the ordered set, writing out the statistics for |
| any block that was accessed in the interval, and then reset the |
| block counters to zero. |
| </para> |
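| |
| <para> |
| The counting scheme can be modelled in Python as follows. This is a |
| simplified sketch rather than the tool's actual code: the class and |
| method names are invented, block IDs are supplied by the caller, a list |
| stands in for the output file, and resetting the interval total to zero |
| after a dump is an assumption of the example. |
| </para> |
| |
| <programlisting><![CDATA[ |
```python
class BlockInfo:
    """Per-block record, loosely modelled on the bbInfo structure."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.count = 0  # instructions executed in this block this interval


class IntervalCounter:
    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.total = 0    # instructions executed in the current interval
        self.blocks = {}  # block id -> BlockInfo
        self.output = []  # stand-in for the basic block vector file

    def block_info(self, block_id):
        """Create or look up a block's record (done at instrumentation time)."""
        if block_id not in self.blocks:
            self.blocks[block_id] = BlockInfo(block_id)
        return self.blocks[block_id]

    def per_instruction(self, info):
        """Run once for every executed instruction of the block `info`."""
        info.count += 1
        self.total += 1
        if self.total > self.interval_size:
            self.dump()

    def dump(self):
        """Write one "T" record for the blocks touched, then reset counters."""
        fields = []
        for block_id in sorted(self.blocks):  # walk the blocks in ID order
            info = self.blocks[block_id]
            if info.count:
                fields.append(":%d:%d" % (block_id, info.count))
                info.count = 0
        self.output.append("T" + " ".join(fields))
        self.total = 0  # assumption: start the next interval from zero
```
| ]]></programlisting> |
| |
| <para> |
| Because the counter in this sketch is incremented once per instruction |
| rather than once per block entry, each recorded value already equals the |
| number of entries multiplied by the number of instructions in the block. |
| </para> |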
| |
| <para> |
| On the x86 and amd64 architectures the counting code has extra |
| code to handle rep-prefixed string instructions. This is because |
| actual hardware counts a rep-prefixed instruction |
| as one instruction, while a naive Valgrind implementation |
| would count it as many (possibly hundreds, thousands or even millions) |
| of instructions. We handle rep-prefixed instructions specially, |
| in order to make the results match those obtained with hardware performance |
| counters. |
| </para> |
| |
| <para> |
| BBV also counts the fldcw instruction. This instruction is used on |
| x86 machines in various ways; it is most commonly found when converting |
| floating point values into integers. |
| On Pentium 4 systems the retired instruction performance |
| counter counts this instruction as two instructions (all other |
| known processors only count it as one). |
| This can affect results when using SimPoint on Pentium 4 systems. |
| We provide the fldcw count so that users can evaluate whether it |
| will impact their results enough to avoid using Pentium 4 machines |
| for their experiments. It would be possible to add an option to |
| this tool that mimics the double-counting so that the generated BBV |
| files would be usable for experiments using hardware performance |
| counters on Pentium 4 systems. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support"> |
| <title>Threaded Executable Support</title> |
| |
| <para> |
| BBV supports threaded programs. When a program has multiple threads, |
| an additional basic block vector file is created for each thread (each |
| additional file is the specified filename with the thread number |
| appended at the end). |
| </para> |
| |
| <para> |
| There is no official method of using SimPoint with |
| threaded workloads. The most common method is to run |
| SimPoint on each thread's results independently, and use |
| some method of deterministic execution to try to match the |
| original workload. This should be possible with the current |
| BBV. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.validation" xreflabel="BBV Validation"> |
| <title>Validation</title> |
| |
| <para> |
| BBV has been tested on x86, amd64, and ppc32 platforms. |
| An earlier version of BBV was tested in detail using |
| hardware performance counters; this work is described in a paper |
| from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation |
| to Generate Multi-Platform SimPoints: Methodology and Accuracy" by |
| V.M. Weaver and S.A. McKee. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="bbv-manual.performance" xreflabel="BBV Performance"> |
| <title>Performance</title> |
| |
| <para> |
| Using this program slows down execution by roughly a factor of 40 |
| over native execution. This varies depending on the machine |
| used and the benchmark being run. |
| On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D |
| processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2). |
| </para> |
| |
| </sect1> |
| |
| </chapter> |