<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="bbv-manual" xreflabel="BBV">
<title>BBV: an experimental basic block vector generation tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-bbv</option> on the Valgrind
command line.</para>

<sect1 id="bbv-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>
A basic block is a linear section of code with one entry point and one exit
point. A <emphasis>basic block vector</emphasis> (BBV) is a list of all
basic blocks entered during program execution, and a count of how many
times each basic block was run.
</para>

<para>
BBV is a tool that generates basic block vectors for use with the
<ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint</ulink>
analysis tool.
The SimPoint methodology speeds up architectural
simulations by running only a small portion of a program
and then extrapolating total behavior from this
small portion. Most programs exhibit phase-based behavior, which
means that at various times during execution a program will encounter
intervals of time where the code behaves similarly to a previous
interval. If you can detect these intervals and group them together,
an approximation of the total program behavior can be obtained
by simulating only a small number of representative intervals, and then
scaling the results.
</para>

<para>
In computer architecture research, running a
benchmark on a cycle-accurate simulator can cause slowdowns on the order
of 1000 times, making it take days, weeks, or even longer to run full
benchmarks. Using SimPoint can reduce this simulation time significantly,
usually by 90-95%, while still retaining reasonable accuracy.
</para>

<para>
A more complete introduction to how SimPoint works can be
found in the paper "Automatically Characterizing Large Scale
Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and
B. Calder.
</para>

</sect1>

<sect1 id="bbv-manual.quickstart" xreflabel="Quick Start">
<title>Using Basic Block Vectors to create SimPoints</title>

<para>
To quickly create a basic block vector file, run Valgrind
like this:

<programlisting>valgrind --tool=exp-bbv /bin/ls</programlisting>

In this case we are running on <filename>/bin/ls</filename>,
but this can be any program. By default a file called
<computeroutput>bb.out.PID</computeroutput> will be created,
where PID is replaced by the process ID of the running process.
This file contains the basic block vector. For long-running programs
this file can be quite large, so it might be wise to compress
it with gzip or some other compression program.
</para>
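
<para>
If you would rather compress the file from a script than from the shell,
the following is a minimal Python sketch using only the standard
<computeroutput>gzip</computeroutput> and
<computeroutput>shutil</computeroutput> modules. The function name
<computeroutput>compress_bbv</computeroutput> is made up for this
illustration and is not part of BBV or Valgrind.
</para>

<programlisting><![CDATA[
import gzip
import shutil

def compress_bbv(path):
    """Compress a BBV output file to path + '.gz' and return the new name.

    Hypothetical helper: the original file is left in place.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return gz_path]]></programlisting>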

<para>
To create actual SimPoint results, you will need the SimPoint utility,
available from the
<ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint webpage</ulink>.
Assuming you have downloaded SimPoint 3.2 and compiled it,
create SimPoint results with a command like the following:

<programlisting><![CDATA[
./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
-loadFVFile bb.out.1234.gz \
-k 5 -saveSimpoints results.simpts \
-saveSimpointWeights results.weights]]></programlisting>

where <filename>bb.out.1234.gz</filename> is your compressed basic block vector file
generated by BBV.
</para>

<para>
The SimPoint utility does random linear projection down to 15 dimensions,
then does k-means clustering to calculate which intervals are
of interest. In this example we specify 5 intervals with the
<option>-k 5</option> option.
</para>
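
<para>
To make these two steps concrete, here is a rough Python sketch of random
linear projection followed by plain k-means. It is not SimPoint's code:
the function names, the uniform random projection matrix, the
normalisation of each vector, and the fixed iteration count are all
assumptions chosen to keep the example short.
</para>

<programlisting><![CDATA[
import random

def project(vectors, dims=15, seed=0):
    """Project sparse vectors (dict: block id -> frequency) down to `dims` values."""
    rng = random.Random(seed)
    rows = {}  # block id -> one random row of the projection matrix
    projected = []
    for vec in vectors:
        total = float(sum(vec.values())) or 1.0  # normalise (an assumption)
        point = [0.0] * dims
        for block, freq in vec.items():
            if block not in rows:
                rows[block] = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
            row = rows[block]
            for d in range(dims):
                point[d] += (freq / total) * row[d]
        projected.append(point)
    return projected

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels]]></programlisting>

<para>
Intervals that execute the same mix of basic blocks project to nearby
points and so end up in the same cluster; one representative interval
per cluster is then enough to stand in for the whole group.
</para>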

<para>
The outputs from the SimPoint run are the
<computeroutput>results.simpts</computeroutput>
and <computeroutput>results.weights</computeroutput> files.
The first holds the 5 most relevant intervals of the program.
The second holds the weight to scale each interval by when
extrapolating full-program behavior. The intervals and the weights
can be used in conjunction with a simulator that supports
fast-forwarding; you fast-forward to the interval of interest,
collect statistics for the desired interval length, then combine
the gathered statistics with the weights to
calculate your results.
</para>
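
<para>
The final combining step is a weighted average. The sketch below assumes
you have already collected one statistic per simulated interval (for
example cycles per instruction) and read the matching weights yourself;
the function name <computeroutput>extrapolate</computeroutput> is
hypothetical, and no parsing of the SimPoint output files is shown.
</para>

<programlisting><![CDATA[
def extrapolate(per_interval_stat, weights):
    """Weighted average of a per-interval statistic (for example CPI).

    per_interval_stat[i] was measured on the i-th chosen interval and
    weights[i] is the weight SimPoint reported for that interval.
    """
    if len(per_interval_stat) != len(weights):
        raise ValueError("need exactly one weight per simulated interval")
    total_weight = sum(weights)  # normally 1.0; divide anyway to be safe
    weighted = sum(s * w for s, w in zip(per_interval_stat, weights))
    return weighted / total_weight]]></programlisting>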

</sect1>

<sect1 id="bbv-manual.usage" xreflabel="BBV Command-line Options">
<title>BBV Command-line Options</title>

<para> BBV-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="bbv.opts.list">

<varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file">
<term>
<option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option>
</term>
<listitem>
<para>
This option selects the name of the basic block vector file. The
<option>%p</option> and <option>%q</option> format specifiers can be
used to embed the process ID and/or the contents of an environment
variable in the name, as is the case for the core option
<option><xref linkend="opt.log-file"/></option>.
</para>
</listitem>
</varlistentry>

<varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file">
<term>
<option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option>
</term>
<listitem>
<para>
This option selects the name of the PC file.
This file holds program counter addresses
and function name info for the various basic blocks.
This can be used in conjunction
with the basic block vector file to fast-forward via function names
instead of just instruction counts. The
<option>%p</option> and <option>%q</option> format specifiers can be
used to embed the process ID and/or the contents of an environment
variable in the name, as is the case for the core option
<option><xref linkend="opt.log-file"/></option>.
</para>
</listitem>
</varlistentry>

<varlistentry id="opt.interval-size" xreflabel="--interval-size">
<term>
<option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option>
</term>
<listitem>
<para>
This option selects the size of the interval to use.
The default is 100
million instructions, which is a commonly used value.
Other sizes can be used; smaller intervals can help programs
with finer-grained phases. However, a smaller interval size
can lead to accuracy issues due to warm-up effects.
When fast-forwarding, the various architectural features
will be uninitialized, and it will take some number
of instructions before they "warm up" to the state a
full simulation would be in without the fast-forwarding.
Large interval sizes tend to mitigate this.
</para>
</listitem>
</varlistentry>

<varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only">
<term>
<option><![CDATA[--instr-count-only [default: no] ]]></option>
</term>
<listitem>
<para>
This option tells the tool to only display instruction count
totals, and to not generate the actual basic block vector file.
This is useful for debugging, and for gathering instruction count
info without generating the large basic block vector files.
</para>
</listitem>
</varlistentry>


</variablelist>
<!-- end of xi:include in the manpage -->

</sect1>

<sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format">
<title>Basic Block Vector File Format</title>

<para>
The Basic Block Vector is dumped at fixed intervals. This
is commonly done every 100 million instructions; the
<option>--interval-size</option> option can be
used to change this.
</para>

<para>
The output file looks like this:
</para>

<programlisting><![CDATA[
T:45:1024 :189:99343
T:11:78573 :15:1353 :56:1
T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting>

<para>
Each new interval starts with a T. This is followed on the same line
by a series of basic block and frequency pairs, one for each
basic block that was entered during the interval. The format for
each block/frequency pair is a colon, followed by a number that
uniquely identifies the basic block, another colon, and then
the frequency (which is the number of times the block was entered,
multiplied by the number of instructions in the block). The
pairs are separated from each other by a space.
</para>

<para>
The frequency count is multiplied by the number of instructions that are
in the basic block, in order to weight the count so that instructions in
small basic blocks aren't counted as more important than instructions
in large basic blocks.
</para>
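
<para>
A small worked example may help show why the weighting matters. The block
sizes and entry counts below are invented for illustration, and
<computeroutput>weighted_frequency</computeroutput> is a hypothetical
helper, not a BBV function.
</para>

<programlisting><![CDATA[
def weighted_frequency(times_entered, instructions_in_block):
    """Frequency as written to the BBV file: entries times block size."""
    return times_entered * instructions_in_block

# A tiny 2-instruction block entered 100 times, and a 50-instruction
# block entered only 10 times (made-up numbers).
small = weighted_frequency(100, 2)
large = weighted_frequency(10, 50)

# By raw entry count the small block looks ten times more important
# (100 versus 10). After weighting, the large block dominates, which
# matches where the instructions were actually executed.
print(small, large)]]></programlisting>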

<para>
The SimPoint program only processes lines that start with a "T". All
other lines are ignored. Traditionally comments are indicated by
starting a line with a "#" character. Some other BBV generation tools,
such as PinPoints, generate lines beginning with letters other than "T"
to indicate more information about the program being run. We do
not generate these, as the SimPoint utility ignores them.
</para>
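
<para>
The format described above is simple enough to read with a few lines of
script. The following Python sketch is one possible reader; the name
<computeroutput>parse_bbv</computeroutput> is hypothetical, and the code
follows only the rules stated in this section (lines starting with "T"
are intervals, everything else is skipped).
</para>

<programlisting><![CDATA[
def parse_bbv(lines):
    """Return one dict (block id -> frequency) per interval line."""
    intervals = []
    for line in lines:
        if not line.startswith("T"):
            continue  # comments and other record types are skipped
        vector = {}
        for pair in line[1:].split():
            # Each pair looks like ":<block>:<frequency>"; the leading
            # colon is stripped before splitting on the remaining one.
            block, freq = pair.lstrip(":").split(":")
            vector[int(block)] = int(freq)
        intervals.append(vector)
    return intervals]]></programlisting>

<para>
To read a compressed file, the lines can come from
<computeroutput>gzip.open(name, "rt")</computeroutput> in the Python
standard library instead of a plain <computeroutput>open</computeroutput>.
</para>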

</sect1>

<sect1 id="bbv-manual.implementation" xreflabel="Implementation">
<title>Implementation</title>

<para>
Valgrind provides all of the information necessary to create
BBV files. In the current implementation, all instructions
are instrumented. This is slower (by approximately a factor
of two) than a method that instruments at the basic block level,
but there are some complications (especially with rep prefix
detection) that make that method more difficult.
</para>

<para>
Valgrind actually provides instrumentation at a superblock level.
A superblock has one entry point but, unlike a basic block, can
have multiple exit points. Once a branch occurs into the middle
of a block, it is split into a new basic block. Because
Valgrind cannot produce "true" basic blocks, the generated
basic block vectors will differ from those generated by other tools.
In practice this does not seem to affect the accuracy of the
SimPoint results. We do internally force the
<option>--vex-guest-chase-thresh=0</option>
option to Valgrind, which forces a more basic-block-like
behavior.
</para>

<para>
When a superblock is run for the first time, it is instrumented
with our BBV routine. A block info (bbInfo) structure is allocated
which holds the various information and statistics for the block.
A unique block ID is assigned to the block, and then the
structure is placed into an ordered set.
Then each native instruction in the block is instrumented to
call an instruction counting routine with a pointer to the block
info structure as an argument.
</para>

<para>
At run-time, our instruction counting routines are called once
per native instruction. The relevant block info structure is accessed
and the block count and total instruction count are updated.
If the total instruction count exceeds the interval size
then we walk the ordered set, writing out the statistics for
any block that was accessed in the interval, then resetting the
block counters to zero.
</para>
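
<para>
The counting scheme can be modelled in a few lines of Python. This is an
illustrative model only, not the tool's C source: the class and method
names are invented, and the exact threshold comparison and the resetting
of the running total after each dump are assumptions of the sketch.
</para>

<programlisting><![CDATA[
class BlockInfo:
    """Stand-in for the per-block info structure."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.count = 0  # instructions executed in this block this interval

class IntervalCounter:
    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.blocks = {}   # block id -> BlockInfo, walked in id order
        self.total = 0     # instructions seen in the current interval
        self.lines = []    # output lines, one per finished interval

    def new_block(self):
        """Called when a block is first seen: assign the next unique id."""
        info = BlockInfo(len(self.blocks) + 1)
        self.blocks[info.block_id] = info
        return info

    def on_instruction(self, info):
        """Called once per executed instruction of the given block."""
        info.count += 1
        self.total += 1
        if self.total >= self.interval_size:  # threshold is an assumption
            self.dump()

    def dump(self):
        """Write out every block touched in the interval, then reset."""
        pairs = []
        for block_id in sorted(self.blocks):
            info = self.blocks[block_id]
            if info.count:
                pairs.append(":%d:%d" % (block_id, info.count))
                info.count = 0
        self.lines.append("T" + " ".join(pairs))
        self.total = 0  # assumption: start the next interval from zero]]></programlisting>

<para>
Because the counter is bumped once per instruction, a block that runs to
completion contributes its entry count multiplied by its instruction
count, which is the frequency value used in the output file.
</para>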

<para>
On the x86 and amd64 architectures the counting code includes extra
logic to handle rep-prefixed string instructions. This is because
actual hardware counts a rep-prefixed instruction
as one instruction, while a naive Valgrind implementation
would count it as many (possibly hundreds, thousands or even millions)
of instructions. We handle rep-prefixed instructions specially,
in order to make the results match those obtained with hardware performance
counters.
</para>

<para>
BBV also counts the fldcw instruction. This instruction is used on
x86 machines in various ways; it is most commonly found when converting
floating point values into integers.
On Pentium 4 systems the retired instruction performance
counter counts this instruction as two instructions (all other
known processors only count it as one).
This can affect results when using SimPoint on Pentium 4 systems.
We provide the fldcw count so that users can evaluate whether it
will impact their results enough to avoid using Pentium 4 machines
for their experiments. It would be possible to add an option to
this tool that mimics the double-counting so that the generated BBV
files would be usable for experiments using hardware performance
counters on Pentium 4 systems.
</para>

</sect1>

<sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support">
<title>Threaded Executable Support</title>

<para>
BBV supports threaded programs. When a program has multiple threads,
an additional basic block vector file is created for each thread (each
additional file is the specified filename with the thread number
appended at the end).
</para>

<para>
There is no official method of using SimPoint with
threaded workloads. The most common method is to run
SimPoint on each thread's results independently, and use
some method of deterministic execution to try to match the
original workload. This should be possible with the current
BBV.
</para>

</sect1>

<sect1 id="bbv-manual.validation" xreflabel="BBV Validation">
<title>Validation</title>

<para>
BBV has been tested on x86, amd64, and ppc32 platforms.
An earlier version of BBV was tested in detail using
hardware performance counters; this work is described in a paper
from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation
to Generate Multi-Platform SimPoints: Methodology and Accuracy" by
V.M. Weaver and S.A. McKee.
</para>

</sect1>

<sect1 id="bbv-manual.performance" xreflabel="BBV Performance">
<title>Performance</title>

<para>
Using this program slows down execution by roughly a factor of 40
over native execution. This varies depending on the machine
used and the benchmark being run.
On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D
processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
</para>

</sect1>

</chapter>