<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="bbv-manual" xreflabel="BBV">
  <title>BBV: an experimental basic block vector generation tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-bbv</option> on the Valgrind
command line.</para>

<sect1 id="bbv-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>
  A basic block is a linear section of code with one entry point and one exit
  point. A <emphasis>basic block vector</emphasis> (BBV) is a list of all
  basic blocks entered during program execution, and a count of how many
  times each basic block was run.
</para>

<para>
  BBV is a tool that generates basic block vectors for use with the
  <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint</ulink>
  analysis tool.
  The SimPoint methodology speeds up architectural
  simulations by running only a small portion of a program
  and then extrapolating total behavior from that
  portion. Most programs exhibit phase-based behavior:
  at various times during execution, a program enters
  intervals in which the code behaves similarly to an earlier
  interval. If you can detect these intervals and group them together,
  you can approximate total program behavior
  by simulating only a small number of representative intervals and
  then scaling the results.
</para>

<para>
  In computer architecture research, running a
  benchmark on a cycle-accurate simulator can cause slowdowns on the order
  of 1000 times, so that full benchmarks take days, weeks, or even longer
  to run. Using SimPoint can reduce this simulation time significantly,
  usually by 90-95%, while still retaining reasonable accuracy.
</para>

<para>
  A more complete introduction to how SimPoint works can be
  found in the paper "Automatically Characterizing Large Scale
  Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and
  B. Calder.
</para>

</sect1>

<sect1 id="bbv-manual.quickstart" xreflabel="Quick Start">
<title>Using Basic Block Vectors to create SimPoints</title>

<para>
  To quickly create a basic block vector file, you will call Valgrind
  like this:

  <programlisting>valgrind --tool=exp-bbv /bin/ls</programlisting>

  In this case we are running on <filename>/bin/ls</filename>,
  but this can be any program. By default a file called
  <computeroutput>bb.out.PID</computeroutput> will be created,
  where PID is replaced by the process ID of the running process.
  This file contains the basic block vector. For long-running programs
  this file can be quite large, so it might be wise to compress
  it with gzip or some other compression program.
</para>
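
<para>
  If you prefer to script the compression step, the following Python
  sketch is one way to do it. It is not part of BBV; the helper name
  <computeroutput>compress_bbv</computeroutput> is hypothetical, and running
  the gzip command by hand works just as well.
</para>

<programlisting><![CDATA[
import gzip
import shutil

def compress_bbv(path):
    """Gzip `path` into `path + '.gz'` and return the new file name.

    Hypothetical helper, not part of BBV. The original file is left in place.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream the data so large vector files are never held in memory.
        shutil.copyfileobj(src, dst)
    return gz_path]]></programlisting>

<para>
  For example, <computeroutput>compress_bbv("bb.out.1234")</computeroutput>
  would write <filename>bb.out.1234.gz</filename> next to the original file.
</para>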

<para>
  To create actual SimPoint results, you will need the SimPoint utility,
  available from the
  <ulink url="http://www.cse.ucsd.edu/~calder/simpoint/">SimPoint webpage</ulink>.
  Assuming you have downloaded SimPoint 3.2 and compiled it,
  create SimPoint results with a command like the following:

  <programlisting><![CDATA[
./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
      -loadFVFile bb.out.1234.gz \
      -k 5 -saveSimpoints results.simpts \
      -saveSimpointWeights results.weights]]></programlisting>

  where <filename>bb.out.1234.gz</filename> is your compressed basic block
  vector file generated by BBV.
</para>

<para>
  The SimPoint utility performs random linear projection down to
  15 dimensions, then uses k-means clustering to determine which
  intervals are of interest. In this example we specify 5 intervals
  with the <option>-k 5</option> option.
</para>

<para>
  The outputs from the SimPoint run are the
  <computeroutput>results.simpts</computeroutput>
  and <computeroutput>results.weights</computeroutput> files.
  The first holds the 5 most relevant intervals of the program.
  The second holds the weight by which to scale each interval when
  extrapolating full-program behavior. The intervals and the weights
  can be used in conjunction with a simulator that supports
  fast-forwarding: you fast-forward to the interval of interest,
  collect statistics for the desired interval length, then combine
  the gathered statistics with the weights to
  calculate your results.
</para>
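
<para>
  The final scaling step can be sketched in Python as follows. This is an
  illustration only: the function names are hypothetical, and the sketch
  assumes that each line of <computeroutput>results.simpts</computeroutput>
  holds an interval number followed by a cluster number, and that each line
  of <computeroutput>results.weights</computeroutput> holds a weight followed
  by a cluster number. Check the files produced by your SimPoint version
  before relying on that layout.
</para>

<programlisting><![CDATA[
def read_pairs(path):
    """Read 'value cluster' lines into a {cluster: value} dictionary.

    Assumed layout: two whitespace-separated fields per line, the second
    being the cluster number. Other lines are skipped.
    """
    table = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2:
                value, cluster = fields
                table[int(cluster)] = float(value)
    return table

def extrapolate(simpts, weights, stat_for_interval):
    """Weighted sum of a per-interval statistic (for example CPI).

    simpts:            {cluster: interval number}
    weights:           {cluster: weight}, weights summing to 1
    stat_for_interval: {interval number: statistic measured in the simulator}
    """
    total = 0.0
    for cluster, interval in simpts.items():
        total += weights[cluster] * stat_for_interval[int(interval)]
    return total]]></programlisting>

<para>
  With two clusters of weight 0.25 and 0.75 whose representative intervals
  show a CPI of 2.0 and 1.0 respectively, the estimate for the whole
  program is 0.25 * 2.0 + 0.75 * 1.0 = 1.25.
</para>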

</sect1>

<sect1 id="bbv-manual.usage" xreflabel="BBV Command-line Options">
<title>BBV Command-line Options</title>

<para> BBV-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="bbv.opts.list">

  <varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file">
    <term>
      <option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option>
    </term>
    <listitem>
      <para>
        This option selects the name of the basic block vector file. The
        <option>%p</option> and <option>%q</option> format specifiers can be
        used to embed the process ID and/or the contents of an environment
        variable in the name, as is the case for the core option
        <option><xref linkend="opt.log-file"/></option>.
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file">
    <term>
      <option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option>
    </term>
    <listitem>
      <para>
        This option selects the name of the PC file.
        This file holds program counter addresses
        and function name information for the various basic blocks.
        It can be used in conjunction
        with the basic block vector file to fast-forward via function names
        instead of just instruction counts. The
        <option>%p</option> and <option>%q</option> format specifiers can be
        used to embed the process ID and/or the contents of an environment
        variable in the name, as is the case for the core option
        <option><xref linkend="opt.log-file"/></option>.
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.interval-size" xreflabel="--interval-size">
    <term>
      <option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option>
    </term>
    <listitem>
      <para>
        This option selects the size of the interval to use.
        The default is 100
        million instructions, which is a commonly used value.
        Other sizes can be used; smaller intervals can help programs
        with finer-grained phases. However, smaller interval sizes
        can lead to accuracy issues due to warm-up effects.
        (When fast-forwarding, the various architectural features
        will be uninitialized, and it will take some number
        of instructions before they "warm up" to the state a
        full simulation would be in without the fast-forwarding.
        Larger interval sizes tend to mitigate this.)
      </para>
    </listitem>
  </varlistentry>

  <varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only">
    <term>
      <option><![CDATA[--instr-count-only [default: no] ]]></option>
    </term>
    <listitem>
      <para>
        This option tells the tool to only display instruction count
        totals, and to not generate the actual basic block vector file.
        This is useful for debugging, and for gathering instruction count
        info without generating the large basic block vector files.
      </para>
    </listitem>
  </varlistentry>


</variablelist>
<!-- end of xi:include in the manpage -->
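
<para>
  To illustrate how the file name templates behave, here is a simplified
  Python model of the expansion. It is not Valgrind's code, and the function
  name <computeroutput>expand_name</computeroutput> is hypothetical. It
  assumes the <computeroutput>%q{VAR}</computeroutput> syntax that the core
  <option>--log-file</option> option documents for environment variables,
  and it ignores any other escapes the real implementation may support.
</para>

<programlisting><![CDATA[
import re

def expand_name(template, pid, env):
    """Expand %p and %q{VAR} in an output file name template.

    Simplified model for illustration: %p becomes the process ID and
    %q{VAR} becomes the value of the environment variable VAR.
    """
    def replace(match):
        if match.group(0) == "%p":
            return str(pid)
        return env[match.group(1)]

    # A single pass, so text substituted from the environment is not
    # expanded a second time.
    return re.sub(r"%p|%q\{(\w+)\}", replace, template)]]></programlisting>

<para>
  Under these assumptions, the default template
  <computeroutput>bb.out.%p</computeroutput> expands to
  <computeroutput>bb.out.1234</computeroutput> for process 1234.
</para>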

</sect1>

<sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format">
<title>Basic Block Vector File Format</title>

<para>
  The basic block vector is dumped at fixed intervals. By default
  this happens every 100 million instructions; the
  <option>--interval-size</option> option can be
  used to change this.
</para>

<para>
  The output file looks like this:
</para>

<programlisting><![CDATA[
T:45:1024 :189:99343
T:11:78573 :15:1353 :56:1
T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting>

<para>
  Each new interval starts with a T. This is followed on the same line
  by a series of basic block and frequency pairs, one for each
  basic block that was entered during the interval. The format for
  each block/frequency pair is a colon, followed by a number that
  uniquely identifies the basic block, another colon, and then
  the frequency (which is the number of times the block was entered,
  multiplied by the number of instructions in the block). The
  pairs are separated from each other by a space.
</para>
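
<para>
  The format just described can be read with a few lines of Python. The
  function below is a minimal sketch for illustration, not code shipped with
  BBV or SimPoint, and its name is hypothetical.
</para>

<programlisting><![CDATA[
def parse_bbv_line(line):
    """Return {block id: frequency} for a 'T' line, or None otherwise."""
    if not line.startswith("T"):
        return None  # comments and other record types are skipped
    counts = {}
    # After the leading 'T', pairs look like ':id:frequency' and are
    # separated by spaces.
    for pair in line[1:].split():
        block_id, frequency = pair.lstrip(":").split(":")
        counts[int(block_id)] = int(frequency)
    return counts]]></programlisting>

<para>
  For the line <computeroutput>T:45:1024 :189:99343</computeroutput> this
  yields a mapping from block 45 to 1024 and from block 189 to 99343.
</para>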

<para>
  The frequency count is multiplied by the number of instructions that are
  in the basic block, in order to weight the count so that instructions in
  small basic blocks aren't counted as more important than instructions
  in large basic blocks.
</para>
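
<para>
  A small worked example of this weighting, written as a Python sketch with
  a hypothetical function name and made-up block sizes:
</para>

<programlisting><![CDATA[
def weighted_frequency(times_entered, instructions_in_block):
    """Frequency as written to the file: entries times block size."""
    return times_entered * instructions_in_block

# A 3-instruction block entered 1000 times and a 30-instruction block
# entered 100 times both account for 3000 executed instructions, so they
# receive the same frequency even though their entry counts differ.
small = weighted_frequency(1000, 3)
large = weighted_frequency(100, 30)]]></programlisting>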

<para>
  The SimPoint program only processes lines that start with a "T". All
  other lines are ignored. Traditionally comments are indicated by
  starting a line with a "#" character. Some other BBV generation tools,
  such as PinPoints, generate lines beginning with letters other than "T"
  to indicate more information about the program being run. We do
  not generate these, as the SimPoint utility ignores them.
</para>

</sect1>

<sect1 id="bbv-manual.implementation" xreflabel="Implementation">
<title>Implementation</title>

<para>
  Valgrind provides all of the information necessary to create
  BBV files. In the current implementation, all instructions
  are instrumented. This is slower (by approximately a factor
  of two) than a method that instruments at the basic block level,
  but there are some complications (especially with rep prefix
  detection) that make that method more difficult.
</para>

<para>
  Valgrind actually provides instrumentation at a superblock level.
  A superblock has one entry point but, unlike a basic block, can
  have multiple exit points. Once a branch occurs into the middle
  of a block, it is split into a new basic block. Because
  Valgrind cannot produce "true" basic blocks, the generated
  basic block vectors will differ from those generated by other tools.
  In practice this does not seem to affect the accuracy of the
  SimPoint results. We do internally force the
  <option>--vex-guest-chase-thresh=0</option>
  option to Valgrind, which forces a more basic-block-like
  behavior.
</para>

<para>
  When a superblock is run for the first time, it is instrumented
  with our BBV routine. A block info (bbInfo) structure is allocated
  which holds the various information and statistics for the block.
  A unique block ID is assigned to the block, and then the
  structure is placed into an ordered set.
  Then each native instruction in the block is instrumented to
  call an instruction counting routine with a pointer to the block
  info structure as an argument.
</para>

<para>
  At run-time, our instruction counting routines are called once
  per native instruction. The relevant block info structure is accessed
  and the block count and total instruction count are updated.
  If the total instruction count exceeds the interval size,
  we walk the ordered set, writing out the statistics for
  any block that was accessed in the interval, and then reset the
  block counters to zero.
</para>
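
<para>
  The counting scheme can be modelled in a few lines of Python. This is a
  simplified sketch for illustration, not the tool's C implementation: the
  class and method names are hypothetical, a dictionary sorted by block ID
  stands in for the ordered set, and how the total is carried over into the
  next interval is an assumption.
</para>

<programlisting><![CDATA[
class BlockInfo:
    """Hypothetical stand-in for the per-block info structure."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.count = 0  # instructions executed in this block this interval

class IntervalCounter:
    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.total = 0    # instructions executed in the current interval
        self.blocks = {}  # block id -> BlockInfo
        self.lines = []   # 'T' lines emitted so far

    def block(self, block_id):
        """Look up a block, allocating its info structure on first use."""
        if block_id not in self.blocks:
            self.blocks[block_id] = BlockInfo(block_id)
        return self.blocks[block_id]

    def on_instruction(self, info):
        """Called once per executed native instruction."""
        info.count += 1
        self.total += 1
        if self.total > self.interval_size:
            self.dump()

    def dump(self):
        """Write one interval, then reset the per-block counters."""
        pairs = [":%d:%d" % (b.block_id, b.count)
                 for _, b in sorted(self.blocks.items()) if b.count]
        self.lines.append("T" + " ".join(pairs))
        for b in self.blocks.values():
            b.count = 0
        self.total -= self.interval_size  # assumed carry-over behaviour]]></programlisting>

<para>
  Because the counter is incremented once per instruction, a block of three
  instructions that is entered twice ends up with a count of six, which
  matches the weighted frequency written to the output file.
</para>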

<para>
  On the x86 and amd64 architectures the counting code contains extra
  logic to handle rep-prefixed string instructions. This is because
  actual hardware counts a rep-prefixed instruction
  as one instruction, while a naive Valgrind implementation
  would count it as many (possibly hundreds, thousands or even millions)
  of instructions. We handle rep-prefixed instructions specially,
  in order to make the results match those obtained with hardware performance
  counters.
</para>
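
<para>
  The difference between the two counting styles can be shown with a small
  Python model. It is an illustration under simplifying assumptions, not
  the tool's actual logic: the trace format and function name are
  hypothetical, and consecutive iterations of a rep-prefixed instruction at
  the same address are simply collapsed into one counted instruction.
</para>

<programlisting><![CDATA[
def count_instructions(trace):
    """Return (naive count, hardware-like count) for a trace.

    trace: iterable of (address, is_rep) pairs, one per executed
    iteration, so a rep-prefixed instruction that repeats 1000 times
    appears 1000 times in a row.
    """
    naive = 0
    hardware_like = 0
    previous = None
    for address, is_rep in trace:
        naive += 1
        # Only the first iteration of a run of the same rep-prefixed
        # instruction is counted.
        if not (is_rep and previous == (address, is_rep)):
            hardware_like += 1
        previous = (address, is_rep)
    return naive, hardware_like

# One ordinary instruction, a rep-prefixed instruction that iterates
# 1000 times, then another ordinary instruction.
trace = [(0x10, False)] + [(0x14, True)] * 1000 + [(0x16, False)]
naive, hardware_like = count_instructions(trace)]]></programlisting>

<para>
  Here the naive count is 1002 while the hardware-like count is 3. A real
  implementation also has to distinguish two separate executions of the
  same rep-prefixed instruction that happen back to back, which this model
  does not attempt.
</para>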

<para>
  BBV also counts the fldcw instruction. This instruction is used on
  x86 machines in various ways; it is most commonly found when converting
  floating point values into integers.
  On Pentium 4 systems the retired instruction performance
  counter counts this instruction as two instructions (all other
  known processors only count it as one).
  This can affect results when using SimPoint on Pentium 4 systems.
  We provide the fldcw count so that users can evaluate whether it
  will impact their results enough to avoid using Pentium 4 machines
  for their experiments. It would be possible to add an option to
  this tool that mimics the double-counting so that the generated BBV
  files would be usable for experiments using hardware performance
  counters on Pentium 4 systems.
</para>

</sect1>

<sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support">
<title>Threaded Executable Support</title>

<para>
  BBV supports threaded programs. When a program has multiple threads,
  an additional basic block vector file is created for each thread (each
  additional file is the specified filename with the thread number
  appended at the end).
</para>

<para>
  There is no official method of using SimPoint with
  threaded workloads. The most common method is to run
  SimPoint on each thread's results independently, and use
  some method of deterministic execution to try to match the
  original workload. This should be possible with the current
  BBV.
</para>

</sect1>

<sect1 id="bbv-manual.validation" xreflabel="BBV Validation">
<title>Validation</title>

<para>
  BBV has been tested on x86, amd64, and ppc32 platforms.
  An earlier version of BBV was tested in detail using
  hardware performance counters; this work is described in a paper
  from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation
  to Generate Multi-Platform SimPoints: Methodology and Accuracy" by
  V.M. Weaver and S.A. McKee.
</para>

</sect1>

<sect1 id="bbv-manual.performance" xreflabel="BBV Performance">
<title>Performance</title>

<para>
  Using this program slows down execution by roughly a factor of 40
  over native execution. This varies depending on the machine
  used and the benchmark being run.
  On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D
  processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
</para>

</sect1>

</chapter>