Blame - cachegrind/docs/cg_main.html - fp2-dev/platform/external/valgrind

2002-11-11 00:20:07 +0000

[diff] [blame]

1

<html>

2

<head>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

3

<title>Cachegrind: a cache-miss profiler</title>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

4

</head>

5

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

6

<body>

7

8

<h2>4  <b>Cachegrind</b>: a cache-miss profiler</h2>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

9

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

10

To use this tool, you must specify <code>--tool=cachegrind</code>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

11

on the Valgrind command line.

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

12

13

<p>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

14

Detailed technical documentation on how Cachegrind works is available

15

<A HREF="cg_techdocs.html">here</A>. If you want to know how

16

to <b>use</b> it, you only need to read this page.

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

20

<h3>4.1  Cache profiling</h3>

21

Cachegrind is a tool for doing cache simulations and annotating your source

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

22

line-by-line with the number of cache misses. In particular, it records:

23

<ul>

24

<li>L1 instruction cache reads and misses;

25

<li>L1 data cache reads and read misses, writes and write misses;

26

<li>L2 unified cache reads and read misses, writes and writes misses.

27

</ul>

28

On a modern x86 machine, an L1 miss will typically cost around 10 cycles,

29

and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be

30

very useful for improving the performance of your program.<p>

31

32

Also, since one instruction cache read is performed per instruction executed,

33

you can find out how many instructions are executed per line, which can be

34

useful for traditional profiling and test coverage.<p>

35

36

Any feedback, bug-fixes, suggestions, etc, welcome.

37

38

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

39

<h3>4.2  Overview</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

40

First off, as for normal Valgrind use, you probably want to compile with

41

debugging info (the <code>-g</code> flag). But by contrast with normal

42

Valgrind use, you probably <b>do</b> want to turn optimisation on, since you

43

should profile your program as it will be normally run.

44

45

The two steps are:

46

<ol>

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

47

<li>Run your program with <code>valgrind --tool=cachegrind</code> in front of

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

48

the normal command line invocation. When the program finishes,

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

49

Cachegrind will print summary cache statistics. It also collects

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

50

line-by-line information in a file

51

<code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>

52

is the program's process id.

53

<p>

54

This step should be done every time you want to collect

55

information about a new program, a changed program, or about the

56

same program with different input.

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

57

</li><p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

58

<li>Generate a function-by-function summary, and possibly annotate

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

59

source files, using the supplied

60

<code>cg_annotate</code> program. Source files to annotate can be

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

61

specified manually, or manually on the command line, or

62

"interesting" source files can be annotated automatically with

63

the <code>--auto=yes</code> option. You can annotate C/C++

64

files or assembly language files equally easily.

65

<p>

66

This step can be performed as many times as you like for each

67

Step 2. You may want to do multiple annotations showing

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

68

different information each time.

69

</li><p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

70

</ol>

71

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

72

The steps are described in detail in the following sections.

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

73

74

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

75

<h3>4.3  Cache simulation specifics</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

76

77

Cachegrind uses a simulation for a machine with a split L1 cache and a unified

78

L2 cache. This configuration is used for all (modern) x86-based machines we

79

are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are

80

ancient history now.<p>

81

82

The more specific characteristics of the simulation are as follows.

83

84

<ul>

85

<li>Write-allocate: when a write miss occurs, the block written to

86

is brought into the D1 cache. Most modern caches have this

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

87

property.<p>

88

</li>

89

<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

90

<li>Bit-selection hash function: the line(s) in the cache to which a

91

memory block maps is chosen by the middle bits M--(M+N-1) of the

92

byte address, where:

93

<ul>

94

<li> line size = 2^M bytes </li>

95

<li>(cache size / line size) = 2^N bytes</li>

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

96

</ul>

97

</li>

98

<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

99

<li>Inclusive L2 cache: the L2 cache replicates all the entries of

100

the L1 cache. This is standard on Pentium chips, but AMD

101

Athlons use an exclusive L2 cache that only holds blocks evicted

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

102

from L1. Ditto AMD Durons and most modern VIAs.</li>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

103

</ul>

104

105

The cache configuration simulated (cache size, associativity and line size) is

106

determined automagically using the CPUID instruction. If you have an old

107

machine that (a) doesn't support the CPUID instruction, or (b) supports it in

108

an early incarnation that doesn't give any cache information, then Cachegrind

109

will fall back to using a default configuration (that of a model 3/4 Athlon).

110

Cachegrind will tell you if this happens. You can manually specify one, two or

111

all three levels (I1/D1/L2) of the cache from the command line using the

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

112

<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

113

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

114

<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

115

Other noteworthy behaviour:

116

117

<ul>

118

<li>References that straddle two cache lines are treated as follows:

119

<ul>

120

<li>If both blocks hit --> counted as one hit</li>

121

<li>If one block hits, the other misses --> counted as one miss</li>

122

<li>If both blocks miss --> counted as one miss (not two)</li>

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

123

</ul>

124

</li>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

125

126

<li>Instructions that modify a memory location (eg. <code>inc</code> and

127

<code>dec</code>) are counted as doing just a read, ie. a single data

128

reference. This may seem strange, but since the write can never cause a

129

miss (the read guarantees the block is in the cache) it's not very

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

130

interesting.

131

<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

132

Thus it measures not the number of times the data cache is accessed, but

133

the number of times a data cache miss could occur.<p>

</li>

</ul>

If you are interested in simulating a cache with different properties, it is

138

not particularly hard to write your own cache simulator, or to modify the

139

existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,

140

<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be

141

interested to hear from anyone who does.

142

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

143

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

144

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

145

<h3>4.4  Profiling programs</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

146

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

147

To gather cache profiling information about the program <code>ls -l</code>,

148

invoke Cachegrind like this:

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

149

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

150

<blockquote><code>valgrind --tool=cachegrind ls -l</code></blockquote>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

151

152

The program will execute (slowly). Upon completion, summary statistics

153

that look like this will be printed:

154

155

<pre>

156

==31751== I refs: 27,742,716

157

==31751== I1 misses: 276

158

==31751== L2 misses: 275

159

==31751== I1 miss rate: 0.0%

160

==31751== L2i miss rate: 0.0%

161

==31751==

162

==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)

163

==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)

164

==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)

165

==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)

166

==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)

167

==31751==

168

==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)

169

==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)

170

</pre>

171

172

Cache accesses for instruction fetches are summarised first, giving the

173

number of fetches made (this is the number of instructions executed, which

174

can be useful to know in its own right), the number of I1 misses, and the

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

175

number of L2 instruction (<code>L2i</code>) misses.

176

<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

177

Cache accesses for data follow. The information is similar to that of the

178

instruction fetches, except that the values are also shown split between reads

179

and writes (note each row's <code>rd</code> and <code>wr</code> values add up

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

180

to the row's total).

181

<p>

182

Combined instruction and data figures for the L2 cache follow that.

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

183

184

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

185

<h3>4.5  Output file</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

186

187

As well as printing summary information, Cachegrind also writes

188

line-by-line cache profiling information to a file named

189

<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is

190

best interpreted by the accompanying program <code>cg_annotate</code>,

191

described in the next section.

192

<p>

193

Things to note about the <code>cachegrind.out.<i>pid</i></code> file:

194

<ul>

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

195

<li>It is written every time Cachegrind

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

196

is run, and will overwrite any existing

197

<code>cachegrind.out.<i>pid</i></code> in the current directory (but

198

that won't happen very often because it takes some time for process ids

njn

2003-04-08 11:08:45 +0000

[diff] [blame]

199

to be recycled).</li><p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

200

<li>It can be huge: <code>ls -l</code> generates a file of about

201

350KB. Browsing a few files and web pages with a Konqueror

202

built with full debugging information generates a file

203

of around 15 MB.</li>

204

</ul>

205

206

Note that older versions of Cachegrind used a log file named

207

<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

208

The suffix serves two purposes. Firstly, it means you don't have to

209

rename old log files that you don't want to overwrite. Secondly, and

210

more importantly, it allows correct profiling with the

211

<code>--trace-children=yes</code> option of programs that spawn child

212

processes.

213

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

214

215

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

216

<h3>4.6  Cachegrind options</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

217

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

218

Cache-simulation specific options are:

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

<ul>

[default: uses CPUID for automagic cache configuration]<p>

225

226

Manually specifies the I1/D1/L2 cache configuration, where

227

<code>size</code> and <code>line_size</code> are measured in bytes. The

228

three items must be comma-separated, but with no spaces, eg:

229

230

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

231

<code>valgrind --tool=cachegrind --I1=65535,2,64</code>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

232

</blockquote>

233

234

You can specify one, two or three of the I1/D1/L2 caches. Any level not

235

manually specified will be simulated using the configuration found in the

236

normal way (via the CPUID instruction, or failing that, via defaults).

</ul>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

241

<h3>4.7  Annotating C/C++ programs</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

242

243

Before using <code>cg_annotate</code>, it is worth widening your

244

window to be at least 120-characters wide if possible, as the output

245

lines can be quite long.

246

<p>

247

To get a function-by-function summary, run <code>cg_annotate

248

--<i>pid</i></code> in a directory containing a

249

<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>

250

is required so that <code>cg_annotate</code> knows which log file to use when

251

several are present.

252

<p>

253

The output looks like this:

254

255

<pre>

256

--------------------------------------------------------------------------------

257

I1 cache: 65536 B, 64 B, 2-way associative

258

D1 cache: 65536 B, 64 B, 2-way associative

259

L2 cache: 262144 B, 64 B, 8-way associative

260

Command: concord vg_to_ucode.c

261

Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

262

Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

263

Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

264

Threshold: 99%

265

Chosen for annotation:

266

Auto-annotation: on

267

268

--------------------------------------------------------------------------------

269

Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

270

--------------------------------------------------------------------------------

271

27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS

272

273

--------------------------------------------------------------------------------

274

Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function

275

--------------------------------------------------------------------------------

276

8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc

277

5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word

278

2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp

279

2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash

280

2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower

281

1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert

282

897,991 51 51 897,831 95 30 62 1 1 ???:???

283

598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile

284

598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile

285

598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc

286

446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing

287

341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER

288

320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table

289

298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create

290

149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0

291

149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0

292

95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node

293

85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue

294

</pre>

295

296

First up is a summary of the annotation options:

297

298

<ul>

299

<li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the

300

configuration with which these results were obtained.</li><p>

301

302

<li>Command: the command line invocation of the program under

303

examination.</li><p>

304

305

<li>Events recorded: event abbreviations are:<p>

306

<ul>

307

<li><code>Ir </code>: I cache reads (ie. instructions executed)</li>

308

<li><code>I1mr</code>: I1 cache read misses</li>

309

<li><code>I2mr</code>: L2 cache instruction read misses</li>

310

<li><code>Dr </code>: D cache reads (ie. memory reads)</li>

311

<li><code>D1mr</code>: D1 cache read misses</li>

312

<li><code>D2mr</code>: L2 cache data read misses</li>

313

<li><code>Dw </code>: D cache writes (ie. memory writes)</li>

314

<li><code>D1mw</code>: D1 cache write misses</li>

315

<li><code>D2mw</code>: L2 cache data write misses</li>

316

</ul><p>

317

Note that D1 total accesses is given by <code>D1mr</code> +

318

<code>D1mw</code>, and that L2 total accesses is given by

319

320

321

<li>Events shown: the events shown (a subset of events gathered). This can

322

be adjusted with the <code>--show</code> option.</li><p>

323

324

<li>Event sort order: the sort order in which functions are shown. For

325

example, in this case the functions are sorted from highest

326

<code>Ir</code> counts to lowest. If two functions have identical

327

<code>Ir</code> counts, they will then be sorted by <code>I1mr</code>

328

counts, and so on. This order can be adjusted with the

329

<code>--sort</code> option.<p>

330

331

Note that this dictates the order the functions appear. It is <b>not</b>

332

the order in which the columns appear; that is dictated by the "events

333

shown" line (and can be changed with the <code>--show</code> option).

334

</li><p>

335

336

<li>Threshold: <code>cg_annotate</code> by default omits functions

337

that cause very low numbers of misses to avoid drowning you in

338

information. In this case, cg_annotate shows summaries the

339

functions that account for 99% of the <code>Ir</code> counts;

340

<code>Ir</code> is chosen as the threshold event since it is the

341

primary sort event. The threshold can be adjusted with the

342

<code>--threshold</code> option.</li><p>

343

344

<li>Chosen for annotation: names of files specified manually for annotation;

345

in this case none.</li><p>

346

347

<li>Auto-annotation: whether auto-annotation was requested via the

348

<code>--auto=yes</code> option. In this case no.</li><p>

349

</ul>

350

351

Then follows summary statistics for the whole program. These are similar

nethercote

2003-11-14 17:47:54 +0000

[diff] [blame]

352

to the summary provided when running <code>valgrind --tool=cachegrind</code>.<p>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

353

354

Then follows function-by-function statistics. Each function is

355

identified by a <code>file_name:function_name</code> pair. If a column

356

contains only a dot it means the function never performs

357

that event (eg. the third row shows that <code>strcmp()</code>

358

contains no instructions that write to memory). The name

359

<code>???</code> is used if the the file name and/or function name

360

could not be determined from debugging information. If most of the

361

entries have the form <code>???:???</code> the program probably wasn't

362

compiled with <code>-g</code>. If any code was invalidated (either due to

363

self-modifying code or unloading of shared objects) its counts are aggregated

364

into a single cost centre written as <code>(discarded):(discarded)</code>.<p>

365

366

It is worth noting that functions will come from three types of source files:

367

<ol>

368

<li> From the profiled program (<code>concord.c</code> in this example).</li>

369

<li>From libraries (eg. <code>getc.c</code>)</li>

370

<li>From Valgrind's implementation of some libc functions (eg.

371

<code>vg_clientmalloc.c:malloc</code>). These are recognisable because

372

the filename begins with <code>vg_</code>, and is probably one of

373

<code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or

374

<code>vg_mylibc.c</code>.

</li>

</ol>

There are two ways to annotate source files -- by choosing them

379

manually, or with the <code>--auto=yes</code> option. To do it

380

manually, just specify the filenames as arguments to

381

<code>cg_annotate</code>. For example, the output from running

382

<code>cg_annotate concord.c</code> for our example produces the same

383

output as above followed by an annotated version of

384

<code>concord.c</code>, a section of which looks like:

385

386

<pre>

387

--------------------------------------------------------------------------------

388

-- User-annotated source: concord.c

389

--------------------------------------------------------------------------------

390

Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw

[snip]

. . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])

395

3 1 1 . . . 1 0 0 {

396

. . . . . . . . . FILE *file_ptr;

397

. . . . . . . . . Word_Info *data;

398

1 0 0 . . . 1 1 1 int line = 1, i;

399

. . . . . . . . .

400

5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));

401

. . . . . . . . .

402

4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)

403

3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;

404

. . . . . . . . .

405

. . . . . . . . . /* Open file, check it. */

406

6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");

407

2 0 0 1 0 0 . . . if (!(file_ptr)) {

408

. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);

409

1 1 1 . . . . . . exit(EXIT_FAILURE);

410

. . . . . . . . . }

411

. . . . . . . . .

412

165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)

413

146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table);

414

. . . . . . . . .

415

4 0 0 1 0 0 2 0 0 free(data);

416

4 0 0 1 0 0 2 0 0 fclose(file_ptr);

3 0 0 2 0 0 . . . }

</pre>

(Although column widths are automatically minimised, a wide terminal is clearly

421

useful.)<p>

422

423

Each source file is clearly marked (<code>User-annotated source</code>) as

424

having been chosen manually for annotation. If the file was found in one of

425

the directories specified with the <code>-I</code>/<code>--include</code>

426

option, the directory and file are both given.<p>

427

428

Each line is annotated with its event counts. Events not applicable for a line

429

are represented by a `.'; this is useful for distinguishing between an event

430

which cannot happen, and one which can but did not.<p>

431

432

Sometimes only a small section of a source file is executed. To minimise

433

uninteresting output, Valgrind only shows annotated lines and lines within a

434

small distance of annotated lines. Gaps are marked with the line numbers so

435

you know which part of a file the shown code comes from, eg:

436

437

<pre>

438

(figures and code for line 704)

439

-- line 704 ----------------------------------------

440

-- line 878 ----------------------------------------

441

(figures and code for line 878)

442

</pre>

443

444

The amount of context to show around annotated lines is controlled by the

445

<code>--context</code> option.<p>

446

447

To get automatic annotation, run <code>cg_annotate --auto=yes</code>.

448

cg_annotate will automatically annotate every source file it can find that is

449

mentioned in the function-by-function summary. Therefore, the files chosen for

450

auto-annotation are affected by the <code>--sort</code> and

451

<code>--threshold</code> options. Each source file is clearly marked

452

(<code>Auto-annotated source</code>) as being chosen automatically. Any files

453

that could not be found are mentioned at the end of the output, eg:

454

455

<pre>

456

--------------------------------------------------------------------------------

457

The following files chosen for auto-annotation could not be found:

458

--------------------------------------------------------------------------------

459

getc.c

460

ctype.c

461

../sysdeps/generic/lockfile.c

462

</pre>

463

464

This is quite common for library files, since libraries are usually compiled

465

with debugging information, but the source files are often not present on a

466

system. If a file is chosen for annotation <b>both</b> manually and

467

automatically, it is marked as <code>User-annotated source</code>.

468

469

Use the <code>-I/--include</code> option to tell Valgrind where to look for

470

source files if the filenames found from the debugging information aren't

471

specific enough.

472

473

Beware that cg_annotate can take some time to digest large

474

<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also

475

beware that auto-annotation can produce a lot of output if your program is

large!

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

479

<h3>4.8  Annotating assembler programs</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

480

481

Valgrind can annotate assembler programs too, or annotate the

482

assembler generated for your C program. Sometimes this is useful for

483

understanding what is really happening when an interesting line of C

484

code is translated into multiple instructions.<p>

485

486

To do this, you just need to assemble your <code>.s</code> files with

487

assembler-level debug information. gcc doesn't do this, but you can

488

use the GNU assembler with the <code>--gstabs</code> option to

489

generate object files with this information, eg:

490

491

<blockquote><code>as --gstabs foo.s</code></blockquote>

492

493

You can then profile and annotate source files in the same way as for C/C++

programs.

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

497

<h3>4.9  <code>cg_annotate</code> options</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

<ul>

Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.

502

Not actually an option -- it is required.

503

504

505

<li><code>-v, --version</code><p>

506

507

Help and version, as usual.</li>

508

509

<li><code>--sort=A,B,C</code> [default: order in

510

<code>cachegrind.out.<i>pid</i></code>]<p>

511

Specifies the events upon which the sorting of the function-by-function

512

entries will be based. Useful if you want to concentrate on eg. I cache

513

misses (<code>--sort=I1mr,I2mr</code>), or D cache misses

514

(<code>--sort=D1mr,D2mr</code>), or L2 misses

515

(<code>--sort=D2mr,I2mr</code>).</li><p>

516

517

<li><code>--show=A,B,C</code> [default: all, using order in

518

<code>cachegrind.out.<i>pid</i></code>]<p>

519

Specifies which events to show (and the column order). Default is to use

520

all present in the <code>cachegrind.out.<i>pid</i></code> file (and use

521

the order in the file).</li><p>

522

523

<li><code>--threshold=X</code> [default: 99%] <p>

524

Sets the threshold for the function-by-function summary. Functions are

525

shown that account for more than X% of the primary sort event. If

526

auto-annotating, also affects which files are annotated.

527

528

Note: thresholds can be set for more than one of the events by appending

529

any events for the <code>--sort</code> option with a colon and a number

530

(no spaces, though). E.g. if you want to see the functions that cover

531

99% of L2 read misses and 99% of L2 write misses, use this option:

</li><p>

<li><code>--auto=no</code> [default]<br>

537

538

When enabled, automatically annotates every file that is mentioned in the

539

function-by-function summary that can be found. Also gives a list of

540

those that couldn't be found.

541

542

<li><code>--context=N</code> [default: 8]<p>

543

Print N lines of context before and after each annotated line. Avoids

544

printing large sections of source files that were not executed. Use a

545

large number (eg. 10,000) to show all source lines.

546

</li><p>

547

548

<li><code>-I=<dir>, --include=<dir></code>

549

[default: empty string]<p>

550

Adds a directory to the list in which to search for files. Multiple

551

-I/--include options can be given to add multiple directories.

</ul>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

555

<h3>4.10  Warnings</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

556

There are a couple of situations in which cg_annotate issues warnings.

557

558

<ul>

559

<li>If a source file is more recent than the

560

<code>cachegrind.out.<i>pid</i></code> file. This is because the

561

information in <code>cachegrind.out.<i>pid</i></code> is only recorded

562

with line numbers, so if the line numbers change at all in the source

563

(eg. lines added, deleted, swapped), any annotations will be

564

incorrect.<p>

565

566

<li>If information is recorded about line numbers past the end of a file.

567

This can be caused by the above problem, ie. shortening the source file

568

while using an old <code>cachegrind.out.<i>pid</i></code> file. If this

569

happens, the figures for the bogus lines are printed anyway (clearly

570

marked as bogus) in case they are important.</li><p>

</ul>

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

574

<h3>4.11  Things to watch out for</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

575

Some odd things that can occur during annotation:

576

577

<ul>

578

<li>If annotating at the assembler level, you might see something like this:

579

580

<pre>

581

1 0 0 . . . . . . leal -12(%ebp),%eax

582

1 0 0 . . . 1 0 0 movl %eax,84(%ebx)

583

2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)

584

. . . . . . . . . .align 4,0x90

585

1 0 0 . . . . . . movl $.LnrB,%eax

586

1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)

587

</pre>

588

589

How can the third instruction be executed twice when the others are

590

executed only once? As it turns out, it isn't. Here's a dump of the

591

executable, using <code>objdump -d</code>:

592

593

<pre>

594

8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax

595

8048f28: 89 43 54 mov %eax,0x54(%ebx)

596

8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)

597

8048f32: 89 f6 mov %esi,%esi

598

8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax

599

8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)

600

</pre>

601

602

Notice the extra <code>mov %esi,%esi</code> instruction. Where did this

603

come from? The GNU assembler inserted it to serve as the two bytes of

604

padding needed to align the <code>movl $.LnrB,%eax</code> instruction on

605

a four-byte boundary, but pretended it didn't exist when adding debug

606

information. Thus when Valgrind reads the debug info it thinks that the

607

<code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address

608

range 0x8048f2b--0x804833 by itself, and attributes the counts for the

</li>

<li>Inlined functions can cause strange results in the function-by-function

613

summary. If a function <code>inline_me()</code> is defined in

614

<code>foo.h</code> and inlined in the functions <code>f1()</code>,

615

<code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will

616

not be a <code>foo.h:inline_me()</code> function entry. Instead, there

617

will be separate function entries for each inlining site, ie.

618

<code>foo.h:f1()</code>, <code>foo.h:f2()</code> and

619

<code>foo.h:f3()</code>. To find the total counts for

620

<code>foo.h:inline_me()</code>, add up the counts from each entry.<p>

621

622

The reason for this is that although the debug info output by gcc

623

indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it

624

doesn't indicate the name of the function in <code>foo.h</code>, so

625

Valgrind keeps using the old one.<p>

626

627

<li>Sometimes, the same filename might be represented with a relative name

628

and with an absolute name in different parts of the debug info, eg:

629

<code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this

630

case, if you use auto-annotation, the file will be annotated twice with

631

the counts split between the two.<p>

632

</li>

633

634

<li>Files with more than 65,535 lines cause difficulties for the stabs debug

635

info reader. This is because the line number in the <code>struct

636

nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit

637

value. Valgrind can handle some files with more than 65,535 lines

638

correctly by making some guesses to identify line number overflows. But

639

some cases are beyond it, in which case you'll get a warning message

640

explaining that annotations for the file might be incorrect.<p>

641

</li>

642

643

<li>If you compile some files with <code>-g</code> and some without, some

644

events that take place in a file without debug info could be attributed

645

to the last line of a file with debug info (whichever one gets placed

646

before the non-debug-info file in the executable).<p>

</li>

</ul>

This list looks long, but these cases should be fairly rare.<p>

651

652

Note: stabs is not an easy format to read. If you come across bizarre

653

annotations that look like might be caused by a bug in the stabs reader,

654

please let us know.<p>

655

656

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

657

<h3>4.12  Accuracy</h3>

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

658

Valgrind's cache profiling has a number of shortcomings:

659

660

<ul>

661

<li>It doesn't account for kernel activity -- the effect of system calls on

662

the cache contents is ignored.</li><p>

663

664

<li>It doesn't account for other process activity (although this is probably

665

desirable when considering a single program).</li><p>

666

667

<li>It doesn't account for virtual-to-physical address mappings; hence the

668

entire simulation is not a true representation of what's happening in the

669

cache.</li><p>

670

671

<li>It doesn't account for cache misses not visible at the instruction level,

672

eg. those arising from TLB misses, or speculative execution.</li><p>

673

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

674

<li>Valgrind's custom threads implementation will schedule threads

njn

3e88418

2003-04-15 13:03:23 +0000

[diff] [blame]

675

differently to the standard one. This could warp the results for

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

threaded programs.

</li><p>

<li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>

680

will incorrectly be counted as doing a data read if both the arguments

are registers, eg:

This should only happen rarely.

686

</li><p>

687

688

<li>FPU instructions with data sizes of 28 and 108 bytes (e.g.

689

<code>fsave</code>) are treated as though they only access 16 bytes.

690

These instructions seem to be rare so hopefully this won't affect

accuracy much.

</li><p>

</ul>

Another thing worth nothing is that results are very sensitive. Changing the

696

size of the <code>valgrind.so</code> file, the size of the program being

697

profiled, or even the length of its name can perturb the results. Variations

698

will be small, but don't expect perfectly repeatable results if your program

699

changes at all.<p>

700

701

While these factors mean you shouldn't trust the results to be super-accurate,

702

hopefully they should be close enough to be useful.<p>

703

704

sewardj

2002-11-18 00:07:28 +0000

[diff] [blame]

705

sewardj

2002-11-11 00:20:07 +0000

[diff] [blame]

706

<ul>

707

<li>Program start-up/shut-down calls a lot of functions that aren't

708

interesting and just complicate the output. Would be nice to exclude

709

these somehow.</li>

710

<p>

711

</ul>

sewardj