Blame - cachegrind/docs/cg_main.html - fp2-dev/platform/external/valgrind

sewardj

a9a2dcf

2002-11-11 00:20:07 +0000

[diff] [blame]

<html>

<head>

sewardj

f555ac7

2002-11-18 00:07:28 +0000

[diff] [blame]


<title>Cachegrind: a cache-miss profiler</title>

</head>

<body>

<h2>4  <b>Cachegrind</b>: a cache-miss profiler</h2>

To use this skin, you must specify <code>--skin=cachegrind</code>
on the Valgrind command line.

<p>
Detailed technical documentation on how Cachegrind works is available
<A HREF="cg_techdocs.html">here</A>. If you want to know how
to <b>use</b> it, you only need to read this page.


<h3>4.1  Cache profiling</h3>

Cachegrind is a tool for doing cache simulations and annotating your source
line-by-line with the number of cache misses. In particular, it records:
<ul>
  <li>L1 instruction cache reads and misses;
  <li>L1 data cache reads and read misses, writes and write misses;
  <li>L2 unified cache reads and read misses, writes and write misses.
</ul>
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
very useful for improving the performance of your program.<p>

Also, since one instruction cache read is performed per instruction executed,
you can find out how many instructions are executed per line, which can be
useful for traditional profiling and test coverage.<p>

Any feedback, bug-fixes, suggestions, etc, welcome.


<h3>4.2  Overview</h3>

First off, as for normal Valgrind use, you probably want to compile with
debugging info (the <code>-g</code> flag). But by contrast with normal
Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
should profile your program as it will be normally run.

The two steps are:
<ol>
  <li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
      the normal command line invocation. When the program finishes,
      Cachegrind will print summary cache statistics. It also collects
      line-by-line information in a file
      <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
      is the program's process id.
      <p>
      This step should be done every time you want to collect
      information about a new program, a changed program, or about the
      same program with different input.
  </li>
  <p>
  <li>Generate a function-by-function summary, and possibly annotate
      source files, using the supplied
      <code>cg_annotate</code> program. Source files to annotate can be
      specified manually on the command line, or
      "interesting" source files can be annotated automatically with
      the <code>--auto=yes</code> option. You can annotate C/C++
      files or assembly language files equally easily.
      <p>
      This step can be performed as many times as you like for each
      Step 1. You may want to do multiple annotations showing
      different information each time.<p>
  </li>
</ol>

The steps are described in detail in the following sections.<p>


<h3>4.3  Cache simulation specifics</h3>


Cachegrind uses a simulation for a machine with a split L1 cache and a unified
L2 cache. This configuration is used for all (modern) x86-based machines we
are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
ancient history now.<p>

The more specific characteristics of the simulation are as follows.

<ul>
  <li>Write-allocate: when a write miss occurs, the block written to
      is brought into the D1 cache. Most modern caches have this
      property.</li><p>

  <li>Bit-selection hash function: the line(s) in the cache to which a
      memory block maps are chosen by the middle bits M--(M+N-1) of the
      byte address, where:
      <ul>
        <li>line size = 2^M bytes</li>
        <li>(cache size / line size) = 2^N lines</li>
      </ul> </li><p>

  <li>Inclusive L2 cache: the L2 cache replicates all the entries of
      the L1 cache. This is standard on Pentium chips, but AMD
      Athlons use an exclusive L2 cache that only holds blocks evicted
      from L1. Ditto AMD Durons and most modern VIAs.</li><p>
</ul>
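<p>
As a rough illustration of the bit-selection rule above, the following Python
sketch picks out bits M--(M+N-1) of a byte address. The function name
<code>set_index</code> and its parameters are hypothetical and are not taken
from Cachegrind's source; it assumes both sizes are powers of two, and it takes
the number of index positions (2^N) directly as an argument.

```python
def set_index(addr, line_size, num_sets):
    """Return the index selected by bits M..(M+N-1) of a byte address.

    Hypothetical helper: line_size = 2**M bytes, num_sets = 2**N.
    """
    for value in (line_size, num_sets):
        if value <= 0 or value & (value - 1):
            raise ValueError("sizes must be powers of two")
    m = line_size.bit_length() - 1       # M: number of byte-offset bits
    return (addr >> m) & (num_sets - 1)  # drop the offset, keep the next N bits


if __name__ == "__main__":
    # 64-byte lines, 512 index positions (M = 6, N = 9).
    for address in (0x0, 0x3F, 0x40, 0x1234):
        print(hex(address), "->", set_index(address, 64, 512))
```

Addresses that differ only in their low M bits share a line, so they always
map to the same index; addresses 2^(M+N) bytes apart collide on the same index.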


The cache configuration simulated (cache size, associativity and line size) is
determined automagically using the CPUID instruction. If you have an old
machine that (a) doesn't support the CPUID instruction, or (b) supports it in
an early incarnation that doesn't give any cache information, then Cachegrind
will fall back to using a default configuration (that of a model 3/4 Athlon).
Cachegrind will tell you if this happens. You can manually specify one, two or
all three levels (I1/D1/L2) of the cache from the command line using the
<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.<p>

Other noteworthy behaviour:

<ul>
  <li>References that straddle two cache lines are treated as follows:
  <ul>
    <li>If both blocks hit --&gt; counted as one hit</li>
    <li>If one block hits, the other misses --&gt; counted as one miss</li>
    <li>If both blocks miss --&gt; counted as one miss (not two)</li>
  </ul><p></li>

  <li>Instructions that modify a memory location (eg. <code>inc</code> and
      <code>dec</code>) are counted as doing just a read, ie. a single data
      reference. This may seem strange, but since the write can never cause a
      miss (the read guarantees the block is in the cache) it's not very
      interesting.<p>

      Thus it measures not the number of times the data cache is accessed, but
      the number of times a data cache miss could occur.<p>
  </li>
</ul>
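<p>
The counting rule for straddling references can be sketched with a toy cache
model. This is an illustration only: the class <code>DirectMappedCache</code>
is hypothetical, and it is direct-mapped for brevity, whereas the real
simulation also models associativity. It loads a block on any miss, read or
write (write-allocate), and counts at most one miss per reference.

```python
class DirectMappedCache:
    """Toy direct-mapped cache (an assumption; not Cachegrind's simulator)."""

    def __init__(self, size, line_size):
        self.line_size = line_size
        self.num_lines = size // line_size
        self.blocks = [None] * self.num_lines  # block number held by each line
        self.refs = 0
        self.misses = 0

    def _touch(self, block):
        """Load one block if absent; return True if it was already present."""
        index = block % self.num_lines
        hit = self.blocks[index] == block
        self.blocks[index] = block
        return hit

    def access(self, addr, size):
        """Record one reference of `size` bytes starting at `addr`."""
        first = addr // self.line_size
        last = (addr + size - 1) // self.line_size
        # Touch every block the reference covers so that all of them get loaded.
        hits = [self._touch(block) for block in range(first, last + 1)]
        self.refs += 1
        if not all(hits):
            self.misses += 1  # one miss per reference, even if two blocks missed


if __name__ == "__main__":
    cache = DirectMappedCache(size=1024, line_size=64)
    cache.access(60, 8)   # straddles blocks 0 and 1, both miss
    cache.access(60, 8)   # both blocks now hit
    cache.access(124, 8)  # block 1 hits, block 2 misses
    print(cache.refs, cache.misses)
```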


If you are interested in simulating a cache with different properties, it is
not particularly hard to write your own cache simulator, or to modify the
existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
interested to hear from anyone who does.


<h3>4.4  Profiling programs</h3>


Cache profiling is enabled by using the <code>--skin=cachegrind</code>
option to the <code>valgrind</code> shell script. To gather cache profiling
information about the program <code>ls -l</code>, type:

<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>

The program will execute (slowly). Upon completion, summary statistics
that look like this will be printed:


<pre>
==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751==
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751==
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
</pre>


Cache accesses for instruction fetches are summarised first, giving the
number of fetches made (this is the number of instructions executed, which
can be useful to know in its own right), the number of I1 misses, and the
number of L2 instruction (<code>L2i</code>) misses.<p>

Cache accesses for data follow. The information is similar to that of the
instruction fetches, except that the values are also shown split between reads
and writes (note each row's <code>rd</code> and <code>wr</code> values add up
to the row's total).<p>

Combined instruction and data figures for the L2 cache follow that.<p>
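<p>
The miss rates in the sample summary can be reproduced from the raw counts.
The sketch below is a reading of the sample figures, not Cachegrind's own
code: it assumes a rate is misses divided by references, and that the printed
percentage is truncated (not rounded) to one decimal place, which is what the
sample numbers are consistent with. The helper name <code>miss_rate</code> is
hypothetical.

```python
def miss_rate(misses, refs):
    """Format misses/refs as a percentage with one decimal place, truncated."""
    if refs == 0:
        return "0.0%"
    tenths = misses * 1000 // refs  # rate in tenths of a percent, rounded down
    return f"{tenths // 10}.{tenths % 10}%"


# Figures taken from the sample summary for `ls -l`.
i_refs, d_reads, d_writes = 27_742_716, 10_955_517, 4_474_773
d1_read_misses, d1_write_misses = 21_905, 19_280
l2_misses = 275 + 3_987 + 19_098  # instruction + data read + data write

if __name__ == "__main__":
    print("D1 miss rate:",
          miss_rate(d1_read_misses + d1_write_misses, d_reads + d_writes))
    print("L2 miss rate:",
          miss_rate(l2_misses, i_refs + d_reads + d_writes))
```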


<h3>4.5  Output file</h3>


As well as printing summary information, Cachegrind also writes
line-by-line cache profiling information to a file named
<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
best interpreted by the accompanying program <code>cg_annotate</code>,
described in the next section.
<p>
Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
<ul>
  <li>It is written every time <code>valgrind --skin=cachegrind</code>
      is run, and will overwrite any existing
      <code>cachegrind.out.<i>pid</i></code> in the current directory (but
      that won't happen very often because it takes some time for process ids
      to be recycled).</li>
  <p>
  <li>It can be huge: <code>ls -l</code> generates a file of about
      350KB. Browsing a few files and web pages with a Konqueror
      built with full debugging information generates a file
      of around 15 MB.</li>
</ul>

Note that older versions of Cachegrind used a log file named
<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
The suffix serves two purposes. Firstly, it means you don't have to
rename old log files that you don't want to overwrite. Secondly, and
more importantly, it allows correct profiling with the
<code>--trace-children=yes</code> option of programs that spawn child
processes.


<h3>4.6  Cachegrind options</h3>


Cache-simulation specific options are:

<ul>
  <li><code>--I1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
      <code>--D1=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><br>
      <code>--L2=&lt;size&gt;,&lt;associativity&gt;,&lt;line_size&gt;</code><p>
      [default: uses CPUID for automagic cache configuration]<p>

      Manually specifies the I1/D1/L2 cache configuration, where
      <code>size</code> and <code>line_size</code> are measured in bytes. The
      three items must be comma-separated, but with no spaces, eg:

      <blockquote>
      <code>valgrind --skin=cachegrind --I1=65536,2,64</code>
      </blockquote>

      You can specify one, two or three of the I1/D1/L2 caches. Any level not
      manually specified will be simulated using the configuration found in the
      normal way (via the CPUID instruction, or failing that, via defaults).
  </li>
</ul>
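<p>
To make the value format concrete, here is a small Python sketch that splits
such a specification into its three numbers. It is not Cachegrind's parser:
the names <code>CacheConfig</code> and <code>parse_cache_spec</code> are
hypothetical, and the only checks made are the ones stated above (three
comma-separated items, no spaces) plus the assumption that each item is a
positive decimal integer.

```python
from typing import NamedTuple


class CacheConfig(NamedTuple):
    size: int       # total cache size in bytes
    assoc: int      # associativity (number of ways)
    line_size: int  # line size in bytes


def parse_cache_spec(spec):
    """Parse a 'size,associativity,line_size' string (hypothetical helper)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated items")
    for part in parts:
        # Rejects spaces, signs and empty items in one go.
        if not (part.isascii() and part.isdigit()) or int(part) == 0:
            raise ValueError(f"not a positive integer: {part!r}")
    return CacheConfig(*(int(part) for part in parts))


if __name__ == "__main__":
    print(parse_cache_spec("262144,8,64"))
```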


<h3>4.7  Annotating C/C++ programs</h3>


Before using <code>cg_annotate</code>, it is worth widening your
window to be at least 120 characters wide if possible, as the output
lines can be quite long.
<p>
To get a function-by-function summary, run <code>cg_annotate
--<i>pid</i></code> in a directory containing a
<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
is required so that <code>cg_annotate</code> knows which log file to use when
several are present.
<p>
The output looks like this:


<pre>
--------------------------------------------------------------------------------
I1 cache:              65536 B, 64 B, 2-way associative
D1 cache:              65536 B, 64 B, 2-way associative
L2 cache:              262144 B, 64 B, 8-way associative
Command:               concord vg_to_ucode.c
Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold:             99%
Chosen for annotation:
Auto-annotation:       off

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
--------------------------------------------------------------------------------
27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw    file:function
--------------------------------------------------------------------------------
 8,821,482    5    5  2,242,702  1,621    73 1,794,230      0      0  getc.c:_IO_getc
 5,222,023    4    4  2,276,334     16    12   875,959      1      1  concord.c:get_word
 2,649,248    2    2  1,344,810  7,326 1,385         .      .      .  vg_main.c:strcmp
 2,521,927    2    2    591,215      0     0   179,398      0      0  concord.c:hash
 2,242,740    2    2  1,046,612    568    22   448,548      0      0  ctype.c:tolower
 1,496,937    4    4    630,874  9,000 1,400   279,388      0      0  concord.c:insert
   897,991   51   51    897,831     95    30        62      1      1  ???:???
   598,068    1    1    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
   598,068    0    0    299,034      0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
   598,024    4    4    213,580     35    16   149,506      0      0  vg_clientmalloc.c:malloc
   446,587    1    1    215,973  2,167   430   129,948 14,057 13,957  concord.c:add_existing
   341,760    2    2    128,160      0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
   320,782    4    4    150,711    276     0    56,027     53     53  concord.c:init_hash_table
   298,998    1    1    106,785      0     0    64,071      1      1  concord.c:create
   149,518    0    0    149,516      0     0         1      0      0  ???:tolower@@GLIBC_2.0
   149,518    0    0    149,516      0     0         1      0      0  ???:fgetc@@GLIBC_2.0
    95,983    4    4     38,031      0     0    34,409  3,152  3,150  concord.c:new_word_node
    85,440    0    0     42,720      0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
</pre>


First up is a summary of the annotation options:

<ul>
  <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the
      configuration with which these results were obtained.</li><p>

  <li>Command: the command line invocation of the program under
      examination.</li><p>

  <li>Events recorded: event abbreviations are:<p>
  <ul>
    <li><code>Ir  </code>: I cache reads (ie. instructions executed)</li>
    <li><code>I1mr</code>: I1 cache read misses</li>
    <li><code>I2mr</code>: L2 cache instruction read misses</li>
    <li><code>Dr  </code>: D cache reads (ie. memory reads)</li>
    <li><code>D1mr</code>: D1 cache read misses</li>
    <li><code>D2mr</code>: L2 cache data read misses</li>
    <li><code>Dw  </code>: D cache writes (ie. memory writes)</li>
    <li><code>D1mw</code>: D1 cache write misses</li>
    <li><code>D2mw</code>: L2 cache data write misses</li>
  </ul><p>
      Note that D1 total misses is given by <code>D1mr</code> +
      <code>D1mw</code>, and that L2 total misses is given by
      <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>


  <li>Events shown: the events shown (a subset of events gathered). This can
      be adjusted with the <code>--show</code> option.</li><p>

  <li>Event sort order: the sort order in which functions are shown. For
      example, in this case the functions are sorted from highest
      <code>Ir</code> counts to lowest. If two functions have identical
      <code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
      counts, and so on. This order can be adjusted with the
      <code>--sort</code> option.<p>

      Note that this dictates the order the functions appear. It is <b>not</b>
      the order in which the columns appear; that is dictated by the "events
      shown" line (and can be changed with the <code>--show</code> option).
      </li><p>

  <li>Threshold: <code>cg_annotate</code> by default omits functions
      that cause very low numbers of misses to avoid drowning you in
      information. In this case, <code>cg_annotate</code> shows summaries for
      the functions that account for 99% of the <code>Ir</code> counts;
      <code>Ir</code> is chosen as the threshold event since it is the
      primary sort event. The threshold can be adjusted with the
      <code>--threshold</code> option.</li><p>

  <li>Chosen for annotation: names of files specified manually for annotation;
      in this case none.</li><p>

  <li>Auto-annotation: whether auto-annotation was requested via the
      <code>--auto=yes</code> option. In this case no.</li><p>
</ul>
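<p>
The interaction of the sort order and the threshold can be sketched as
follows. This is a guess at the behaviour described above, not
<code>cg_annotate</code>'s actual code: functions are sorted by the listed
events (highest first, later events breaking ties), and are shown until the
running total of the primary sort event reaches the threshold percentage of
the program total. The function name <code>select_functions</code> and the
sample counts are made up for the example.

```python
def select_functions(counts, sort_events, threshold_pct):
    """Return function names in display order, cut off at the threshold.

    counts maps a function name to a dict of event counts.
    """
    primary = sort_events[0]
    total = sum(events[primary] for events in counts.values())
    ordered = sorted(
        counts,
        key=lambda name: tuple(counts[name][event] for event in sort_events),
        reverse=True,
    )
    shown, running = [], 0
    for name in ordered:
        if running * 100 >= threshold_pct * total:
            break  # the functions shown so far already cover the threshold
        shown.append(name)
        running += counts[name][primary]
    return shown


if __name__ == "__main__":
    sample = {
        "a.c:hot":  {"Ir": 90, "I1mr": 1},
        "a.c:warm": {"Ir": 5,  "I1mr": 2},
        "b.c:tied": {"Ir": 5,  "I1mr": 7},
        "b.c:cold": {"Ir": 0,  "I1mr": 0},
    }
    print(select_functions(sample, ["Ir", "I1mr"], 99))
```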


Then follow summary statistics for the whole program. These are similar
to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>

Then follow function-by-function statistics. Each function is
identified by a <code>file_name:function_name</code> pair. If a column
contains only a dot it means the function never performs
that event (eg. the third row shows that <code>strcmp()</code>
contains no instructions that write to memory). The name
<code>???</code> is used if the file name and/or function name
could not be determined from debugging information. If most of the
entries have the form <code>???:???</code> the program probably wasn't
compiled with <code>-g</code>. If any code was invalidated (either due to
self-modifying code or unloading of shared objects) its counts are aggregated
into a single cost centre written as <code>(discarded):(discarded)</code>.<p>

It is worth noting that functions will come from three types of source files:
<ol>
  <li>From the profiled program (<code>concord.c</code> in this example).</li>
  <li>From libraries (eg. <code>getc.c</code>)</li>
  <li>From Valgrind's implementation of some libc functions (eg.
      <code>vg_clientmalloc.c:malloc</code>). These are recognisable because
      the filename begins with <code>vg_</code>, and is probably one of
      <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
      <code>vg_mylibc.c</code>.
  </li>
</ol>

There are two ways to annotate source files -- by choosing them
manually, or with the <code>--auto=yes</code> option. To do it
manually, just specify the filenames as arguments to
<code>cg_annotate</code>. For example, the output from running
<code>cg_annotate concord.c</code> for our example produces the same
output as above followed by an annotated version of
<code>concord.c</code>, a section of which looks like:


<pre>
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir      I1mr I2mr Dr     D1mr D2mr Dw     D1mw D2mw

[snip]

      .    .    .      .    .    .      .    .    .  void init_hash_table(char *file_name, Word_Node *table[])
      3    1    1      .    .    .      1    0    0  {
      .    .    .      .    .    .      .    .    .      FILE *file_ptr;
      .    .    .      .    .    .      .    .    .      Word_Info *data;
      1    0    0      .    .    .      1    1    1      int line = 1, i;
      .    .    .      .    .    .      .    .    .
      5    0    0      .    .    .      3    0    0      data = (Word_Info *) create(sizeof(Word_Info));
      .    .    .      .    .    .      .    .    .
  4,991    0    0  1,995    0    0    998    0    0      for (i = 0; i &lt; TABLE_SIZE; i++)
  3,988    1    1  1,994    0    0    997   53   52          table[i] = NULL;
      .    .    .      .    .    .      .    .    .
      .    .    .      .    .    .      .    .    .      /* Open file, check it. */
      6    0    0      1    0    0      4    0    0      file_ptr = fopen(file_name, "r");
      2    0    0      1    0    0      .    .    .      if (!(file_ptr)) {
      .    .    .      .    .    .      .    .    .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
      1    1    1      .    .    .      .    .    .          exit(EXIT_FAILURE);
      .    .    .      .    .    .      .    .    .      }
      .    .    .      .    .    .      .    .    .
165,062    1    1 73,360    0    0 91,700    0    0      while ((line = get_word(data, line, file_ptr)) != EOF)
146,712    0    0 73,356    0    0 73,356    0    0          insert(data-&gt;word, data-&gt;line, table);
      .    .    .      .    .    .      .    .    .
      4    0    0      1    0    0      2    0    0      free(data);
      4    0    0      1    0    0      2    0    0      fclose(file_ptr);
      3    0    0      2    0    0      .    .    .  }
</pre>


(Although column widths are automatically minimised, a wide terminal is clearly
useful.)<p>

Each source file is clearly marked (<code>User-annotated source</code>) as
having been chosen manually for annotation. If the file was found in one of
the directories specified with the <code>-I</code>/<code>--include</code>
option, the directory and file are both given.<p>

Each line is annotated with its event counts. Events not applicable for a line
are represented by a `.'; this is useful for distinguishing between an event
which cannot happen, and one which can but did not.<p>

Sometimes only a small section of a source file is executed. To minimise
uninteresting output, Valgrind only shows annotated lines and lines within a
small distance of annotated lines. Gaps are marked with the line numbers so
you know which part of a file the shown code comes from, eg:

<pre>
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)
</pre>

The amount of context to show around annotated lines is controlled by the
<code>--context</code> option.<p>
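<p>
One way to picture the context rule is as a merge of line ranges. The sketch
below is an assumption about the behaviour, not <code>cg_annotate</code>'s
code: every annotated line contributes the range of lines within
<code>context</code> of it, and overlapping or adjacent ranges are merged, so
a gap marker would appear between each pair of resulting ranges. The name
<code>shown_ranges</code> is hypothetical.

```python
def shown_ranges(annotated_lines, context, last_line):
    """Return merged (start, end) line ranges, inclusive and 1-based."""
    ranges = []
    for line in sorted(annotated_lines):
        start = max(1, line - context)
        end = min(last_line, line + context)
        if ranges and start <= ranges[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Two annotated lines far apart, with 8 lines of context each.
    print(shown_ranges([696, 886], context=8, last_line=1000))
```

With these made-up inputs the first range ends at line 704 and the second
starts at line 878, which is the kind of gap shown in the example above.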


To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
cg_annotate will automatically annotate every source file it can find that is
mentioned in the function-by-function summary. Therefore, the files chosen for
auto-annotation are affected by the <code>--sort</code> and
<code>--threshold</code> options. Each source file is clearly marked
(<code>Auto-annotated source</code>) as being chosen automatically. Any files
that could not be found are mentioned at the end of the output, eg:

<pre>
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
  getc.c
  ctype.c
  ../sysdeps/generic/lockfile.c
</pre>

This is quite common for library files, since libraries are usually compiled
with debugging information, but the source files are often not present on a
system. If a file is chosen for annotation <b>both</b> manually and
automatically, it is marked as <code>User-annotated source</code>.<p>

Use the <code>-I/--include</code> option to tell Valgrind where to look for
source files if the filenames found from the debugging information aren't
specific enough.<p>

Beware that cg_annotate can take some time to digest large
<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
beware that auto-annotation can produce a lot of output if your program is
large!


<h3>4.8  Annotating assembler programs</h3>


Valgrind can annotate assembler programs too, or annotate the
assembler generated for your C program. Sometimes this is useful for
understanding what is really happening when an interesting line of C
code is translated into multiple instructions.<p>

To do this, you just need to assemble your <code>.s</code> files with
assembler-level debug information. gcc doesn't do this, but you can
use the GNU assembler with the <code>--gstabs</code> option to
generate object files with this information, eg:

<blockquote><code>as --gstabs foo.s</code></blockquote>

You can then profile and annotate source files in the same way as for C/C++
programs.


<h3>4.9  <code>cg_annotate</code> options</h3>


<ul>
  <li><code>--<i>pid</i></code><p>

      Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
      Not actually an option -- it is required.</li><p>

  <li><code>-h, --help</code><br>
      <code>-v, --version</code><p>

      Help and version, as usual.</li><p>

  <li><code>--sort=A,B,C</code> [default: order in
      <code>cachegrind.out.<i>pid</i></code>]<p>
      Specifies the events upon which the sorting of the function-by-function
      entries will be based. Useful if you want to concentrate on eg. I cache
      misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
      (<code>--sort=D1mr,D2mr</code>), or L2 misses
      (<code>--sort=D2mr,I2mr</code>).</li><p>

  <li><code>--show=A,B,C</code> [default: all, using order in
      <code>cachegrind.out.<i>pid</i></code>]<p>
      Specifies which events to show (and the column order). Default is to use
      all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
      the order in the file).</li><p>

  <li><code>--threshold=X</code> [default: 99%] <p>
      Sets the threshold for the function-by-function summary. The functions
      shown are those that together account for X% of the primary sort
      event. If auto-annotating, also affects which files are annotated.<p>

      Note: thresholds can be set for more than one of the events by appending
      any events for the <code>--sort</code> option with a colon and a number
      (no spaces, though). E.g. if you want to see the functions that cover
      99% of L2 read misses and 99% of L2 write misses, use this option:

      <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
      </li><p>

  <li><code>--auto=no</code> [default]<br>
      <code>--auto=yes</code> <p>
      When enabled, automatically annotates every file that is mentioned in the
      function-by-function summary that can be found. Also gives a list of
      those that couldn't be found.</li><p>

  <li><code>--context=N</code> [default: 8]<p>
      Print N lines of context before and after each annotated line. Avoids
      printing large sections of source files that were not executed. Use a
      large number (eg. 10,000) to show all source lines.
      </li><p>

  <li><code>-I=&lt;dir&gt;, --include=&lt;dir&gt;</code>
      [default: empty string]<p>
      Adds a directory to the list in which to search for files. Multiple
      -I/--include options can be given to add multiple directories.</li>
</ul>


<h3>4.10  Warnings</h3>


There are a couple of situations in which cg_annotate issues warnings.

<ul>
  <li>If a source file is more recent than the
      <code>cachegrind.out.<i>pid</i></code> file. This is because the
      information in <code>cachegrind.out.<i>pid</i></code> is only recorded
      with line numbers, so if the line numbers change at all in the source
      (eg. lines added, deleted, swapped), any annotations will be
      incorrect.</li><p>

  <li>If information is recorded about line numbers past the end of a file.
      This can be caused by the above problem, ie. shortening the source file
      while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
      happens, the figures for the bogus lines are printed anyway (clearly
      marked as bogus) in case they are important.</li><p>
</ul>
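<p>
Both conditions are easy to check by hand before annotating. The Python
sketch below is a stand-alone illustration, not part of
<code>cg_annotate</code>; the function names are hypothetical, and it assumes
that "more recent" means a later file modification time.

```python
import os


def source_is_newer(source_path, profile_path):
    """True if the source file was modified after the profile was written."""
    return os.path.getmtime(source_path) > os.path.getmtime(profile_path)


def lines_past_end(recorded_lines, source_path):
    """Return the recorded line numbers that lie beyond the end of the file."""
    with open(source_path, "rb") as source:
        line_count = sum(1 for _ in source)
    return sorted(n for n in recorded_lines if n > line_count)


if __name__ == "__main__":
    import sys

    # Usage (file names are examples): python check.py concord.c cachegrind.out.1234
    if len(sys.argv) == 3 and source_is_newer(sys.argv[1], sys.argv[2]):
        print("warning: source file is newer than the profile output")
```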


<h3>4.11  Things to watch out for</h3>


Some odd things that can occur during annotation:

<ul>
  <li>If annotating at the assembler level, you might see something like this:

      <pre>
      1  0  0  .  .  .  .  .  .    leal -12(%ebp),%eax
      1  0  0  .  .  .  1  0  0    movl %eax,84(%ebx)
      2  0  0  0  0  0  1  0  0    movl $1,-20(%ebp)
      .  .  .  .  .  .  .  .  .    .align 4,0x90
      1  0  0  .  .  .  .  .  .    movl $.LnrB,%eax
      1  0  0  .  .  .  1  0  0    movl %eax,-16(%ebp)
      </pre>

      How can the third instruction be executed twice when the others are
      executed only once? As it turns out, it isn't. Here's a dump of the
      executable, using <code>objdump -d</code>:

      <pre>
      8048f25:       8d 45 f4                lea    0xfffffff4(%ebp),%eax
      8048f28:       89 43 54                mov    %eax,0x54(%ebx)
      8048f2b:       c7 45 ec 01 00 00 00    movl   $0x1,0xffffffec(%ebp)
      8048f32:       89 f6                   mov    %esi,%esi
      8048f34:       b8 08 8b 07 08          mov    $0x8078b08,%eax
      8048f39:       89 45 f0                mov    %eax,0xfffffff0(%ebp)
      </pre>

      Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
      come from? The GNU assembler inserted it to serve as the two bytes of
      padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
      a four-byte boundary, but pretended it didn't exist when adding debug
      information. Thus when Valgrind reads the debug info it thinks that the
      <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
      range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
      <code>mov %esi,%esi</code> to it.<p>
  </li>


  <li>Inlined functions can cause strange results in the function-by-function
      summary. If a function <code>inline_me()</code> is defined in
      <code>foo.h</code> and inlined in the functions <code>f1()</code>,
      <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
      not be a <code>foo.h:inline_me()</code> function entry. Instead, there
      will be separate function entries for each inlining site, ie.
      <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
      <code>foo.h:f3()</code>. To find the total counts for
      <code>foo.h:inline_me()</code>, add up the counts from each entry.<p>

      The reason for this is that although the debug info output by gcc
      indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
      doesn't indicate the name of the function in <code>foo.h</code>, so
      Valgrind keeps using the old one.<p>
  </li>

  <li>Sometimes, the same filename might be represented with a relative name
      and with an absolute name in different parts of the debug info, eg:
      <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
      case, if you use auto-annotation, the file will be annotated twice with
      the counts split between the two.<p>
  </li>

  <li>Files with more than 65,535 lines cause difficulties for the stabs debug
      info reader. This is because the line number in the <code>struct
      nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
      value. Valgrind can handle some files with more than 65,535 lines
      correctly by making some guesses to identify line number overflows. But
      some cases are beyond it, in which case you'll get a warning message
      explaining that annotations for the file might be incorrect.<p>
  </li>

  <li>If you compile some files with <code>-g</code> and some without, some
      events that take place in a file without debug info could be attributed
      to the last line of a file with debug info (whichever one gets placed
      before the non-debug-info file in the executable).<p>
  </li>
</ul>


This list looks long, but these cases should be fairly rare.<p>

Note: stabs is not an easy format to read. If you come across bizarre
annotations that look like they might be caused by a bug in the stabs reader,
please let us know.<p>


<h3>4.12  Accuracy</h3>


Valgrind's cache profiling has a number of shortcomings:

<ul>
  <li>It doesn't account for kernel activity -- the effect of system calls on
      the cache contents is ignored.</li><p>

  <li>It doesn't account for other process activity (although this is probably
      desirable when considering a single program).</li><p>

  <li>It doesn't account for virtual-to-physical address mappings; hence the
      entire simulation is not a true representation of what's happening in the
      cache.</li><p>

  <li>It doesn't account for cache misses not visible at the instruction level,
      eg. those arising from TLB misses, or speculative execution.</li><p>

  <li>Valgrind's custom <code>malloc()</code> will allocate memory in different
      ways to the standard <code>malloc()</code>, which could warp the results.
      </li><p>

  <li>Valgrind's custom threads implementation will schedule threads
      differently to the standard one. This too could warp the results for
      threaded programs.
      </li><p>

  <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
      will incorrectly be counted as doing a data read if both the arguments
      are registers. This should only happen rarely.
      </li><p>

  <li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
      <code>fsave</code>) are treated as though they only access 16 bytes.
      These instructions seem to be rare so hopefully this won't affect
      accuracy much.
      </li><p>
</ul>

Another thing worth noting is that results are very sensitive. Changing the
size of the <code>valgrind.so</code> file, the size of the program being
profiled, or even the length of its name can perturb the results. Variations
will be small, but don't expect perfectly repeatable results if your program
changes at all.<p>

While these factors mean you shouldn't trust the results to be super-accurate,
hopefully they should be close enough to be useful.<p>


<ul>
  <li>Program start-up/shut-down calls a lot of functions that aren't
      interesting and just complicate the output. Would be nice to exclude
      these somehow.</li>
  <p>
</ul>
</body>
</html>