sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 1 | <html> |
| 2 | <head> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 3 | <title>Cachegrind: a cache-miss profiler</title> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 4 | </head> |
| 5 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 6 | <body> |
| 7 | <a name="cg-top"></a> |
| 8 | <h2>4 <b>Cachegrind</b>: a cache-miss profiler</h2> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 9 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 10 | To use this skin, you must specify <code>--skin=cachegrind</code> |
| 11 | on the Valgrind command line. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 12 | |
| 13 | <p> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 14 | Detailed technical documentation on how Cachegrind works is available |
| 15 | <A HREF="cg_techdocs.html">here</A>. If you want to know how |
| 16 | to <b>use</b> it, you only need to read this page. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 17 | |
| 18 | |
| 19 | <a name="cache"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 20 | <h3>4.1 Cache profiling</h3> |
| 21 | Cachegrind is a tool for doing cache simulations and annotating your source |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 22 | line-by-line with the number of cache misses. In particular, it records: |
| 23 | <ul> |
| 24 | <li>L1 instruction cache reads and misses; |
| 25 | <li>L1 data cache reads and read misses, writes and write misses; |
| 26 | <li>L2 unified cache reads and read misses, writes and writes misses. |
| 27 | </ul> |
| 28 | On a modern x86 machine, an L1 miss will typically cost around 10 cycles, |
| 29 | and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be |
| 30 | very useful for improving the performance of your program.<p> |
| 31 | |
| 32 | Also, since one instruction cache read is performed per instruction executed, |
| 33 | you can find out how many instructions are executed per line, which can be |
| 34 | useful for traditional profiling and test coverage.<p> |
| 35 | |
| 36 | Any feedback, bug-fixes, suggestions, etc, welcome. |
| 37 | |
| 38 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 39 | <h3>4.2 Overview</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 40 | First off, as for normal Valgrind use, you probably want to compile with |
| 41 | debugging info (the <code>-g</code> flag). But by contrast with normal |
| 42 | Valgrind use, you probably <b>do</b> want to turn optimisation on, since you |
| 43 | should profile your program as it will be normally run. |
| 44 | |
| 45 | The two steps are: |
| 46 | <ol> |
| 47 | <li>Run your program with <code>valgrind --skin=cachegrind</code> in front of |
| 48 | the normal command line invocation. When the program finishes, |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 49 | Cachegrind will print summary cache statistics. It also collects |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 50 | line-by-line information in a file |
| 51 | <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code> |
| 52 | is the program's process id. |
| 53 | <p> |
| 54 | This step should be done every time you want to collect |
| 55 | information about a new program, a changed program, or about the |
| 56 | same program with different input. |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 57 | </li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 58 | <li>Generate a function-by-function summary, and possibly annotate |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 59 | source files, using the supplied |
| 60 | <code>cg_annotate</code> program. Source files to annotate can be |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 61 | specified manually, or manually on the command line, or |
| 62 | "interesting" source files can be annotated automatically with |
| 63 | the <code>--auto=yes</code> option. You can annotate C/C++ |
| 64 | files or assembly language files equally easily. |
| 65 | <p> |
| 66 | This step can be performed as many times as you like for each |
| 67 | Step 2. You may want to do multiple annotations showing |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 68 | different information each time. |
| 69 | </li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 70 | </ol> |
| 71 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 72 | The steps are described in detail in the following sections. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 73 | |
| 74 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 75 | <h3>4.3 Cache simulation specifics</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 76 | |
| 77 | Cachegrind uses a simulation for a machine with a split L1 cache and a unified |
| 78 | L2 cache. This configuration is used for all (modern) x86-based machines we |
| 79 | are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are |
| 80 | ancient history now.<p> |
| 81 | |
| 82 | The more specific characteristics of the simulation are as follows. |
| 83 | |
| 84 | <ul> |
| 85 | <li>Write-allocate: when a write miss occurs, the block written to |
| 86 | is brought into the D1 cache. Most modern caches have this |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 87 | property.<p> |
| 88 | </li> |
| 89 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 90 | <li>Bit-selection hash function: the line(s) in the cache to which a |
| 91 | memory block maps is chosen by the middle bits M--(M+N-1) of the |
| 92 | byte address, where: |
| 93 | <ul> |
| 94 | <li> line size = 2^M bytes </li> |
| 95 | <li>(cache size / line size) = 2^N bytes</li> |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 96 | </ul> |
| 97 | </li> |
| 98 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 99 | <li>Inclusive L2 cache: the L2 cache replicates all the entries of |
| 100 | the L1 cache. This is standard on Pentium chips, but AMD |
| 101 | Athlons use an exclusive L2 cache that only holds blocks evicted |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 102 | from L1. Ditto AMD Durons and most modern VIAs.</li> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 103 | </ul> |
| 104 | |
| 105 | The cache configuration simulated (cache size, associativity and line size) is |
| 106 | determined automagically using the CPUID instruction. If you have an old |
| 107 | machine that (a) doesn't support the CPUID instruction, or (b) supports it in |
| 108 | an early incarnation that doesn't give any cache information, then Cachegrind |
| 109 | will fall back to using a default configuration (that of a model 3/4 Athlon). |
| 110 | Cachegrind will tell you if this happens. You can manually specify one, two or |
| 111 | all three levels (I1/D1/L2) of the cache from the command line using the |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 112 | <code>--I1</code>, <code>--D1</code> and <code>--L2</code> options. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 113 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 114 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 115 | Other noteworthy behaviour: |
| 116 | |
| 117 | <ul> |
| 118 | <li>References that straddle two cache lines are treated as follows: |
| 119 | <ul> |
| 120 | <li>If both blocks hit --> counted as one hit</li> |
| 121 | <li>If one block hits, the other misses --> counted as one miss</li> |
| 122 | <li>If both blocks miss --> counted as one miss (not two)</li> |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 123 | </ul> |
| 124 | </li> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 125 | |
| 126 | <li>Instructions that modify a memory location (eg. <code>inc</code> and |
| 127 | <code>dec</code>) are counted as doing just a read, ie. a single data |
| 128 | reference. This may seem strange, but since the write can never cause a |
| 129 | miss (the read guarantees the block is in the cache) it's not very |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 130 | interesting. |
| 131 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 132 | Thus it measures not the number of times the data cache is accessed, but |
| 133 | the number of times a data cache miss could occur.<p> |
| 134 | </li> |
| 135 | </ul> |
| 136 | |
| 137 | If you are interested in simulating a cache with different properties, it is |
| 138 | not particularly hard to write your own cache simulator, or to modify the |
| 139 | existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>, |
| 140 | <code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be |
| 141 | interested to hear from anyone who does. |
| 142 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 143 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 144 | <a name="profile"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 145 | <h3>4.4 Profiling programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 146 | |
| 147 | Cache profiling is enabled by using the <code>--skin=cachegrind</code> |
| 148 | option to the <code>valgrind</code> shell script. To gather cache profiling |
| 149 | information about the program <code>ls -l</code>, type: |
| 150 | |
| 151 | <blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote> |
| 152 | |
| 153 | The program will execute (slowly). Upon completion, summary statistics |
| 154 | that look like this will be printed: |
| 155 | |
| 156 | <pre> |
| 157 | ==31751== I refs: 27,742,716 |
| 158 | ==31751== I1 misses: 276 |
| 159 | ==31751== L2 misses: 275 |
| 160 | ==31751== I1 miss rate: 0.0% |
| 161 | ==31751== L2i miss rate: 0.0% |
| 162 | ==31751== |
| 163 | ==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr) |
| 164 | ==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr) |
| 165 | ==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr) |
| 166 | ==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%) |
| 167 | ==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%) |
| 168 | ==31751== |
| 169 | ==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr) |
| 170 | ==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%) |
| 171 | </pre> |
| 172 | |
| 173 | Cache accesses for instruction fetches are summarised first, giving the |
| 174 | number of fetches made (this is the number of instructions executed, which |
| 175 | can be useful to know in its own right), the number of I1 misses, and the |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 176 | number of L2 instruction (<code>L2i</code>) misses. |
| 177 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 178 | Cache accesses for data follow. The information is similar to that of the |
| 179 | instruction fetches, except that the values are also shown split between reads |
| 180 | and writes (note each row's <code>rd</code> and <code>wr</code> values add up |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 181 | to the row's total). |
| 182 | <p> |
| 183 | Combined instruction and data figures for the L2 cache follow that. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 184 | |
| 185 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 186 | <h3>4.5 Output file</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 187 | |
| 188 | As well as printing summary information, Cachegrind also writes |
| 189 | line-by-line cache profiling information to a file named |
| 190 | <code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is |
| 191 | best interpreted by the accompanying program <code>cg_annotate</code>, |
| 192 | described in the next section. |
| 193 | <p> |
| 194 | Things to note about the <code>cachegrind.out.<i>pid</i></code> file: |
| 195 | <ul> |
| 196 | <li>It is written every time <code>valgrind --skin=cachegrind</code> |
| 197 | is run, and will overwrite any existing |
| 198 | <code>cachegrind.out.<i>pid</i></code> in the current directory (but |
| 199 | that won't happen very often because it takes some time for process ids |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 200 | to be recycled).</li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 201 | <li>It can be huge: <code>ls -l</code> generates a file of about |
| 202 | 350KB. Browsing a few files and web pages with a Konqueror |
| 203 | built with full debugging information generates a file |
| 204 | of around 15 MB.</li> |
| 205 | </ul> |
| 206 | |
| 207 | Note that older versions of Cachegrind used a log file named |
| 208 | <code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix). |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 209 | The suffix serves two purposes. Firstly, it means you don't have to |
| 210 | rename old log files that you don't want to overwrite. Secondly, and |
| 211 | more importantly, it allows correct profiling with the |
| 212 | <code>--trace-children=yes</code> option of programs that spawn child |
| 213 | processes. |
| 214 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 215 | |
| 216 | <a name="profileflags"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 217 | <h3>4.6 Cachegrind options</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 218 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 219 | Cache-simulation specific options are: |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 220 | |
| 221 | <ul> |
| 222 | <li><code>--I1=<size>,<associativity>,<line_size></code><br> |
| 223 | <code>--D1=<size>,<associativity>,<line_size></code><br> |
| 224 | <code>--L2=<size>,<associativity>,<line_size></code><p> |
| 225 | [default: uses CPUID for automagic cache configuration]<p> |
| 226 | |
| 227 | Manually specifies the I1/D1/L2 cache configuration, where |
| 228 | <code>size</code> and <code>line_size</code> are measured in bytes. The |
| 229 | three items must be comma-separated, but with no spaces, eg: |
| 230 | |
| 231 | <blockquote> |
| 232 | <code>valgrind --skin=cachegrind --I1=65535,2,64</code> |
| 233 | </blockquote> |
| 234 | |
| 235 | You can specify one, two or three of the I1/D1/L2 caches. Any level not |
| 236 | manually specified will be simulated using the configuration found in the |
| 237 | normal way (via the CPUID instruction, or failing that, via defaults). |
| 238 | </ul> |
| 239 | |
| 240 | |
| 241 | <a name="annotate"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 242 | <h3>4.7 Annotating C/C++ programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 243 | |
| 244 | Before using <code>cg_annotate</code>, it is worth widening your |
| 245 | window to be at least 120-characters wide if possible, as the output |
| 246 | lines can be quite long. |
| 247 | <p> |
| 248 | To get a function-by-function summary, run <code>cg_annotate |
| 249 | --<i>pid</i></code> in a directory containing a |
| 250 | <code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code> |
| 251 | is required so that <code>cg_annotate</code> knows which log file to use when |
| 252 | several are present. |
| 253 | <p> |
| 254 | The output looks like this: |
| 255 | |
| 256 | <pre> |
| 257 | -------------------------------------------------------------------------------- |
| 258 | I1 cache: 65536 B, 64 B, 2-way associative |
| 259 | D1 cache: 65536 B, 64 B, 2-way associative |
| 260 | L2 cache: 262144 B, 64 B, 8-way associative |
| 261 | Command: concord vg_to_ucode.c |
| 262 | Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 263 | Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 264 | Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 265 | Threshold: 99% |
| 266 | Chosen for annotation: |
| 267 | Auto-annotation: on |
| 268 | |
| 269 | -------------------------------------------------------------------------------- |
| 270 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 271 | -------------------------------------------------------------------------------- |
| 272 | 27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS |
| 273 | |
| 274 | -------------------------------------------------------------------------------- |
| 275 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function |
| 276 | -------------------------------------------------------------------------------- |
| 277 | 8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc |
| 278 | 5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word |
| 279 | 2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp |
| 280 | 2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash |
| 281 | 2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower |
| 282 | 1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert |
| 283 | 897,991 51 51 897,831 95 30 62 1 1 ???:??? |
| 284 | 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile |
| 285 | 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile |
| 286 | 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc |
| 287 | 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing |
| 288 | 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER |
| 289 | 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table |
| 290 | 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create |
| 291 | 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 |
| 292 | 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 |
| 293 | 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node |
| 294 | 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue |
| 295 | </pre> |
| 296 | |
| 297 | First up is a summary of the annotation options: |
| 298 | |
| 299 | <ul> |
| 300 | <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the |
| 301 | configuration with which these results were obtained.</li><p> |
| 302 | |
| 303 | <li>Command: the command line invocation of the program under |
| 304 | examination.</li><p> |
| 305 | |
| 306 | <li>Events recorded: event abbreviations are:<p> |
| 307 | <ul> |
| 308 | <li><code>Ir </code>: I cache reads (ie. instructions executed)</li> |
| 309 | <li><code>I1mr</code>: I1 cache read misses</li> |
| 310 | <li><code>I2mr</code>: L2 cache instruction read misses</li> |
| 311 | <li><code>Dr </code>: D cache reads (ie. memory reads)</li> |
| 312 | <li><code>D1mr</code>: D1 cache read misses</li> |
| 313 | <li><code>D2mr</code>: L2 cache data read misses</li> |
| 314 | <li><code>Dw </code>: D cache writes (ie. memory writes)</li> |
| 315 | <li><code>D1mw</code>: D1 cache write misses</li> |
| 316 | <li><code>D2mw</code>: L2 cache data write misses</li> |
| 317 | </ul><p> |
| 318 | Note that D1 total accesses is given by <code>D1mr</code> + |
| 319 | <code>D1mw</code>, and that L2 total accesses is given by |
| 320 | <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p> |
| 321 | |
| 322 | <li>Events shown: the events shown (a subset of events gathered). This can |
| 323 | be adjusted with the <code>--show</code> option.</li><p> |
| 324 | |
| 325 | <li>Event sort order: the sort order in which functions are shown. For |
| 326 | example, in this case the functions are sorted from highest |
| 327 | <code>Ir</code> counts to lowest. If two functions have identical |
| 328 | <code>Ir</code> counts, they will then be sorted by <code>I1mr</code> |
| 329 | counts, and so on. This order can be adjusted with the |
| 330 | <code>--sort</code> option.<p> |
| 331 | |
| 332 | Note that this dictates the order the functions appear. It is <b>not</b> |
| 333 | the order in which the columns appear; that is dictated by the "events |
| 334 | shown" line (and can be changed with the <code>--show</code> option). |
| 335 | </li><p> |
| 336 | |
| 337 | <li>Threshold: <code>cg_annotate</code> by default omits functions |
| 338 | that cause very low numbers of misses to avoid drowning you in |
| 339 | information. In this case, cg_annotate shows summaries the |
| 340 | functions that account for 99% of the <code>Ir</code> counts; |
| 341 | <code>Ir</code> is chosen as the threshold event since it is the |
| 342 | primary sort event. The threshold can be adjusted with the |
| 343 | <code>--threshold</code> option.</li><p> |
| 344 | |
| 345 | <li>Chosen for annotation: names of files specified manually for annotation; |
| 346 | in this case none.</li><p> |
| 347 | |
| 348 | <li>Auto-annotation: whether auto-annotation was requested via the |
| 349 | <code>--auto=yes</code> option. In this case no.</li><p> |
| 350 | </ul> |
| 351 | |
| 352 | Then follows summary statistics for the whole program. These are similar |
| 353 | to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p> |
| 354 | |
| 355 | Then follows function-by-function statistics. Each function is |
| 356 | identified by a <code>file_name:function_name</code> pair. If a column |
| 357 | contains only a dot it means the function never performs |
| 358 | that event (eg. the third row shows that <code>strcmp()</code> |
| 359 | contains no instructions that write to memory). The name |
| 360 | <code>???</code> is used if the the file name and/or function name |
| 361 | could not be determined from debugging information. If most of the |
| 362 | entries have the form <code>???:???</code> the program probably wasn't |
| 363 | compiled with <code>-g</code>. If any code was invalidated (either due to |
| 364 | self-modifying code or unloading of shared objects) its counts are aggregated |
| 365 | into a single cost centre written as <code>(discarded):(discarded)</code>.<p> |
| 366 | |
| 367 | It is worth noting that functions will come from three types of source files: |
| 368 | <ol> |
| 369 | <li> From the profiled program (<code>concord.c</code> in this example).</li> |
| 370 | <li>From libraries (eg. <code>getc.c</code>)</li> |
| 371 | <li>From Valgrind's implementation of some libc functions (eg. |
| 372 | <code>vg_clientmalloc.c:malloc</code>). These are recognisable because |
| 373 | the filename begins with <code>vg_</code>, and is probably one of |
| 374 | <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or |
| 375 | <code>vg_mylibc.c</code>. |
| 376 | </li> |
| 377 | </ol> |
| 378 | |
| 379 | There are two ways to annotate source files -- by choosing them |
| 380 | manually, or with the <code>--auto=yes</code> option. To do it |
| 381 | manually, just specify the filenames as arguments to |
| 382 | <code>cg_annotate</code>. For example, the output from running |
| 383 | <code>cg_annotate concord.c</code> for our example produces the same |
| 384 | output as above followed by an annotated version of |
| 385 | <code>concord.c</code>, a section of which looks like: |
| 386 | |
| 387 | <pre> |
| 388 | -------------------------------------------------------------------------------- |
| 389 | -- User-annotated source: concord.c |
| 390 | -------------------------------------------------------------------------------- |
| 391 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 392 | |
| 393 | [snip] |
| 394 | |
| 395 | . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[]) |
| 396 | 3 1 1 . . . 1 0 0 { |
| 397 | . . . . . . . . . FILE *file_ptr; |
| 398 | . . . . . . . . . Word_Info *data; |
| 399 | 1 0 0 . . . 1 1 1 int line = 1, i; |
| 400 | . . . . . . . . . |
| 401 | 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info)); |
| 402 | . . . . . . . . . |
| 403 | 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++) |
| 404 | 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL; |
| 405 | . . . . . . . . . |
| 406 | . . . . . . . . . /* Open file, check it. */ |
| 407 | 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r"); |
| 408 | 2 0 0 1 0 0 . . . if (!(file_ptr)) { |
| 409 | . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name); |
| 410 | 1 1 1 . . . . . . exit(EXIT_FAILURE); |
| 411 | . . . . . . . . . } |
| 412 | . . . . . . . . . |
| 413 | 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF) |
| 414 | 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table); |
| 415 | . . . . . . . . . |
| 416 | 4 0 0 1 0 0 2 0 0 free(data); |
| 417 | 4 0 0 1 0 0 2 0 0 fclose(file_ptr); |
| 418 | 3 0 0 2 0 0 . . . } |
| 419 | </pre> |
| 420 | |
| 421 | (Although column widths are automatically minimised, a wide terminal is clearly |
| 422 | useful.)<p> |
| 423 | |
| 424 | Each source file is clearly marked (<code>User-annotated source</code>) as |
| 425 | having been chosen manually for annotation. If the file was found in one of |
| 426 | the directories specified with the <code>-I</code>/<code>--include</code> |
| 427 | option, the directory and file are both given.<p> |
| 428 | |
| 429 | Each line is annotated with its event counts. Events not applicable for a line |
| 430 | are represented by a `.'; this is useful for distinguishing between an event |
| 431 | which cannot happen, and one which can but did not.<p> |
| 432 | |
| 433 | Sometimes only a small section of a source file is executed. To minimise |
| 434 | uninteresting output, Valgrind only shows annotated lines and lines within a |
| 435 | small distance of annotated lines. Gaps are marked with the line numbers so |
| 436 | you know which part of a file the shown code comes from, eg: |
| 437 | |
| 438 | <pre> |
| 439 | (figures and code for line 704) |
| 440 | -- line 704 ---------------------------------------- |
| 441 | -- line 878 ---------------------------------------- |
| 442 | (figures and code for line 878) |
| 443 | </pre> |
| 444 | |
| 445 | The amount of context to show around annotated lines is controlled by the |
| 446 | <code>--context</code> option.<p> |
| 447 | |
| 448 | To get automatic annotation, run <code>cg_annotate --auto=yes</code>. |
| 449 | cg_annotate will automatically annotate every source file it can find that is |
| 450 | mentioned in the function-by-function summary. Therefore, the files chosen for |
| 451 | auto-annotation are affected by the <code>--sort</code> and |
| 452 | <code>--threshold</code> options. Each source file is clearly marked |
| 453 | (<code>Auto-annotated source</code>) as being chosen automatically. Any files |
| 454 | that could not be found are mentioned at the end of the output, eg: |
| 455 | |
| 456 | <pre> |
| 457 | -------------------------------------------------------------------------------- |
| 458 | The following files chosen for auto-annotation could not be found: |
| 459 | -------------------------------------------------------------------------------- |
| 460 | getc.c |
| 461 | ctype.c |
| 462 | ../sysdeps/generic/lockfile.c |
| 463 | </pre> |
| 464 | |
| 465 | This is quite common for library files, since libraries are usually compiled |
| 466 | with debugging information, but the source files are often not present on a |
| 467 | system. If a file is chosen for annotation <b>both</b> manually and |
| 468 | automatically, it is marked as <code>User-annotated source</code>. |
| 469 | |
| 470 | Use the <code>-I/--include</code> option to tell Valgrind where to look for |
| 471 | source files if the filenames found from the debugging information aren't |
| 472 | specific enough. |
| 473 | |
| 474 | Beware that cg_annotate can take some time to digest large |
| 475 | <code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also |
| 476 | beware that auto-annotation can produce a lot of output if your program is |
| 477 | large! |
| 478 | |
| 479 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 480 | <h3>4.8 Annotating assembler programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 481 | |
| 482 | Valgrind can annotate assembler programs too, or annotate the |
| 483 | assembler generated for your C program. Sometimes this is useful for |
| 484 | understanding what is really happening when an interesting line of C |
| 485 | code is translated into multiple instructions.<p> |
| 486 | |
| 487 | To do this, you just need to assemble your <code>.s</code> files with |
| 488 | assembler-level debug information. gcc doesn't do this, but you can |
| 489 | use the GNU assembler with the <code>--gstabs</code> option to |
| 490 | generate object files with this information, eg: |
| 491 | |
| 492 | <blockquote><code>as --gstabs foo.s</code></blockquote> |
| 493 | |
| 494 | You can then profile and annotate source files in the same way as for C/C++ |
| 495 | programs. |
| 496 | |
| 497 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 498 | <h3>4.9 <code>cg_annotate</code> options</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 499 | <ul> |
| 500 | <li><code>--<i>pid</i></code></li><p> |
| 501 | |
| 502 | Indicates which <code>cachegrind.out.<i>pid</i></code> file to read. |
| 503 | Not actually an option -- it is required. |
| 504 | |
| 505 | <li><code>-h, --help</code></li><p> |
| 506 | <li><code>-v, --version</code><p> |
| 507 | |
| 508 | Help and version, as usual.</li> |
| 509 | |
| 510 | <li><code>--sort=A,B,C</code> [default: order in |
| 511 | <code>cachegrind.out.<i>pid</i></code>]<p> |
| 512 | Specifies the events upon which the sorting of the function-by-function |
| 513 | entries will be based. Useful if you want to concentrate on eg. I cache |
| 514 | misses (<code>--sort=I1mr,I2mr</code>), or D cache misses |
| 515 | (<code>--sort=D1mr,D2mr</code>), or L2 misses |
| 516 | (<code>--sort=D2mr,I2mr</code>).</li><p> |
| 517 | |
| 518 | <li><code>--show=A,B,C</code> [default: all, using order in |
| 519 | <code>cachegrind.out.<i>pid</i></code>]<p> |
| 520 | Specifies which events to show (and the column order). Default is to use |
| 521 | all present in the <code>cachegrind.out.<i>pid</i></code> file (and use |
| 522 | the order in the file).</li><p> |
| 523 | |
| 524 | <li><code>--threshold=X</code> [default: 99%] <p> |
| 525 | Sets the threshold for the function-by-function summary. Functions are |
| 526 | shown that account for more than X% of the primary sort event. If |
| 527 | auto-annotating, also affects which files are annotated. |
| 528 | |
| 529 | Note: thresholds can be set for more than one of the events by appending |
| 530 | any events for the <code>--sort</code> option with a colon and a number |
| 531 | (no spaces, though). E.g. if you want to see the functions that cover |
| 532 | 99% of L2 read misses and 99% of L2 write misses, use this option: |
| 533 | |
| 534 | <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote> |
| 535 | </li><p> |
| 536 | |
| 537 | <li><code>--auto=no</code> [default]<br> |
| 538 | <code>--auto=yes</code> <p> |
| 539 | When enabled, automatically annotates every file that is mentioned in the |
| 540 | function-by-function summary that can be found. Also gives a list of |
| 541 | those that couldn't be found. |
| 542 | |
| 543 | <li><code>--context=N</code> [default: 8]<p> |
| 544 | Print N lines of context before and after each annotated line. Avoids |
| 545 | printing large sections of source files that were not executed. Use a |
| 546 | large number (eg. 10,000) to show all source lines. |
| 547 | </li><p> |
| 548 | |
| 549 | <li><code>-I=<dir>, --include=<dir></code> |
| 550 | [default: empty string]<p> |
| 551 | Adds a directory to the list in which to search for files. Multiple |
| 552 | -I/--include options can be given to add multiple directories. |
| 553 | </ul> |
| 554 | |
| 555 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 556 | <h3>4.10 Warnings</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 557 | There are a couple of situations in which cg_annotate issues warnings. |
| 558 | |
| 559 | <ul> |
| 560 | <li>If a source file is more recent than the |
| 561 | <code>cachegrind.out.<i>pid</i></code> file. This is because the |
| 562 | information in <code>cachegrind.out.<i>pid</i></code> is only recorded |
| 563 | with line numbers, so if the line numbers change at all in the source |
| 564 | (eg. lines added, deleted, swapped), any annotations will be |
| 565 | incorrect.<p> |
| 566 | |
| 567 | <li>If information is recorded about line numbers past the end of a file. |
| 568 | This can be caused by the above problem, ie. shortening the source file |
| 569 | while using an old <code>cachegrind.out.<i>pid</i></code> file. If this |
| 570 | happens, the figures for the bogus lines are printed anyway (clearly |
| 571 | marked as bogus) in case they are important.</li><p> |
| 572 | </ul> |
| 573 | |
| 574 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 575 | <h3>4.11 Things to watch out for</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 576 | Some odd things that can occur during annotation: |
| 577 | |
| 578 | <ul> |
| 579 | <li>If annotating at the assembler level, you might see something like this: |
| 580 | |
| 581 | <pre> |
| 582 | 1 0 0 . . . . . . leal -12(%ebp),%eax |
| 583 | 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) |
| 584 | 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) |
| 585 | . . . . . . . . . .align 4,0x90 |
| 586 | 1 0 0 . . . . . . movl $.LnrB,%eax |
| 587 | 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) |
| 588 | </pre> |
| 589 | |
| 590 | How can the third instruction be executed twice when the others are |
| 591 | executed only once? As it turns out, it isn't. Here's a dump of the |
| 592 | executable, using <code>objdump -d</code>: |
| 593 | |
| 594 | <pre> |
| 595 | 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax |
| 596 | 8048f28: 89 43 54 mov %eax,0x54(%ebx) |
| 597 | 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) |
| 598 | 8048f32: 89 f6 mov %esi,%esi |
| 599 | 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax |
| 600 | 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) |
| 601 | </pre> |
| 602 | |
| 603 | Notice the extra <code>mov %esi,%esi</code> instruction. Where did this |
| 604 | come from? The GNU assembler inserted it to serve as the two bytes of |
| 605 | padding needed to align the <code>movl $.LnrB,%eax</code> instruction on |
| 606 | a four-byte boundary, but pretended it didn't exist when adding debug |
| 607 | information. Thus when Valgrind reads the debug info it thinks that the |
| 608 | <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address |
| 609 | range 0x8048f2b--0x804833 by itself, and attributes the counts for the |
| 610 | <code>mov %esi,%esi</code> to it.<p> |
| 611 | </li> |
| 612 | |
| 613 | <li>Inlined functions can cause strange results in the function-by-function |
| 614 | summary. If a function <code>inline_me()</code> is defined in |
| 615 | <code>foo.h</code> and inlined in the functions <code>f1()</code>, |
| 616 | <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will |
| 617 | not be a <code>foo.h:inline_me()</code> function entry. Instead, there |
| 618 | will be separate function entries for each inlining site, ie. |
| 619 | <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and |
| 620 | <code>foo.h:f3()</code>. To find the total counts for |
| 621 | <code>foo.h:inline_me()</code>, add up the counts from each entry.<p> |
| 622 | |
| 623 | The reason for this is that although the debug info output by gcc |
| 624 | indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it |
| 625 | doesn't indicate the name of the function in <code>foo.h</code>, so |
| 626 | Valgrind keeps using the old one.<p> |
| 627 | |
| 628 | <li>Sometimes, the same filename might be represented with a relative name |
| 629 | and with an absolute name in different parts of the debug info, eg: |
| 630 | <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this |
| 631 | case, if you use auto-annotation, the file will be annotated twice with |
| 632 | the counts split between the two.<p> |
| 633 | </li> |
| 634 | |
| 635 | <li>Files with more than 65,535 lines cause difficulties for the stabs debug |
| 636 | info reader. This is because the line number in the <code>struct |
| 637 | nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit |
| 638 | value. Valgrind can handle some files with more than 65,535 lines |
| 639 | correctly by making some guesses to identify line number overflows. But |
| 640 | some cases are beyond it, in which case you'll get a warning message |
| 641 | explaining that annotations for the file might be incorrect.<p> |
| 642 | </li> |
| 643 | |
| 644 | <li>If you compile some files with <code>-g</code> and some without, some |
| 645 | events that take place in a file without debug info could be attributed |
| 646 | to the last line of a file with debug info (whichever one gets placed |
| 647 | before the non-debug-info file in the executable).<p> |
| 648 | </li> |
| 649 | </ul> |
| 650 | |
| 651 | This list looks long, but these cases should be fairly rare.<p> |
| 652 | |
| 653 | Note: stabs is not an easy format to read. If you come across bizarre |
| 654 | annotations that look like might be caused by a bug in the stabs reader, |
| 655 | please let us know.<p> |
| 656 | |
| 657 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 658 | <h3>4.12 Accuracy</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 659 | Valgrind's cache profiling has a number of shortcomings: |
| 660 | |
| 661 | <ul> |
| 662 | <li>It doesn't account for kernel activity -- the effect of system calls on |
| 663 | the cache contents is ignored.</li><p> |
| 664 | |
| 665 | <li>It doesn't account for other process activity (although this is probably |
| 666 | desirable when considering a single program).</li><p> |
| 667 | |
| 668 | <li>It doesn't account for virtual-to-physical address mappings; hence the |
| 669 | entire simulation is not a true representation of what's happening in the |
| 670 | cache.</li><p> |
| 671 | |
| 672 | <li>It doesn't account for cache misses not visible at the instruction level, |
| 673 | eg. those arising from TLB misses, or speculative execution.</li><p> |
| 674 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 675 | <li>Valgrind's custom threads implementation will schedule threads |
njn | 3e88418 | 2003-04-15 13:03:23 +0000 | [diff] [blame] | 676 | differently to the standard one. This could warp the results for |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 677 | threaded programs. |
| 678 | </li><p> |
| 679 | |
| 680 | <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code> |
| 681 | will incorrectly be counted as doing a data read if both the arguments |
| 682 | are registers, eg: |
| 683 | |
| 684 | <blockquote><code>btsl %eax, %edx</code></blockquote> |
| 685 | |
| 686 | This should only happen rarely. |
| 687 | </li><p> |
| 688 | |
| 689 | <li>FPU instructions with data sizes of 28 and 108 bytes (e.g. |
| 690 | <code>fsave</code>) are treated as though they only access 16 bytes. |
| 691 | These instructions seem to be rare so hopefully this won't affect |
| 692 | accuracy much. |
| 693 | </li><p> |
| 694 | </ul> |
| 695 | |
| 696 | Another thing worth nothing is that results are very sensitive. Changing the |
| 697 | size of the <code>valgrind.so</code> file, the size of the program being |
| 698 | profiled, or even the length of its name can perturb the results. Variations |
| 699 | will be small, but don't expect perfectly repeatable results if your program |
| 700 | changes at all.<p> |
| 701 | |
| 702 | While these factors mean you shouldn't trust the results to be super-accurate, |
| 703 | hopefully they should be close enough to be useful.<p> |
| 704 | |
| 705 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 706 | <h3>4.13 Todo</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 707 | <ul> |
| 708 | <li>Program start-up/shut-down calls a lot of functions that aren't |
| 709 | interesting and just complicate the output. Would be nice to exclude |
| 710 | these somehow.</li> |
| 711 | <p> |
| 712 | </ul> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 713 | </body> |
| 714 | </html> |
| 715 | |