sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 1 | <html> |
| 2 | <head> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 3 | <title>Cachegrind: a cache-miss profiler</title> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 4 | </head> |
| 5 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 6 | <body> |
| 7 | <a name="cg-top"></a> |
| 8 | <h2>4 <b>Cachegrind</b>: a cache-miss profiler</h2> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 9 | |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 10 | To use this tool, you must specify <code>--tool=cachegrind</code> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 11 | on the Valgrind command line. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 12 | |
| 13 | <p> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 14 | Detailed technical documentation on how Cachegrind works is available |
| 15 | <A HREF="cg_techdocs.html">here</A>. If you want to know how |
| 16 | to <b>use</b> it, you only need to read this page. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 17 | |
| 18 | |
| 19 | <a name="cache"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 20 | <h3>4.1 Cache profiling</h3> |
| 21 | Cachegrind is a tool for doing cache simulations and annotating your source |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 22 | line-by-line with the number of cache misses. In particular, it records: |
| 23 | <ul> |
| 24 | <li>L1 instruction cache reads and misses; |
| 25 | <li>L1 data cache reads and read misses, writes and write misses; |
| 26 | <li>L2 unified cache reads and read misses, writes and writes misses. |
| 27 | </ul> |
| 28 | On a modern x86 machine, an L1 miss will typically cost around 10 cycles, |
| 29 | and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be |
| 30 | very useful for improving the performance of your program.<p> |
| 31 | |
| 32 | Also, since one instruction cache read is performed per instruction executed, |
| 33 | you can find out how many instructions are executed per line, which can be |
| 34 | useful for traditional profiling and test coverage.<p> |
| 35 | |
| 36 | Any feedback, bug-fixes, suggestions, etc, welcome. |
| 37 | |
| 38 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 39 | <h3>4.2 Overview</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 40 | First off, as for normal Valgrind use, you probably want to compile with |
| 41 | debugging info (the <code>-g</code> flag). But by contrast with normal |
| 42 | Valgrind use, you probably <b>do</b> want to turn optimisation on, since you |
| 43 | should profile your program as it will be normally run. |
| 44 | |
| 45 | The two steps are: |
| 46 | <ol> |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 47 | <li>Run your program with <code>valgrind --tool=cachegrind</code> in front of |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 48 | the normal command line invocation. When the program finishes, |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 49 | Cachegrind will print summary cache statistics. It also collects |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 50 | line-by-line information in a file |
| 51 | <code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code> |
| 52 | is the program's process id. |
| 53 | <p> |
| 54 | This step should be done every time you want to collect |
| 55 | information about a new program, a changed program, or about the |
| 56 | same program with different input. |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 57 | </li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 58 | <li>Generate a function-by-function summary, and possibly annotate |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 59 | source files, using the supplied |
| 60 | <code>cg_annotate</code> program. Source files to annotate can be |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 61 | specified manually, or manually on the command line, or |
| 62 | "interesting" source files can be annotated automatically with |
| 63 | the <code>--auto=yes</code> option. You can annotate C/C++ |
| 64 | files or assembly language files equally easily. |
| 65 | <p> |
| 66 | This step can be performed as many times as you like for each |
| 67 | Step 2. You may want to do multiple annotations showing |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 68 | different information each time. |
| 69 | </li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 70 | </ol> |
| 71 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 72 | The steps are described in detail in the following sections. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 73 | |
| 74 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 75 | <h3>4.3 Cache simulation specifics</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 76 | |
| 77 | Cachegrind uses a simulation for a machine with a split L1 cache and a unified |
| 78 | L2 cache. This configuration is used for all (modern) x86-based machines we |
| 79 | are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are |
| 80 | ancient history now.<p> |
| 81 | |
| 82 | The more specific characteristics of the simulation are as follows. |
| 83 | |
| 84 | <ul> |
| 85 | <li>Write-allocate: when a write miss occurs, the block written to |
| 86 | is brought into the D1 cache. Most modern caches have this |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 87 | property.<p> |
| 88 | </li> |
| 89 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 90 | <li>Bit-selection hash function: the line(s) in the cache to which a |
| 91 | memory block maps is chosen by the middle bits M--(M+N-1) of the |
| 92 | byte address, where: |
| 93 | <ul> |
| 94 | <li> line size = 2^M bytes </li> |
| 95 | <li>(cache size / line size) = 2^N bytes</li> |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 96 | </ul> |
| 97 | </li> |
| 98 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 99 | <li>Inclusive L2 cache: the L2 cache replicates all the entries of |
| 100 | the L1 cache. This is standard on Pentium chips, but AMD |
| 101 | Athlons use an exclusive L2 cache that only holds blocks evicted |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 102 | from L1. Ditto AMD Durons and most modern VIAs.</li> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 103 | </ul> |
| 104 | |
| 105 | The cache configuration simulated (cache size, associativity and line size) is |
| 106 | determined automagically using the CPUID instruction. If you have an old |
| 107 | machine that (a) doesn't support the CPUID instruction, or (b) supports it in |
| 108 | an early incarnation that doesn't give any cache information, then Cachegrind |
| 109 | will fall back to using a default configuration (that of a model 3/4 Athlon). |
| 110 | Cachegrind will tell you if this happens. You can manually specify one, two or |
| 111 | all three levels (I1/D1/L2) of the cache from the command line using the |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 112 | <code>--I1</code>, <code>--D1</code> and <code>--L2</code> options. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 113 | |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 114 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 115 | Other noteworthy behaviour: |
| 116 | |
| 117 | <ul> |
| 118 | <li>References that straddle two cache lines are treated as follows: |
| 119 | <ul> |
| 120 | <li>If both blocks hit --> counted as one hit</li> |
| 121 | <li>If one block hits, the other misses --> counted as one miss</li> |
| 122 | <li>If both blocks miss --> counted as one miss (not two)</li> |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 123 | </ul> |
| 124 | </li> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 125 | |
| 126 | <li>Instructions that modify a memory location (eg. <code>inc</code> and |
| 127 | <code>dec</code>) are counted as doing just a read, ie. a single data |
| 128 | reference. This may seem strange, but since the write can never cause a |
| 129 | miss (the read guarantees the block is in the cache) it's not very |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 130 | interesting. |
| 131 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 132 | Thus it measures not the number of times the data cache is accessed, but |
| 133 | the number of times a data cache miss could occur.<p> |
| 134 | </li> |
| 135 | </ul> |
| 136 | |
| 137 | If you are interested in simulating a cache with different properties, it is |
| 138 | not particularly hard to write your own cache simulator, or to modify the |
| 139 | existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>, |
| 140 | <code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be |
| 141 | interested to hear from anyone who does. |
| 142 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 143 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 144 | <a name="profile"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 145 | <h3>4.4 Profiling programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 146 | |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 147 | To gather cache profiling information about the program <code>ls -l</code>, |
| 148 | invoke Cachegrind like this: |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 149 | |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 150 | <blockquote><code>valgrind --tool=cachegrind ls -l</code></blockquote> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 151 | |
| 152 | The program will execute (slowly). Upon completion, summary statistics |
| 153 | that look like this will be printed: |
| 154 | |
| 155 | <pre> |
| 156 | ==31751== I refs: 27,742,716 |
| 157 | ==31751== I1 misses: 276 |
| 158 | ==31751== L2 misses: 275 |
| 159 | ==31751== I1 miss rate: 0.0% |
| 160 | ==31751== L2i miss rate: 0.0% |
| 161 | ==31751== |
| 162 | ==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr) |
| 163 | ==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr) |
| 164 | ==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr) |
| 165 | ==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%) |
| 166 | ==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%) |
| 167 | ==31751== |
| 168 | ==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr) |
| 169 | ==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%) |
| 170 | </pre> |
| 171 | |
| 172 | Cache accesses for instruction fetches are summarised first, giving the |
| 173 | number of fetches made (this is the number of instructions executed, which |
| 174 | can be useful to know in its own right), the number of I1 misses, and the |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 175 | number of L2 instruction (<code>L2i</code>) misses. |
| 176 | <p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 177 | Cache accesses for data follow. The information is similar to that of the |
| 178 | instruction fetches, except that the values are also shown split between reads |
| 179 | and writes (note each row's <code>rd</code> and <code>wr</code> values add up |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 180 | to the row's total). |
| 181 | <p> |
| 182 | Combined instruction and data figures for the L2 cache follow that. |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 183 | |
| 184 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 185 | <h3>4.5 Output file</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 186 | |
| 187 | As well as printing summary information, Cachegrind also writes |
| 188 | line-by-line cache profiling information to a file named |
| 189 | <code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is |
| 190 | best interpreted by the accompanying program <code>cg_annotate</code>, |
| 191 | described in the next section. |
| 192 | <p> |
| 193 | Things to note about the <code>cachegrind.out.<i>pid</i></code> file: |
| 194 | <ul> |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 195 | <li>It is written every time Cachegrind |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 196 | is run, and will overwrite any existing |
| 197 | <code>cachegrind.out.<i>pid</i></code> in the current directory (but |
| 198 | that won't happen very often because it takes some time for process ids |
njn | 3e87f7e | 2003-04-08 11:08:45 +0000 | [diff] [blame] | 199 | to be recycled).</li><p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 200 | <li>It can be huge: <code>ls -l</code> generates a file of about |
| 201 | 350KB. Browsing a few files and web pages with a Konqueror |
| 202 | built with full debugging information generates a file |
| 203 | of around 15 MB.</li> |
| 204 | </ul> |
| 205 | |
| 206 | Note that older versions of Cachegrind used a log file named |
| 207 | <code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix). |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 208 | The suffix serves two purposes. Firstly, it means you don't have to |
| 209 | rename old log files that you don't want to overwrite. Secondly, and |
| 210 | more importantly, it allows correct profiling with the |
| 211 | <code>--trace-children=yes</code> option of programs that spawn child |
| 212 | processes. |
| 213 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 214 | |
| 215 | <a name="profileflags"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 216 | <h3>4.6 Cachegrind options</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 217 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 218 | Cache-simulation specific options are: |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 219 | |
| 220 | <ul> |
| 221 | <li><code>--I1=<size>,<associativity>,<line_size></code><br> |
| 222 | <code>--D1=<size>,<associativity>,<line_size></code><br> |
| 223 | <code>--L2=<size>,<associativity>,<line_size></code><p> |
| 224 | [default: uses CPUID for automagic cache configuration]<p> |
| 225 | |
| 226 | Manually specifies the I1/D1/L2 cache configuration, where |
| 227 | <code>size</code> and <code>line_size</code> are measured in bytes. The |
| 228 | three items must be comma-separated, but with no spaces, eg: |
| 229 | |
| 230 | <blockquote> |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 231 | <code>valgrind --tool=cachegrind --I1=65535,2,64</code> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 232 | </blockquote> |
| 233 | |
| 234 | You can specify one, two or three of the I1/D1/L2 caches. Any level not |
| 235 | manually specified will be simulated using the configuration found in the |
| 236 | normal way (via the CPUID instruction, or failing that, via defaults). |
| 237 | </ul> |
| 238 | |
| 239 | |
| 240 | <a name="annotate"></a> |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 241 | <h3>4.7 Annotating C/C++ programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 242 | |
| 243 | Before using <code>cg_annotate</code>, it is worth widening your |
| 244 | window to be at least 120-characters wide if possible, as the output |
| 245 | lines can be quite long. |
| 246 | <p> |
| 247 | To get a function-by-function summary, run <code>cg_annotate |
| 248 | --<i>pid</i></code> in a directory containing a |
| 249 | <code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code> |
| 250 | is required so that <code>cg_annotate</code> knows which log file to use when |
| 251 | several are present. |
| 252 | <p> |
| 253 | The output looks like this: |
| 254 | |
| 255 | <pre> |
| 256 | -------------------------------------------------------------------------------- |
| 257 | I1 cache: 65536 B, 64 B, 2-way associative |
| 258 | D1 cache: 65536 B, 64 B, 2-way associative |
| 259 | L2 cache: 262144 B, 64 B, 8-way associative |
| 260 | Command: concord vg_to_ucode.c |
| 261 | Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 262 | Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 263 | Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 264 | Threshold: 99% |
| 265 | Chosen for annotation: |
| 266 | Auto-annotation: on |
| 267 | |
| 268 | -------------------------------------------------------------------------------- |
| 269 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 270 | -------------------------------------------------------------------------------- |
| 271 | 27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS |
| 272 | |
| 273 | -------------------------------------------------------------------------------- |
| 274 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function |
| 275 | -------------------------------------------------------------------------------- |
| 276 | 8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc |
| 277 | 5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word |
| 278 | 2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp |
| 279 | 2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash |
| 280 | 2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower |
| 281 | 1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert |
| 282 | 897,991 51 51 897,831 95 30 62 1 1 ???:??? |
| 283 | 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile |
| 284 | 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile |
| 285 | 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc |
| 286 | 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing |
| 287 | 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER |
| 288 | 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table |
| 289 | 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create |
| 290 | 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 |
| 291 | 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 |
| 292 | 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node |
| 293 | 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue |
| 294 | </pre> |
| 295 | |
| 296 | First up is a summary of the annotation options: |
| 297 | |
| 298 | <ul> |
| 299 | <li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the |
| 300 | configuration with which these results were obtained.</li><p> |
| 301 | |
| 302 | <li>Command: the command line invocation of the program under |
| 303 | examination.</li><p> |
| 304 | |
| 305 | <li>Events recorded: event abbreviations are:<p> |
| 306 | <ul> |
| 307 | <li><code>Ir </code>: I cache reads (ie. instructions executed)</li> |
| 308 | <li><code>I1mr</code>: I1 cache read misses</li> |
| 309 | <li><code>I2mr</code>: L2 cache instruction read misses</li> |
| 310 | <li><code>Dr </code>: D cache reads (ie. memory reads)</li> |
| 311 | <li><code>D1mr</code>: D1 cache read misses</li> |
| 312 | <li><code>D2mr</code>: L2 cache data read misses</li> |
| 313 | <li><code>Dw </code>: D cache writes (ie. memory writes)</li> |
| 314 | <li><code>D1mw</code>: D1 cache write misses</li> |
| 315 | <li><code>D2mw</code>: L2 cache data write misses</li> |
| 316 | </ul><p> |
| 317 | Note that D1 total accesses is given by <code>D1mr</code> + |
| 318 | <code>D1mw</code>, and that L2 total accesses is given by |
| 319 | <code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p> |
| 320 | |
| 321 | <li>Events shown: the events shown (a subset of events gathered). This can |
| 322 | be adjusted with the <code>--show</code> option.</li><p> |
| 323 | |
| 324 | <li>Event sort order: the sort order in which functions are shown. For |
| 325 | example, in this case the functions are sorted from highest |
| 326 | <code>Ir</code> counts to lowest. If two functions have identical |
| 327 | <code>Ir</code> counts, they will then be sorted by <code>I1mr</code> |
| 328 | counts, and so on. This order can be adjusted with the |
| 329 | <code>--sort</code> option.<p> |
| 330 | |
| 331 | Note that this dictates the order the functions appear. It is <b>not</b> |
| 332 | the order in which the columns appear; that is dictated by the "events |
| 333 | shown" line (and can be changed with the <code>--show</code> option). |
| 334 | </li><p> |
| 335 | |
| 336 | <li>Threshold: <code>cg_annotate</code> by default omits functions |
| 337 | that cause very low numbers of misses to avoid drowning you in |
| 338 | information. In this case, cg_annotate shows summaries the |
| 339 | functions that account for 99% of the <code>Ir</code> counts; |
| 340 | <code>Ir</code> is chosen as the threshold event since it is the |
| 341 | primary sort event. The threshold can be adjusted with the |
| 342 | <code>--threshold</code> option.</li><p> |
| 343 | |
| 344 | <li>Chosen for annotation: names of files specified manually for annotation; |
| 345 | in this case none.</li><p> |
| 346 | |
| 347 | <li>Auto-annotation: whether auto-annotation was requested via the |
| 348 | <code>--auto=yes</code> option. In this case no.</li><p> |
| 349 | </ul> |
| 350 | |
| 351 | Then follows summary statistics for the whole program. These are similar |
nethercote | 137bc55 | 2003-11-14 17:47:54 +0000 | [diff] [blame] | 352 | to the summary provided when running <code>valgrind --tool=cachegrind</code>.<p> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 353 | |
| 354 | Then follows function-by-function statistics. Each function is |
| 355 | identified by a <code>file_name:function_name</code> pair. If a column |
| 356 | contains only a dot it means the function never performs |
| 357 | that event (eg. the third row shows that <code>strcmp()</code> |
| 358 | contains no instructions that write to memory). The name |
| 359 | <code>???</code> is used if the the file name and/or function name |
| 360 | could not be determined from debugging information. If most of the |
| 361 | entries have the form <code>???:???</code> the program probably wasn't |
| 362 | compiled with <code>-g</code>. If any code was invalidated (either due to |
| 363 | self-modifying code or unloading of shared objects) its counts are aggregated |
| 364 | into a single cost centre written as <code>(discarded):(discarded)</code>.<p> |
| 365 | |
| 366 | It is worth noting that functions will come from three types of source files: |
| 367 | <ol> |
| 368 | <li> From the profiled program (<code>concord.c</code> in this example).</li> |
| 369 | <li>From libraries (eg. <code>getc.c</code>)</li> |
| 370 | <li>From Valgrind's implementation of some libc functions (eg. |
| 371 | <code>vg_clientmalloc.c:malloc</code>). These are recognisable because |
| 372 | the filename begins with <code>vg_</code>, and is probably one of |
| 373 | <code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or |
| 374 | <code>vg_mylibc.c</code>. |
| 375 | </li> |
| 376 | </ol> |
| 377 | |
| 378 | There are two ways to annotate source files -- by choosing them |
| 379 | manually, or with the <code>--auto=yes</code> option. To do it |
| 380 | manually, just specify the filenames as arguments to |
| 381 | <code>cg_annotate</code>. For example, the output from running |
| 382 | <code>cg_annotate concord.c</code> for our example produces the same |
| 383 | output as above followed by an annotated version of |
| 384 | <code>concord.c</code>, a section of which looks like: |
| 385 | |
| 386 | <pre> |
| 387 | -------------------------------------------------------------------------------- |
| 388 | -- User-annotated source: concord.c |
| 389 | -------------------------------------------------------------------------------- |
| 390 | Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw |
| 391 | |
| 392 | [snip] |
| 393 | |
| 394 | . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[]) |
| 395 | 3 1 1 . . . 1 0 0 { |
| 396 | . . . . . . . . . FILE *file_ptr; |
| 397 | . . . . . . . . . Word_Info *data; |
| 398 | 1 0 0 . . . 1 1 1 int line = 1, i; |
| 399 | . . . . . . . . . |
| 400 | 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info)); |
| 401 | . . . . . . . . . |
| 402 | 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++) |
| 403 | 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL; |
| 404 | . . . . . . . . . |
| 405 | . . . . . . . . . /* Open file, check it. */ |
| 406 | 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r"); |
| 407 | 2 0 0 1 0 0 . . . if (!(file_ptr)) { |
| 408 | . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name); |
| 409 | 1 1 1 . . . . . . exit(EXIT_FAILURE); |
| 410 | . . . . . . . . . } |
| 411 | . . . . . . . . . |
| 412 | 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF) |
| 413 | 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table); |
| 414 | . . . . . . . . . |
| 415 | 4 0 0 1 0 0 2 0 0 free(data); |
| 416 | 4 0 0 1 0 0 2 0 0 fclose(file_ptr); |
| 417 | 3 0 0 2 0 0 . . . } |
| 418 | </pre> |
| 419 | |
| 420 | (Although column widths are automatically minimised, a wide terminal is clearly |
| 421 | useful.)<p> |
| 422 | |
| 423 | Each source file is clearly marked (<code>User-annotated source</code>) as |
| 424 | having been chosen manually for annotation. If the file was found in one of |
| 425 | the directories specified with the <code>-I</code>/<code>--include</code> |
| 426 | option, the directory and file are both given.<p> |
| 427 | |
| 428 | Each line is annotated with its event counts. Events not applicable for a line |
| 429 | are represented by a `.'; this is useful for distinguishing between an event |
| 430 | which cannot happen, and one which can but did not.<p> |
| 431 | |
| 432 | Sometimes only a small section of a source file is executed. To minimise |
| 433 | uninteresting output, Valgrind only shows annotated lines and lines within a |
| 434 | small distance of annotated lines. Gaps are marked with the line numbers so |
| 435 | you know which part of a file the shown code comes from, eg: |
| 436 | |
| 437 | <pre> |
| 438 | (figures and code for line 704) |
| 439 | -- line 704 ---------------------------------------- |
| 440 | -- line 878 ---------------------------------------- |
| 441 | (figures and code for line 878) |
| 442 | </pre> |
| 443 | |
| 444 | The amount of context to show around annotated lines is controlled by the |
| 445 | <code>--context</code> option.<p> |
| 446 | |
| 447 | To get automatic annotation, run <code>cg_annotate --auto=yes</code>. |
| 448 | cg_annotate will automatically annotate every source file it can find that is |
| 449 | mentioned in the function-by-function summary. Therefore, the files chosen for |
| 450 | auto-annotation are affected by the <code>--sort</code> and |
| 451 | <code>--threshold</code> options. Each source file is clearly marked |
| 452 | (<code>Auto-annotated source</code>) as being chosen automatically. Any files |
| 453 | that could not be found are mentioned at the end of the output, eg: |
| 454 | |
| 455 | <pre> |
| 456 | -------------------------------------------------------------------------------- |
| 457 | The following files chosen for auto-annotation could not be found: |
| 458 | -------------------------------------------------------------------------------- |
| 459 | getc.c |
| 460 | ctype.c |
| 461 | ../sysdeps/generic/lockfile.c |
| 462 | </pre> |
| 463 | |
| 464 | This is quite common for library files, since libraries are usually compiled |
| 465 | with debugging information, but the source files are often not present on a |
| 466 | system. If a file is chosen for annotation <b>both</b> manually and |
| 467 | automatically, it is marked as <code>User-annotated source</code>. |
| 468 | |
| 469 | Use the <code>-I/--include</code> option to tell Valgrind where to look for |
| 470 | source files if the filenames found from the debugging information aren't |
| 471 | specific enough. |
| 472 | |
| 473 | Beware that cg_annotate can take some time to digest large |
| 474 | <code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also |
| 475 | beware that auto-annotation can produce a lot of output if your program is |
| 476 | large! |
| 477 | |
| 478 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 479 | <h3>4.8 Annotating assembler programs</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 480 | |
| 481 | Valgrind can annotate assembler programs too, or annotate the |
| 482 | assembler generated for your C program. Sometimes this is useful for |
| 483 | understanding what is really happening when an interesting line of C |
| 484 | code is translated into multiple instructions.<p> |
| 485 | |
| 486 | To do this, you just need to assemble your <code>.s</code> files with |
| 487 | assembler-level debug information. gcc doesn't do this, but you can |
| 488 | use the GNU assembler with the <code>--gstabs</code> option to |
| 489 | generate object files with this information, eg: |
| 490 | |
| 491 | <blockquote><code>as --gstabs foo.s</code></blockquote> |
| 492 | |
| 493 | You can then profile and annotate source files in the same way as for C/C++ |
| 494 | programs. |
| 495 | |
| 496 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 497 | <h3>4.9 <code>cg_annotate</code> options</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 498 | <ul> |
| 499 | <li><code>--<i>pid</i></code></li><p> |
| 500 | |
| 501 | Indicates which <code>cachegrind.out.<i>pid</i></code> file to read. |
| 502 | Not actually an option -- it is required. |
| 503 | |
| 504 | <li><code>-h, --help</code></li><p> |
| 505 | <li><code>-v, --version</code><p> |
| 506 | |
| 507 | Help and version, as usual.</li> |
| 508 | |
| 509 | <li><code>--sort=A,B,C</code> [default: order in |
| 510 | <code>cachegrind.out.<i>pid</i></code>]<p> |
| 511 | Specifies the events upon which the sorting of the function-by-function |
| 512 | entries will be based. Useful if you want to concentrate on eg. I cache |
| 513 | misses (<code>--sort=I1mr,I2mr</code>), or D cache misses |
| 514 | (<code>--sort=D1mr,D2mr</code>), or L2 misses |
| 515 | (<code>--sort=D2mr,I2mr</code>).</li><p> |
| 516 | |
| 517 | <li><code>--show=A,B,C</code> [default: all, using order in |
| 518 | <code>cachegrind.out.<i>pid</i></code>]<p> |
| 519 | Specifies which events to show (and the column order). Default is to use |
| 520 | all present in the <code>cachegrind.out.<i>pid</i></code> file (and use |
| 521 | the order in the file).</li><p> |
| 522 | |
| 523 | <li><code>--threshold=X</code> [default: 99%] <p> |
| 524 | Sets the threshold for the function-by-function summary. Functions are |
| 525 | shown that account for more than X% of the primary sort event. If |
| 526 | auto-annotating, also affects which files are annotated. |
| 527 | |
| 528 | Note: thresholds can be set for more than one of the events by appending |
| 529 | any events for the <code>--sort</code> option with a colon and a number |
| 530 | (no spaces, though). E.g. if you want to see the functions that cover |
| 531 | 99% of L2 read misses and 99% of L2 write misses, use this option: |
| 532 | |
| 533 | <blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote> |
| 534 | </li><p> |
| 535 | |
| 536 | <li><code>--auto=no</code> [default]<br> |
| 537 | <code>--auto=yes</code> <p> |
| 538 | When enabled, automatically annotates every file that is mentioned in the |
| 539 | function-by-function summary that can be found. Also gives a list of |
| 540 | those that couldn't be found. |
| 541 | |
| 542 | <li><code>--context=N</code> [default: 8]<p> |
| 543 | Print N lines of context before and after each annotated line. Avoids |
| 544 | printing large sections of source files that were not executed. Use a |
| 545 | large number (eg. 10,000) to show all source lines. |
| 546 | </li><p> |
| 547 | |
| 548 | <li><code>-I=<dir>, --include=<dir></code> |
| 549 | [default: empty string]<p> |
| 550 | Adds a directory to the list in which to search for files. Multiple |
| 551 | -I/--include options can be given to add multiple directories. |
| 552 | </ul> |
| 553 | |
| 554 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 555 | <h3>4.10 Warnings</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 556 | There are a couple of situations in which cg_annotate issues warnings. |
| 557 | |
| 558 | <ul> |
| 559 | <li>If a source file is more recent than the |
| 560 | <code>cachegrind.out.<i>pid</i></code> file. This is because the |
| 561 | information in <code>cachegrind.out.<i>pid</i></code> is only recorded |
| 562 | with line numbers, so if the line numbers change at all in the source |
| 563 | (eg. lines added, deleted, swapped), any annotations will be |
| 564 | incorrect.<p> |
| 565 | |
| 566 | <li>If information is recorded about line numbers past the end of a file. |
| 567 | This can be caused by the above problem, ie. shortening the source file |
| 568 | while using an old <code>cachegrind.out.<i>pid</i></code> file. If this |
| 569 | happens, the figures for the bogus lines are printed anyway (clearly |
| 570 | marked as bogus) in case they are important.</li><p> |
| 571 | </ul> |
| 572 | |
| 573 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 574 | <h3>4.11 Things to watch out for</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 575 | Some odd things that can occur during annotation: |
| 576 | |
| 577 | <ul> |
| 578 | <li>If annotating at the assembler level, you might see something like this: |
| 579 | |
| 580 | <pre> |
| 581 | 1 0 0 . . . . . . leal -12(%ebp),%eax |
| 582 | 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) |
| 583 | 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) |
| 584 | . . . . . . . . . .align 4,0x90 |
| 585 | 1 0 0 . . . . . . movl $.LnrB,%eax |
| 586 | 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) |
| 587 | </pre> |
| 588 | |
| 589 | How can the third instruction be executed twice when the others are |
| 590 | executed only once? As it turns out, it isn't. Here's a dump of the |
| 591 | executable, using <code>objdump -d</code>: |
| 592 | |
| 593 | <pre> |
| 594 | 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax |
| 595 | 8048f28: 89 43 54 mov %eax,0x54(%ebx) |
| 596 | 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) |
| 597 | 8048f32: 89 f6 mov %esi,%esi |
| 598 | 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax |
| 599 | 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) |
| 600 | </pre> |
| 601 | |
| 602 | Notice the extra <code>mov %esi,%esi</code> instruction. Where did this |
| 603 | come from? The GNU assembler inserted it to serve as the two bytes of |
| 604 | padding needed to align the <code>movl $.LnrB,%eax</code> instruction on |
| 605 | a four-byte boundary, but pretended it didn't exist when adding debug |
| 606 | information. Thus when Valgrind reads the debug info it thinks that the |
| 607 | <code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address |
| 608 | range 0x8048f2b--0x804833 by itself, and attributes the counts for the |
| 609 | <code>mov %esi,%esi</code> to it.<p> |
| 610 | </li> |
| 611 | |
| 612 | <li>Inlined functions can cause strange results in the function-by-function |
| 613 | summary. If a function <code>inline_me()</code> is defined in |
| 614 | <code>foo.h</code> and inlined in the functions <code>f1()</code>, |
| 615 | <code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will |
| 616 | not be a <code>foo.h:inline_me()</code> function entry. Instead, there |
| 617 | will be separate function entries for each inlining site, ie. |
| 618 | <code>foo.h:f1()</code>, <code>foo.h:f2()</code> and |
| 619 | <code>foo.h:f3()</code>. To find the total counts for |
| 620 | <code>foo.h:inline_me()</code>, add up the counts from each entry.<p> |
| 621 | |
| 622 | The reason for this is that although the debug info output by gcc |
| 623 | indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it |
| 624 | doesn't indicate the name of the function in <code>foo.h</code>, so |
| 625 | Valgrind keeps using the old one.<p> |
| 626 | |
| 627 | <li>Sometimes, the same filename might be represented with a relative name |
| 628 | and with an absolute name in different parts of the debug info, eg: |
| 629 | <code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this |
| 630 | case, if you use auto-annotation, the file will be annotated twice with |
| 631 | the counts split between the two.<p> |
| 632 | </li> |
| 633 | |
| 634 | <li>Files with more than 65,535 lines cause difficulties for the stabs debug |
| 635 | info reader. This is because the line number in the <code>struct |
| 636 | nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit |
| 637 | value. Valgrind can handle some files with more than 65,535 lines |
| 638 | correctly by making some guesses to identify line number overflows. But |
| 639 | some cases are beyond it, in which case you'll get a warning message |
| 640 | explaining that annotations for the file might be incorrect.<p> |
| 641 | </li> |
| 642 | |
| 643 | <li>If you compile some files with <code>-g</code> and some without, some |
| 644 | events that take place in a file without debug info could be attributed |
| 645 | to the last line of a file with debug info (whichever one gets placed |
| 646 | before the non-debug-info file in the executable).<p> |
| 647 | </li> |
| 648 | </ul> |
| 649 | |
| 650 | This list looks long, but these cases should be fairly rare.<p> |
| 651 | |
| 652 | Note: stabs is not an easy format to read. If you come across bizarre |
| 653 | annotations that look like might be caused by a bug in the stabs reader, |
| 654 | please let us know.<p> |
| 655 | |
| 656 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 657 | <h3>4.12 Accuracy</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 658 | Valgrind's cache profiling has a number of shortcomings: |
| 659 | |
| 660 | <ul> |
| 661 | <li>It doesn't account for kernel activity -- the effect of system calls on |
| 662 | the cache contents is ignored.</li><p> |
| 663 | |
| 664 | <li>It doesn't account for other process activity (although this is probably |
| 665 | desirable when considering a single program).</li><p> |
| 666 | |
| 667 | <li>It doesn't account for virtual-to-physical address mappings; hence the |
| 668 | entire simulation is not a true representation of what's happening in the |
| 669 | cache.</li><p> |
| 670 | |
| 671 | <li>It doesn't account for cache misses not visible at the instruction level, |
| 672 | eg. those arising from TLB misses, or speculative execution.</li><p> |
| 673 | |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 674 | <li>Valgrind's custom threads implementation will schedule threads |
njn | 3e88418 | 2003-04-15 13:03:23 +0000 | [diff] [blame] | 675 | differently to the standard one. This could warp the results for |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 676 | threaded programs. |
| 677 | </li><p> |
| 678 | |
| 679 | <li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code> |
| 680 | will incorrectly be counted as doing a data read if both the arguments |
| 681 | are registers, eg: |
| 682 | |
| 683 | <blockquote><code>btsl %eax, %edx</code></blockquote> |
| 684 | |
| 685 | This should only happen rarely. |
| 686 | </li><p> |
| 687 | |
| 688 | <li>FPU instructions with data sizes of 28 and 108 bytes (e.g. |
| 689 | <code>fsave</code>) are treated as though they only access 16 bytes. |
| 690 | These instructions seem to be rare so hopefully this won't affect |
| 691 | accuracy much. |
| 692 | </li><p> |
| 693 | </ul> |
| 694 | |
| 695 | Another thing worth nothing is that results are very sensitive. Changing the |
| 696 | size of the <code>valgrind.so</code> file, the size of the program being |
| 697 | profiled, or even the length of its name can perturb the results. Variations |
| 698 | will be small, but don't expect perfectly repeatable results if your program |
| 699 | changes at all.<p> |
| 700 | |
| 701 | While these factors mean you shouldn't trust the results to be super-accurate, |
| 702 | hopefully they should be close enough to be useful.<p> |
| 703 | |
| 704 | |
sewardj | f555ac7 | 2002-11-18 00:07:28 +0000 | [diff] [blame] | 705 | <h3>4.13 Todo</h3> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 706 | <ul> |
| 707 | <li>Program start-up/shut-down calls a lot of functions that aren't |
| 708 | interesting and just complicate the output. Would be nice to exclude |
| 709 | these somehow.</li> |
| 710 | <p> |
| 711 | </ul> |
sewardj | a9a2dcf | 2002-11-11 00:20:07 +0000 | [diff] [blame] | 712 | </body> |
| 713 | </html> |
| 714 | |