| <?xml version="1.0"?> <!-- -*- sgml -*- --> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
| "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" |
| [ <!ENTITY % cl-entities SYSTEM "cl-entities.xml"> %cl-entities; ]> |
| |
| <chapter id="cl-manual" xreflabel="Callgrind Manual"> |
| <title>Callgrind: a heavyweight profiler</title> |
| |
| |
| <sect1 id="cl-manual.use" xreflabel="Overview"> |
| <title>Overview</title> |
| |
| <para>Callgrind is a profiling tool that can |
| construct a call graph for a program's run. |
| By default, the collected data consists of |
| the number of instructions executed, their relationship |
| to source lines, the caller/callee relationship between functions, |
| and the numbers of such calls. |
| Optionally, a cache simulator (similar to Cachegrind) can produce |
| further information about the memory access behavior of the application. |
| </para> |
| |
| <para>The profile data is written out to a file at program |
| termination. For presentation of the data, and interactive control |
| of the profiling, two command line tools are provided:</para> |
| <variablelist> |
| <varlistentry> |
| <term><command>callgrind_annotate</command></term> |
| <listitem> |
| <para>This command reads in the profile data, and prints a |
| sorted list of functions, optionally with source annotation.</para> |
| <!-- |
| <para>You can read the manpage here: <xref |
| linkend="callgrind-annotate"/>.</para> |
| --> |
| <para>For graphical visualization of the data, try |
| <ulink url="&cl-gui;">KCachegrind</ulink>, which is a KDE/Qt based |
| GUI that makes it easy to navigate the large amount of data that |
| Callgrind produces.</para> |
| |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term><command>callgrind_control</command></term> |
| <listitem> |
| <para>This command enables you to interactively observe and control |
| the status of a currently running application without stopping it. |
| You can get statistics as well as the current stack trace, and |
| you can request zeroing of counters or dumping of profile data.</para> |
| <!-- |
| <para>You can read the manpage here: <xref linkend="callgrind-control"/>.</para> |
| --> |
| </listitem> |
| </varlistentry> |
| </variablelist> |
| |
| <para>To use Callgrind, you must specify |
| <computeroutput>--tool=callgrind</computeroutput> on the Valgrind |
| command line.</para> |
| |
| <sect2 id="cl-manual.functionality" xreflabel="Functionality"> |
| <title>Functionality</title> |
| |
| <para>Cachegrind collects flat profile data: event counts (data reads, |
| cache misses, etc.) are attributed directly to the function they |
| occurred in. This simple cost attribution mechanism is sometimes |
| called <emphasis>self</emphasis> or <emphasis>exclusive</emphasis> |
| attribution.</para> |
| |
| <para>Callgrind extends this functionality by propagating costs |
| across function call boundaries. If function <code>foo</code> calls |
| <code>bar</code>, the costs from <code>bar</code> are added into |
| <code>foo</code>'s costs. When applied to the program as a whole, |
| this builds up a picture of so-called <emphasis>inclusive</emphasis> |
| costs, that is, where the cost of each function includes the costs of |
| all functions it called, directly or indirectly.</para> |
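<para>The following minimal sketch in C computes inclusive costs from
self costs for a small, hypothetical call tree. In this tree,
<code>main</code> calls <code>foo</code> and <code>baz</code>, and
<code>foo</code> calls <code>bar</code>. The function names and cost
numbers are invented for illustration, and each function has exactly
one caller. The recursion assumes that the call graph contains no
cycles. The sketch only illustrates the definition above; it is not
how Callgrind computes its numbers.</para>
<screen><![CDATA[
```c
#include <stdio.h>

/* Hypothetical call tree: main calls foo and baz, foo calls bar. */
enum { MAIN, FOO, BAR, BAZ, NFUNCS };

static const char *const names[NFUNCS] = { "main", "foo", "bar", "baz" };

/* Self (exclusive) cost of each function, as made-up event counts. */
static const long self_cost[NFUNCS] = { 10, 20, 50, 5 };

/* callees[f] lists the functions called by f, terminated by -1. */
static const int callees[NFUNCS][NFUNCS + 1] = {
    [MAIN] = { FOO, BAZ, -1 },
    [FOO]  = { BAR, -1 },
    [BAR]  = { -1 },
    [BAZ]  = { -1 },
};

/* Inclusive cost = self cost + inclusive cost of every callee.
   Only well defined when the call graph has no cycles. */
long inclusive_cost(int f)
{
    long total = self_cost[f];
    for (int i = 0; callees[f][i] != -1; i++)
        total += inclusive_cost(callees[f][i]);
    return total;
}

/* Print self and inclusive cost side by side for every function. */
void print_costs(void)
{
    for (int f = 0; f < NFUNCS; f++)
        printf("%-5s self=%3ld inclusive=%3ld\n",
               names[f], self_cost[f], inclusive_cost(f));
}
```
]]></screen>
<para>With these made-up numbers, <code>bar</code> has an inclusive
cost of 50, <code>foo</code> 70, and <code>main</code> 85. The cost
of <code>main</code> equals the total cost here only because the toy
model has no work outside <code>main</code>.</para>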
| |
| <para>As an example, the inclusive cost of |
| <computeroutput>main</computeroutput> should be almost 100 percent |
| of the total program cost. Because of costs arising before |
| <computeroutput>main</computeroutput> is run, such as |
| initialization of the run time linker and construction of global C++ |
| objects, the inclusive cost of <computeroutput>main</computeroutput> |
| is not exactly 100 percent of the total program cost.</para> |
| |
| <para>Together with the call graph, this allows you to find the |
| specific call chains starting from |
| <computeroutput>main</computeroutput> in which the majority of the |
| program's costs occur. Caller/callee cost attribution is also useful |
| for profiling functions called from multiple call sites, and where |
| optimization opportunities depend on changing code in the callers, in |
| particular by reducing the call count.</para> |
| |
| <para>Callgrind's cache simulation is based on the |
| <ulink url="&cg-tool-url;">Cachegrind tool</ulink>. Read |
| <ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first. |
| The material below describes the features supported in addition to |
| Cachegrind's features.</para> |
| |
| <para>Callgrind's ability to detect function calls and returns depends |
| on the instruction set of the platform it is run on. It works best |
| on x86 and amd64, and unfortunately currently does not work so well |
| on PowerPC code. This is because there are no explicit call or return |
| instructions in the PowerPC instruction set, so Callgrind has to rely |
| on heuristics to detect calls and returns.</para> |
| |
| </sect2> |
| |
| <sect2 id="cl-manual.basics" xreflabel="Basic Usage"> |
| <title>Basic Usage</title> |
| |
| <para>As with Cachegrind, you probably want to compile with debugging info |
| (the -g flag), but with optimization turned on.</para> |
| |
| <para>To start a profile run for a program, execute: |
| <screen>valgrind --tool=callgrind [callgrind options] your-program [program options]</screen> |
| </para> |
| |
| <para>While the simulation is running, you can observe execution with |
| <screen>callgrind_control -b</screen> |
| This will print out the current backtrace. To annotate the backtrace with |
| event counts, run |
| <screen>callgrind_control -e -b</screen> |
| </para> |
| |
| <para>After program termination, a profile data file named |
| <computeroutput>callgrind.out.&lt;pid&gt;</computeroutput> |
| is generated, where <emphasis>pid</emphasis> is the process ID |
| of the program being profiled. |
| The data file contains information about the calls made in the |
| program among the functions executed, together with events of type |
| <command>Instruction Read Accesses</command> (Ir).</para> |
| |
| <para>To generate a function-by-function summary from the profile |
| data file, use |
| <screen>callgrind_annotate [options] callgrind.out.&lt;pid&gt;</screen> |
| This summary is similar to the output you get from a Cachegrind |
| run with <computeroutput>cg_annotate</computeroutput>: the list |
| of functions is ordered by the exclusive cost of each function, |
| which is also the cost shown. |
| The following two options are important for Callgrind's |
| additional features:</para> |
| |
| <itemizedlist> |
| <listitem> |
| <para><option>--inclusive=yes</option>: Instead of using |
| exclusive cost of functions as sorting order, use and show |
| inclusive cost.</para> |
| </listitem> |
| |
| <listitem> |
| <para><option>--tree=both</option>: Interleave information on the |
| callers and the callees of each function into the top level list |
| of functions. In these lines, which represent executed calls, the |
| cost gives the number of events spent in the call. Indented above |
| each function is the list of its callers, and below it the list of |
| its callees. The sum of events in calls to a given function (caller |
| lines), as well as the sum of events in calls from the function |
| (callee lines) together with the self cost, gives the total |
| inclusive cost of the function.</para> |
| </listitem> |
| </itemizedlist> |
| |
| <para>Use <option>--auto=yes</option> to get annotated source code |
| for all relevant functions for which the source can be found. In |
| addition to source annotation as produced by |
| <computeroutput>cg_annotate</computeroutput>, you will see the |
| annotated call sites with call counts. For all other options, |
| consult the (Cachegrind) documentation for |
| <computeroutput>cg_annotate</computeroutput>. |
| </para> |
| |
| <para>For a better call graph browsing experience, it is highly recommended |
| to use <ulink url="&cl-gui;">KCachegrind</ulink>. |
| If your code |
| has a significant fraction of its cost in <emphasis>cycles</emphasis> (sets |
| of functions calling each other in a recursive manner), you have to |
| use KCachegrind, as <computeroutput>callgrind_annotate</computeroutput> |
| currently does not do any cycle detection, which is needed to get correct |
| results in this case.</para> |
| |
| <para>If you are additionally interested in measuring the |
| cache behavior of your |
| program, use Callgrind with the option |
| <option><xref linkend="opt.simulate-cache"/>=yes</option>. |
| However, expect a further slowdown of approximately a factor of 2.</para> |
| |
| <para>If the program section you want to profile is somewhere in the |
| middle of the run, it is beneficial to |
| <emphasis>fast forward</emphasis> to this section without any |
| profiling, and then switch on profiling. This is achieved by using |
| the command line option |
| <option><xref linkend="opt.instr-atstart"/>=no</option> |
| and running, in a shell, |
| <computeroutput>callgrind_control -i on</computeroutput> just before the |
| interesting code section is executed. To exactly specify |
| the code position where profiling should start, use the client request |
| <computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>.</para> |
| |
| <para>If you want to see assembly code level annotation, specify |
| <option><xref linkend="opt.dump-instr"/>=yes</option>. This will produce |
| profile data at instruction granularity. Note that the resulting profile |
| data can only be viewed with KCachegrind. For assembly annotation, it is also |
| interesting to see more details of the control flow inside of functions, |
| i.e. (conditional) jumps. These will be collected by further specifying |
| <option><xref linkend="opt.collect-jumps"/>=yes</option>.</para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="cl-manual.usage" xreflabel="Advanced Usage"> |
| <title>Advanced Usage</title> |
| |
| <sect2 id="cl-manual.dumps" |
| xreflabel="Multiple dumps from one program run"> |
| <title>Multiple profiling dumps from one program run</title> |
| |
| <para>Sometimes you are not interested in characteristics of a full |
| program run, but only of a small part of it, for example execution of one |
| algorithm. If there are multiple algorithms, or one algorithm |
| running with different input data, it may even be useful to get different |
| profile information for different parts of a single program run.</para> |
| |
| <para>Profile data files have names of the form |
| <screen> |
| callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threadID</emphasis> |
| </screen> |
| </para> |
| <para>where <emphasis>pid</emphasis> is the PID of the running |
| program, <emphasis>part</emphasis> is a number incremented on each |
| dump (".part" is skipped for the dump at program termination), and |
| <emphasis>threadID</emphasis> is a thread identification |
| ("-threadID" is only used if you request dumps of individual |
| threads with <option><xref linkend="opt.separate-threads"/>=yes</option>).</para> |
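<para>The naming rule can be sketched in C as follows.
<code>profile_file_name</code> is a hypothetical helper that mirrors
the rule described above. It uses two assumed conventions: a
<code>part</code> of 0 denotes the dump at program termination, and a
<code>thread_id</code> of 0 means that threads are not dumped
separately. The exact number formatting Callgrind uses, such as any
zero padding of the thread ID, is not specified here. That formatting
is also an assumption.</para>
<screen><![CDATA[
```c
#include <stdio.h>

/* Hypothetical helper: build "callgrind.out.<pid>[.<part>][-<threadID>]".
   part <= 0      : final dump at program termination, so no ".part"
   thread_id <= 0 : threads not dumped separately, so no "-threadID"
   Returns what snprintf returns for the full name. */
int profile_file_name(char *buf, size_t size, int pid, int part, int thread_id)
{
    char part_str[16] = "";
    char thread_str[16] = "";

    if (part > 0)
        snprintf(part_str, sizeof part_str, ".%d", part);
    if (thread_id > 0)
        snprintf(thread_str, sizeof thread_str, "-%d", thread_id);

    return snprintf(buf, size, "callgrind.out.%d%s%s", pid, part_str, thread_str);
}
```
]]></screen>
<para>For example, with PID 1234, the second dump of thread 3 would be
named <computeroutput>callgrind.out.1234.2-3</computeroutput> under
these assumptions.</para>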
| |
| <para>There are different ways to generate multiple profile dumps |
| while a program is running under Callgrind's supervision. However, |
| all methods trigger the same action, which is "dump all profile |
| information since the last dump or program start, and zero cost |
| counters afterwards". To allow for zeroing cost counters without |
| dumping, there is a second action "zero all cost counters now". |
| The different methods are:</para> |
| <itemizedlist> |
| |
| <listitem> |
| <para><command>Dump on program termination.</command> |
| This method is the standard way and doesn't need any special |
| action on your part.</para> |
| </listitem> |
| |
| <listitem> |
| <para><command>Spontaneous, interactive dumping.</command> Use |
| <screen>callgrind_control -d [hint [PID/Name]]</screen> to |
| request the dumping of profile information of the supervised |
| application with PID or Name. <emphasis>hint</emphasis> is an |
| arbitrary string you can optionally specify to later be able to |
| distinguish profile dumps. The control program will not terminate |
| before the dump is completely written. Note that the application |
| must be actively running for detection of the dump command. So, |
| for a GUI application, resize the window, or for a server, send a |
| request.</para> |
| <para>If you are using <ulink url="&cl-gui;">KCachegrind</ulink> |
| for browsing of profile information, you can use the toolbar |
| button <command>Force dump</command>. This will request a dump |
| and trigger a reload after the dump is written.</para> |
| </listitem> |
| |
| <listitem> |
| <para><command>Periodic dumping after execution of a specified |
| number of basic blocks</command>. For this, use the command line |
| option <option><xref linkend="opt.dump-every-bb"/>=count</option>. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para><command>Dumping at enter/leave of all functions whose name |
| starts with</command> <emphasis>funcprefix</emphasis>. Use the |
| option <option><xref linkend="opt.dump-before"/>=funcprefix</option> |
| and <option><xref linkend="opt.dump-after"/>=funcprefix</option>. |
| To zero cost counters before entering a function, use |
| <option><xref linkend="opt.zero-before"/>=funcprefix</option>. |
| The prefix method for specifying function names was chosen to |
| ease use with C++: you don't have to specify full |
| signatures.</para> <para>You can specify these options multiple |
| times for different function prefixes.</para> |
| </listitem> |
| |
| <listitem> |
| <para><command>Program controlled dumping.</command> |
| Put <screen><![CDATA[#include <valgrind/callgrind.h>]]></screen> |
| into your source and add |
| <computeroutput>CALLGRIND_DUMP_STATS;</computeroutput> when you |
| want a dump to happen. Use |
| <computeroutput>CALLGRIND_ZERO_STATS;</computeroutput> to |
| zero cost counters without dumping.</para> |
| <para>In Valgrind terminology, this method is called "Client |
| requests". The given macros generate a special instruction |
| pattern with no effect at all (i.e. a NOP). When run under |
| Valgrind, the CPU simulation engine detects the special |
| instruction pattern and triggers special actions like the ones |
| described above.</para> |
| </listitem> |
| </itemizedlist> |
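<para>The following sketch shows program-controlled dumping, with one
dump per phase of a computation. <code>phase_sum</code>,
<code>phase_squares</code> and <code>run_phases</code> are
hypothetical placeholders for your own code. The no-op fallback macros
are an addition for this example and are not part of Callgrind. They
are used only when
<computeroutput>valgrind/callgrind.h</computeroutput> cannot be found,
so that the sketch still compiles without the Valgrind headers. The
<code>__has_include</code> check assumes GCC 5 or later, Clang, or a
C23 compiler.</para>
<screen><![CDATA[
```c
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#    define HAVE_CALLGRIND_H 1
#  endif
#endif

#ifndef HAVE_CALLGRIND_H
/* Fallback no-ops (not part of Callgrind) so the sketch builds anywhere. */
#  define CALLGRIND_DUMP_STATS do { } while (0)
#  define CALLGRIND_ZERO_STATS do { } while (0)
#endif

/* Hypothetical first phase: sum of 1..n. */
static long phase_sum(long n)
{
    long s = 0;
    for (long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Hypothetical second phase: sum of squares of 1..n. */
static long phase_squares(long n)
{
    long s = 0;
    for (long i = 1; i <= n; i++)
        s += i * i;
    return s;
}

/* Produces one dump per phase when run under Callgrind. */
void run_phases(long n, long *sum_out, long *squares_out)
{
    CALLGRIND_ZERO_STATS;            /* drop costs collected before this point */

    *sum_out = phase_sum(n);
    CALLGRIND_DUMP_STATS;            /* dump phase 1 costs, then zero counters */

    *squares_out = phase_squares(n);
    CALLGRIND_DUMP_STATS;            /* dump phase 2 costs, then zero counters */
}
```
]]></screen>
<para>When the program runs natively, the client requests do nothing.
Under <computeroutput>valgrind --tool=callgrind</computeroutput>, the
two <code>CALLGRIND_DUMP_STATS</code> requests should each produce a
numbered part file, in addition to the dump written at program
termination.</para>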
| |
| <para>If you are running a multi-threaded application and specify the |
| command line option <option><xref linkend="opt.separate-threads"/>=yes</option>, |
| every thread will be profiled on its own and will create its own |
| profile dump. Thus, the last two methods will only generate one dump |
| of the currently running thread. With the other methods, you will get |
| multiple dumps (one for each thread) on a dump request.</para> |
| |
| </sect2> |
| |
| |
| |
| <sect2 id="cl-manual.limits" |
| xreflabel="Limiting range of event collection"> |
| <title>Limiting the range of collected events</title> |
| |
| <para>Two conditions must hold for events (function enter/leave, |
| instruction execution, memory access) to be aggregated into event counts: |
| first, the events must be recognizable by Callgrind, and second, |
| the collection state must be switched on.</para> |
| |
| <para>Event collection is only possible if <emphasis>instrumentation</emphasis> |
| for program code is switched on. This is the default, but for faster |
| execution (identical to <computeroutput>valgrind --tool=none</computeroutput>), |
| it can be switched off until the program reaches a state in which |
| you want to start collecting profiling data. |
| Callgrind can start without instrumentation |
| by specifying option <option><xref linkend="opt.instr-atstart"/>=no</option>. |
| Instrumentation can be switched on interactively |
| with <screen>callgrind_control -i on</screen> |
| and off by specifying "off" instead of "on". |
| Furthermore, instrumentation state can be changed programmatically with |
| the macros <computeroutput>CALLGRIND_START_INSTRUMENTATION;</computeroutput> |
| and <computeroutput>CALLGRIND_STOP_INSTRUMENTATION;</computeroutput>. |
| </para> |
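<para>The following minimal sketch switches instrumentation on only
around an interesting code section. It assumes the program is started
with <option><xref linkend="opt.instr-atstart"/>=no</option>.
<code>setup</code>, <code>hot_section</code> and
<code>profiled_run</code> are hypothetical names. The no-op fallback
macros are an addition for this example and are used only when
<computeroutput>valgrind/callgrind.h</computeroutput> is not
available.</para>
<screen><![CDATA[
```c
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#    define HAVE_CALLGRIND_H 1
#  endif
#endif

#ifndef HAVE_CALLGRIND_H
/* Fallback no-ops (not part of Callgrind) so the sketch builds anywhere. */
#  define CALLGRIND_START_INSTRUMENTATION do { } while (0)
#  define CALLGRIND_STOP_INSTRUMENTATION do { } while (0)
#endif

#define DATA_LEN 1000

static int data[DATA_LEN];

/* Hypothetical set-up phase we do not want to profile. */
static void setup(void)
{
    /* 7919 is prime and shares no factor with 1000: a permutation of 0..999. */
    for (int i = 0; i < DATA_LEN; i++)
        data[i] = (i * 7919) % DATA_LEN;
}

/* Hypothetical hot section: the part we do want to profile. */
static long hot_section(void)
{
    long sum = 0;
    for (int i = 0; i < DATA_LEN; i++)
        sum += data[i];
    return sum;
}

long profiled_run(void)
{
    setup();                          /* uninstrumented if started with --instr-atstart=no */
    CALLGRIND_START_INSTRUMENTATION;
    long result = hot_section();
    CALLGRIND_STOP_INSTRUMENTATION;
    return result;
}
```
]]></screen>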
| |
| <para>In addition to enabling instrumentation, you must also enable |
| event collection for the parts of your program you are interested in. |
| By default, event collection is enabled everywhere. |
| You can limit collection to specific function(s) |
| by using |
| <option><xref linkend="opt.toggle-collect"/>=funcprefix</option>. |
| This will toggle the collection state on entering and leaving |
| the specified functions. |
| When this option is in effect, the default collection state |
| at program start is "off". Only events happening while running |
| inside of functions starting with <emphasis>funcprefix</emphasis> will |
| be collected. Recursive |
| calls of functions with <emphasis>funcprefix</emphasis> do not trigger |
| any action.</para> |
| |
| <para>It is important to note that with instrumentation switched off, the |
| cache simulator cannot see any memory access events, and thus, any |
| simulated cache state will be frozen and wrong without instrumentation. |
| Therefore, to get useful cache events (hits/misses) after switching on |
| instrumentation, the cache first must warm up, |
| probably leading to many <emphasis>cold misses</emphasis> |
| which would not have happened in reality. If you do not want to see these, |
| start event collection a few million instructions after you have switched |
| on instrumentation.</para> |
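<para>The warm-up advice above can be sketched with client requests.
The sketch assumes the program is started with
<option><xref linkend="opt.instr-atstart"/>=no</option> and
<option>--collect-atstart=no</option>. Instrumentation is switched on
first, and a warm-up pass of the hypothetical <code>work</code>
function fills the simulated cache. Only then is event collection
switched on with <code>CALLGRIND_TOGGLE_COLLECT</code>. The single
warm-up pass stands in for "a few million instructions". How much
warm-up is actually enough depends on the simulated cache sizes and
on your workload. The no-op fallback macros are an addition for this
example.</para>
<screen><![CDATA[
```c
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#    define HAVE_CALLGRIND_H 1
#  endif
#endif

#ifndef HAVE_CALLGRIND_H
/* Fallback no-ops (not part of Callgrind) so the sketch builds anywhere. */
#  define CALLGRIND_START_INSTRUMENTATION do { } while (0)
#  define CALLGRIND_STOP_INSTRUMENTATION do { } while (0)
#  define CALLGRIND_TOGGLE_COLLECT do { } while (0)
#endif

#define BUF_LEN 4096

static unsigned char buffer[BUF_LEN];

/* Hypothetical workload: increments and sums every byte of buffer. */
static unsigned long work(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < BUF_LEN; i++) {
        buffer[i]++;
        sum += buffer[i];
    }
    return sum;
}

/* Assumes --instr-atstart=no and --collect-atstart=no. */
unsigned long measured_run(void)
{
    CALLGRIND_START_INSTRUMENTATION;  /* simulated cache starts empty here */
    (void)work();                     /* warm-up pass: not collected, fills the cache */
    CALLGRIND_TOGGLE_COLLECT;         /* collection: off -> on */
    unsigned long result = work();    /* only this pass is counted */
    CALLGRIND_TOGGLE_COLLECT;         /* collection: on -> off */
    CALLGRIND_STOP_INSTRUMENTATION;
    return result;
}
```
]]></screen>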
| |
| |
| </sect2> |
| |
| |
| |
| <sect2 id="cl-manual.cycles" xreflabel="Avoiding cycles"> |
| <title>Avoiding cycles</title> |
| |
| <para>Informally speaking, a cycle is a group of functions which |
| call each other in a recursive way.</para> |
| |
| <para>Formally speaking, a cycle is a nonempty set S of functions, |
| such that for every pair of functions F and G in S, it is possible |
| to call from F to G (possibly via intermediate functions) and also |
| from G to F. Furthermore, S must be maximal -- that is, be the |
| largest set of functions satisfying this property. For example, if |
| a third function H is called from inside S and calls back into S, |
| then H is also part of the cycle and should be included in S.</para> |
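<para>In graph theory terms, such maximal sets are the strongly
connected components of the call graph. Only components that actually
contain a call edge count as cycles: either several functions, or a
single directly recursive function. The sketch below applies Tarjan's
algorithm to a small hypothetical call graph with the following
calls:</para>
<itemizedlist>
<listitem><para><code>A</code> and <code>B</code> call each
other.</para></listitem>
<listitem><para><code>H</code> is called from <code>B</code> and calls
back into <code>A</code>.</para></listitem>
<listitem><para><code>C</code> calls itself.</para></listitem>
</itemizedlist>
<para>The sketch illustrates the definition only. It does not describe
how KCachegrind implements its cycle detection.</para>
<screen><![CDATA[
```c
#include <stdbool.h>

/* Hypothetical call graph: main -> A, C, D;  A -> B;  B -> A, H;  H -> A;  C -> C. */
enum { F_MAIN, F_A, F_B, F_H, F_C, F_D, NF };

/* calls[f][g] is true if f calls g somewhere in the profile. */
static const bool calls[NF][NF] = {
    [F_MAIN] = { [F_A] = true, [F_C] = true, [F_D] = true },
    [F_A]    = { [F_B] = true },
    [F_B]    = { [F_A] = true, [F_H] = true },
    [F_H]    = { [F_A] = true },
    [F_C]    = { [F_C] = true },          /* direct recursion */
};

/* State for Tarjan's strongly connected components algorithm. */
static int index_of[NF], lowlink[NF], stack[NF];
static bool on_stack[NF];
static int sp, next_index, ncomponents;
static int component[NF];                 /* component id of each function */

static void strongconnect(int v)
{
    index_of[v] = lowlink[v] = next_index++;
    stack[sp++] = v;
    on_stack[v] = true;

    for (int w = 0; w < NF; w++) {
        if (!calls[v][w])
            continue;
        if (index_of[w] < 0) {            /* callee not visited yet */
            strongconnect(w);
            if (lowlink[w] < lowlink[v])
                lowlink[v] = lowlink[w];
        } else if (on_stack[w] && index_of[w] < lowlink[v]) {
            lowlink[v] = index_of[w];     /* edge back into the current component */
        }
    }

    if (lowlink[v] == index_of[v]) {      /* v is the root of a component */
        int w;
        do {
            w = stack[--sp];
            on_stack[w] = false;
            component[w] = ncomponents;
        } while (w != v);
        ncomponents++;
    }
}

void find_components(void)
{
    sp = next_index = ncomponents = 0;
    for (int v = 0; v < NF; v++) {
        index_of[v] = -1;
        on_stack[v] = false;
    }
    for (int v = 0; v < NF; v++)
        if (index_of[v] < 0)
            strongconnect(v);
}

/* f is in a cycle if its component has another member or f calls itself. */
bool in_cycle(int f)
{
    if (calls[f][f])
        return true;
    for (int g = 0; g < NF; g++)
        if (g != f && component[g] == component[f])
            return true;
    return false;
}
```
]]></screen>
<para>As in the text, <code>H</code> ends up in the same cycle as
<code>A</code> and <code>B</code>. <code>main</code> and
<code>D</code> are not part of any cycle.</para>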
| |
| <para>Recursion is quite common in programs, and therefore, cycles |
| sometimes appear in the call graph output of Callgrind. However, |
| the title of this section should raise two questions: What is bad |
| about cycles which makes you want to avoid them? And: How can |
| cycles be avoided without changing program code?</para> |
| |
| <para>Cycles are not bad in themselves, but they tend to make performance |
| analysis of your code harder. This is because inclusive costs |
| for calls inside of a cycle are meaningless. The definition of |
| inclusive cost, i.e. the self cost of a function plus the inclusive cost |
| of its callees, needs a topological order among functions. For |
| cycles, this does not hold true: callees of a function in a cycle include |
| the function itself. Therefore, KCachegrind does cycle detection |
| and skips visualization of any inclusive cost for calls inside |
| of cycles. Further, all functions in a cycle are collapsed into artificial |
| functions with names like <computeroutput>Cycle 1</computeroutput>.</para> |
| |
| <para>Now, when a program exposes really big cycles (as is |
| true for some GUI code, or in general code using an event or callback based |
| programming style), you lose the ability to pinpoint |
| the bottlenecks by following call chains from |
| <computeroutput>main()</computeroutput>, guided via |
| inclusive cost. In addition, KCachegrind loses its ability to show |
| interesting parts of the call graph, as it uses inclusive costs to |
| cut off uninteresting areas.</para> |
| |
| <para>Despite the meaninglessness of inclusive costs in cycles, the big |
| drawback for visualization motivates the possibility to temporarily |
| switch off cycle detection in KCachegrind, even though this can lead to |
| misleading visualization. However, cycles often appear because of an |
| unlucky superposition of independent call chains, such that |
| the profile result shows a cycle. Neglecting uninteresting |
| calls with very small measured inclusive cost would break these |
| cycles. In such cases, not detecting the cycles, although strictly incorrect, |
| still gives a meaningful profiling visualization.</para> |
| |
| <para>It has to be noted that currently, <command>callgrind_annotate</command> |
| does not do any cycle detection at all. For example, for program executions with |
| function recursion, it can print nonsensical inclusive costs way above 100%.</para> |
| |
| <para>After describing why cycles are bad for profiling, it is worth |
| talking about cycle avoidance. The key insight here is that symbols in |
| the profile data do not have to exactly match the symbols found in the |
| program. Instead, the symbol name could encode additional information |
| from the current execution context such as recursion level of the |
| current function, or even some part of the call chain leading to the |
| function. While encoding additional information into symbols is |
| quite capable of avoiding cycles, it has to be used carefully to not cause |
| symbol explosion. Symbol explosion imposes large memory requirements on Callgrind, |
| with possible out-of-memory conditions, and leads to big profile data files.</para> |
| |
| <para>A further possibility to avoid cycles in Callgrind's profile data |
| output is to simply leave out given functions in the call graph. Of course, this |
| also skips any call information from and to an ignored function, and thus can |
| break a cycle. Candidates for this typically are dispatcher functions in event |
| driven code. The option to ignore calls to a function is |
| <option><xref linkend="opt.fn-skip"/>=funcprefix</option>. Aside from |
| possibly breaking cycles, this is used in Callgrind to skip |
| trampoline functions in the PLT sections |
| for calls to functions in shared libraries. You can see the difference |
| if you profile with <option><xref linkend="opt.skip-plt"/>=no</option>. |
| If a call is ignored, its cost events will be propagated to the |
| enclosing function.</para> |
| |
| <para>If you have a recursive function, you can distinguish the first |
| 10 recursion levels by specifying |
| <option><xref linkend="opt.fn-recursion-num"/>=funcprefix</option>. |
| Or for all functions with |
| <option><xref linkend="opt.fn-recursion"/>=10</option>, but this will |
| give you much bigger profile data files. In the profile data, you will see |
| the recursion levels of "func" as the different functions with names |
| "func", "func'2", "func'3" and so on.</para> |
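<para>A small helper can illustrate the naming scheme for separated
recursion levels. <code>recursion_name</code> is hypothetical. It only
reproduces the names shown above: the first level keeps the plain
name, and deeper levels get an apostrophe and the level number. It
says nothing about how Callgrind tracks recursion depth, or about what
happens beyond the requested number of levels. The recursive
<code>fact</code> function prints the name that each level would
get.</para>
<screen><![CDATA[
```c
#include <stdio.h>

/* Hypothetical helper: profile name for recursion level `level` (1-based)
   of function `fn`, following the "func", "func'2", "func'3" scheme. */
int recursion_name(char *buf, size_t size, const char *fn, int level)
{
    if (level <= 1)
        return snprintf(buf, size, "%s", fn);
    return snprintf(buf, size, "%s'%d", fn, level);
}

/* Example recursive function: prints the name each call level would get. */
long fact(long n, int level)
{
    char name[32];
    recursion_name(name, sizeof name, "fact", level);
    printf("%s computes fact(%ld)\n", name, n);
    return n <= 1 ? 1 : n * fact(n - 1, level + 1);
}
```
]]></screen>
<para>For example, <code>fact(3, 1)</code> prints three lines,
attributed to <computeroutput>fact</computeroutput>,
<computeroutput>fact'2</computeroutput> and
<computeroutput>fact'3</computeroutput>.</para>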
| |
| <para>If you have call chains "A > B > C" and "A > C > B" |
| in your program, you usually get a "false" cycle "B &lt;&gt; C". Use |
| <option><xref linkend="opt.fn-caller-num"/>=B</option> |
| <option><xref linkend="opt.fn-caller-num"/>=C</option>, |
| and functions "B" and "C" will be treated as different functions |
| depending on the direct caller. Using the apostrophe for appending |
| this "context" to the function name, you get "A > B'A > C'B" |
| and "A > C'A > B'C", and there will be no cycle. Use |
| <option><xref linkend="opt.fn-caller"/>=3</option> to get a 2-caller |
| dependency for all functions. Note that doing this will increase |
| the size of profile data files.</para> |
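<para>A minimal C program with this call pattern is sketched below.
The hypothetical function <code>run_a</code> plays the role of A. No
call chain ever re-enters <code>B</code> or <code>C</code>. However,
the aggregated call graph contains both the edge from B to C and the
edge from C to B. A profile of this program would therefore be
expected to show the false cycle, unless the caller-dependent
separation described above is used.</para>
<screen><![CDATA[
```c
/* Small, hypothetical amount of work so each function has a self cost. */
static long work(long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}

static long C(int call_b);

/* B optionally calls C: gives the chain A > B > C. */
static long B(int call_c)
{
    long s = work(1000);
    if (call_c)
        s += C(0);
    return s;
}

/* C optionally calls B: gives the chain A > C > B. */
static long C(int call_b)
{
    long s = work(2000);
    if (call_b)
        s += B(0);
    return s;
}

/* Plays the role of A: no real recursion, but both B->C and C->B edges. */
long run_a(void)
{
    return B(1) + C(1);
}
```
]]></screen>
<para>Compile such a test without inlining, for example with
<option>-O0</option>. Otherwise the compiler may inline
<code>B</code> and <code>C</code>, and the calls would no longer
appear in the profile.</para>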
| |
| </sect2> |
| |
| </sect1> |
| |
| |
| <sect1 id="cl-manual.options" xreflabel="Command line option reference"> |
| <title>Command line option reference</title> |
| |
| <para> |
| In the following, options are grouped into classes, in the same order as |
| in the output of <computeroutput>valgrind --tool=callgrind --help</computeroutput>. |
| </para> |
| |
| <sect2 id="cl-manual.options.misc" |
| xreflabel="Miscellaneous options"> |
| <title>Miscellaneous options</title> |
| |
| <variablelist id="cl.opts.list.misc"> |
| |
| <varlistentry> |
| <term><option>--help</option></term> |
| <listitem> |
| <para>Show summary of options. This is a short version of this |
| manual section.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term><option>--version</option></term> |
| <listitem> |
| <para>Show the version of Callgrind.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| </sect2> |
| |
| <sect2 id="cl-manual.options.creation" |
| xreflabel="Dump creation options"> |
| <title>Dump creation options</title> |
| |
| <para> |
| These options influence the name and format of the profile data files. |
| </para> |
| |
| <variablelist id="cl.opts.list.creation"> |
| |
| <varlistentry id="opt.callgrind-out-file" xreflabel="--callgrind-out-file"> |
| <term> |
| <option><![CDATA[--callgrind-out-file=<file> ]]></option> |
| </term> |
| <listitem> |
| <para>Write the profile data to |
| <computeroutput>file</computeroutput> rather than to the default |
| output file, |
| <computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>. The |
| <option>%p</option> and <option>%q</option> format specifiers |
| can be used to embed the process ID and/or the contents of an |
| environment variable in the name, as is the case for the core |
| option <option>--log-file</option>. See <link |
| linkend="manual-core.basicopts">here</link> for details. |
| When multiple dumps are made, the file name |
| is modified further; see below.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.dump-instr" xreflabel="--dump-instr"> |
| <term> |
| <option><![CDATA[--dump-instr=<no|yes> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>This specifies that event counting should be performed at |
| per-instruction granularity. |
| This allows for assembly code |
| annotation. Currently the results can only be |
| displayed by KCachegrind.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.dump-line" xreflabel="--dump-line"> |
| <term> |
| <option><![CDATA[--dump-line=<no|yes> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>This specifies that event counting should be performed at |
| source line granularity. This allows source |
| annotation for sources which are compiled with debug information ("-g").</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.compress-strings" xreflabel="--compress-strings"> |
| <term> |
| <option><![CDATA[--compress-strings=<no|yes> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>This option influences the output format of the profile data. |
| It specifies whether strings (file and function names) should be |
| identified by numbers. This shrinks the file, |
| but makes it more difficult |
| for humans to read (which is not recommended in any case).</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.compress-pos" xreflabel="--compress-pos"> |
| <term> |
| <option><![CDATA[--compress-pos=<no|yes> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>This option influences the output format of the profile data. |
| It specifies whether numerical positions are always specified as absolute |
| values or are allowed to be relative to previous numbers. |
| This shrinks the file size.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.combine-dumps" xreflabel="--combine-dumps"> |
| <term> |
| <option><![CDATA[--combine-dumps=<no|yes> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>When multiple profile data parts are to be generated, these |
| parts are appended to the same output file if this option is set to |
| "yes". Not recommended.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
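<para>The idea behind <option>--compress-strings</option> can be
sketched as a string table. The first occurrence of a name is written
together with a numeric id, and later occurrences are written only as
the id. <code>NameTable</code> and <code>name_ref</code> are
hypothetical. The <computeroutput>(id) name</computeroutput> output
syntax is an assumption for this illustration and is not a
specification of the profile data format.</para>
<screen><![CDATA[
```c
#include <stdio.h>
#include <string.h>

#define MAX_NAMES 64

/* Hypothetical string table mapping names to small integer ids. */
typedef struct {
    const char *names[MAX_NAMES];
    int count;
} NameTable;

/* Write a reference to `name` into buf: "(id) name" the first time a name
   is seen, "(id)" afterwards. Ids start at 1. Returns the id, or -1 (and
   leaves buf untouched) if the table is full. */
int name_ref(NameTable *t, const char *name, char *buf, size_t size)
{
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->names[i], name) == 0) {
            snprintf(buf, size, "(%d)", i + 1);
            return i + 1;
        }
    }
    if (t->count == MAX_NAMES)
        return -1;
    t->names[t->count++] = name;   /* stores the pointer, not a copy */
    snprintf(buf, size, "(%d) %s", t->count, name);
    return t->count;
}
```
]]></screen>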
| </sect2> |
| |
| <sect2 id="cl-manual.options.activity" |
| xreflabel="Activity options"> |
| <title>Activity options</title> |
| |
| <para> |
| These options specify when actions relating to event counts are to |
| be executed. For interactive control use |
| <computeroutput>callgrind_control</computeroutput>. |
| </para> |
| |
| <variablelist id="cl.opts.list.activity"> |
| |
| <varlistentry id="opt.dump-every-bb" xreflabel="--dump-every-bb"> |
| <term> |
| <option><![CDATA[--dump-every-bb=<count> [default: 0, never] ]]></option> |
| </term> |
| <listitem> |
| <para>Dump profile data every &lt;count&gt; basic blocks. |
| Whether a dump is needed is only checked when Valgrind's internal |
| scheduler is run. Therefore, the minimum useful setting is about 100000. |
| The count is a 64-bit value to make long dump periods possible. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.dump-before" xreflabel="--dump-before"> |
| <term> |
| <option><![CDATA[--dump-before=<prefix> ]]></option> |
| </term> |
| <listitem> |
| <para>Dump when entering a function starting with &lt;prefix&gt;.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.zero-before" xreflabel="--zero-before"> |
| <term> |
| <option><![CDATA[--zero-before=<prefix> ]]></option> |
| </term> |
| <listitem> |
| <para>Zero all costs when entering a function starting with &lt;prefix&gt;.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.dump-after" xreflabel="--dump-after"> |
| <term> |
| <option><![CDATA[--dump-after=<prefix> ]]></option> |
| </term> |
| <listitem> |
| <para>Dump when leaving a function starting with &lt;prefix&gt;.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
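<para>All three options select functions by name prefix. A minimal
sketch of such a prefix test is shown below. <code>matches_prefix</code>
and <code>count_matches</code> are hypothetical helpers, and the C++
names are invented. The sketch assumes that names are matched against
the demangled form, as the mention of C++ signatures suggests. It
models no other matching rules Callgrind may apply. It also shows a
pitfall: a prefix selects every function whose name starts with it.
For example, <computeroutput>Parser::parse</computeroutput> also
matches <computeroutput>Parser::parseLine</computeroutput>, while
adding the opening parenthesis narrows the match.</para>
<screen><![CDATA[
```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `name` starts with `prefix`. */
bool matches_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Hypothetical helper: how many of `names` a prefix option would select. */
int count_matches(const char *const *names, int n, const char *prefix)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (matches_prefix(names[i], prefix))
            hits++;
    return hits;
}
```
]]></screen>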
| </sect2> |
| |
| <sect2 id="cl-manual.options.collection" |
| xreflabel="Data collection options"> |
| <title>Data collection options</title> |
| |
| <para> |
| These options specify when events are to be aggregated into event counts. |
| Also see <xref linkend="cl-manual.limits"/>.</para> |
| |
| <variablelist id="cl.opts.list.collection"> |
| |
| <varlistentry id="opt.instr-atstart" xreflabel="--instr-atstart"> |
| <term> |
| <option><![CDATA[--instr-atstart=<yes|no> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>Specify if you want Callgrind to start simulation and |
| profiling from the beginning of the program. |
| When set to <computeroutput>no</computeroutput>, |
| Callgrind will not be able |
| to collect any information, including calls, but it will have at |
| most a slowdown of around 4, which is the minimum Valgrind |
| overhead. Instrumentation can be interactively switched on via |
| <computeroutput>callgrind_control -i on</computeroutput>.</para> |
| <para>Note that the resulting call graph will most probably not |
| contain <computeroutput>main</computeroutput>, but will contain all the |
| functions executed after instrumentation was switched on. |
| Instrumentation can also be switched on/off programmatically. See the |
| Callgrind include file |
| <computeroutput>&lt;callgrind.h&gt;</computeroutput> for the macro |
| you have to use in your source code.</para> <para>For cache |
| simulation, results will be less accurate when switching on |
| instrumentation later in the program run, as the simulator starts |
| with an empty cache at that moment. Switch on event collection |
| later to cope with this error.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.collect-atstart"> |
| <term> |
| <option><![CDATA[--collect-atstart=<yes|no> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>Specify whether event collection is switched on at beginning |
| of the profile run.</para> |
| <para>To only look at parts of your program, you have two |
| possibilities:</para> |
| <orderedlist> |
| <listitem> |
| <para>Zero event counters before entering the program part you |
| want to profile, and dump the event counters to a file after |
| leaving that program part.</para> |
| </listitem> |
| <listitem> |
| <para>Switch on/off collection state as needed to only see |
| event counters happening while inside of the program part you |
| want to profile.</para> |
| </listitem> |
| </orderedlist> |
| <para>The second option is preferable if the program part you want to |
| profile is called many times, because option 1 would then create an |
| impractically large number of dumps.</para> |
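| <para>For option 1, a minimal C sketch might zero the event counters just |
| before the program part and dump them right after it, using the |
| <computeroutput>CALLGRIND_ZERO_STATS</computeroutput> and |
| <computeroutput>CALLGRIND_DUMP_STATS_AT</computeroutput> macros from |
| <computeroutput>valgrind/callgrind.h</computeroutput>. The functions |
| <computeroutput>count_spaces</computeroutput> and |
| <computeroutput>profile_once</computeroutput> and the dump label are |
| hypothetical; the no-op fallbacks only let the sketch compile without the |
| Valgrind headers.</para> |
| <programlisting><![CDATA[ |
| #include <stddef.h> |
| |
| /* Sketch of option 1: zero the counters before the program part and |
|  * dump them right after it. Fallbacks are hypothetical no-ops, used |
|  * only when the Valgrind header is unavailable. */ |
| #if defined(__has_include) |
| #  if __has_include(<valgrind/callgrind.h>) |
| #    include <valgrind/callgrind.h> |
| #  endif |
| #endif |
| #ifndef CALLGRIND_ZERO_STATS |
| #  define CALLGRIND_ZERO_STATS do { } while (0) |
| #endif |
| #ifndef CALLGRIND_DUMP_STATS_AT |
| #  define CALLGRIND_DUMP_STATS_AT(label) do { (void)(label); } while (0) |
| #endif |
| |
| /* Hypothetical program part to profile: count the spaces in a string. */ |
| static size_t count_spaces(const char *s) |
| { |
|     size_t n = 0; |
|     for (; *s != '\0'; s++) |
|         if (*s == ' ') |
|             n++; |
|     return n; |
| } |
| |
| size_t profile_once(const char *text) |
| { |
|     size_t n; |
|     CALLGRIND_ZERO_STATS;                    /* discard events counted so far */ |
|     n = count_spaces(text); |
|     CALLGRIND_DUMP_STATS_AT("count_spaces"); /* write a dump for this part */ |
|     return n; |
| } |
| ]]></programlisting> |
| <para>Under Callgrind, every call of <computeroutput>profile_once()</computeroutput> |
| writes a separate dump, which is why this approach becomes impractical for |
| program parts that are called many times.</para> |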
| <para>Collection state can be |
| toggled at entry and exit of a given function with the |
| option <xref linkend="opt.toggle-collect"/>. If you use this option, |
| collection |
| state should be switched off at the beginning. Note that |
| specifying <computeroutput>--toggle-collect</computeroutput> |
| implicitly sets |
| <computeroutput>--collect-atstart=no</computeroutput>.</para> |
| <para>Collection state can also be toggled by a Valgrind |
| client request in your application. For this, include |
| <computeroutput>valgrind/callgrind.h</computeroutput> and insert |
| the macro |
| <computeroutput>CALLGRIND_TOGGLE_COLLECT</computeroutput> at the |
| required positions. The macro only has an effect when the program runs |
| under the Callgrind tool.</para> |
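| <para>A minimal C sketch of the second approach: the hypothetical function |
| <computeroutput>checksum()</computeroutput> is called many times and uses |
| <computeroutput>CALLGRIND_TOGGLE_COLLECT</computeroutput> at its entry and |
| exit. Because the macro toggles rather than sets the state, the sketch |
| assumes the program is run with |
| <computeroutput>--collect-atstart=no</computeroutput>, so that the first toggle |
| switches collection on. The no-op fallback only lets the sketch compile |
| without the Valgrind headers.</para> |
| <programlisting><![CDATA[ |
| /* Sketch of option 2: toggle collection at entry and exit of a function |
|  * that is called many times. Run with --collect-atstart=no so that the |
|  * first toggle switches collection on. The fallback is a hypothetical |
|  * no-op, used only when the Valgrind header is unavailable. */ |
| #if defined(__has_include) |
| #  if __has_include(<valgrind/callgrind.h>) |
| #    include <valgrind/callgrind.h> |
| #  endif |
| #endif |
| #ifndef CALLGRIND_TOGGLE_COLLECT |
| #  define CALLGRIND_TOGGLE_COLLECT do { } while (0) |
| #endif |
| |
| /* Hypothetical program part that is called many times. */ |
| static unsigned checksum(unsigned x) |
| { |
|     unsigned result; |
|     CALLGRIND_TOGGLE_COLLECT;           /* off -> on at entry */ |
|     result = (x * 2654435761u) >> 16; |
|     CALLGRIND_TOGGLE_COLLECT;           /* on -> off at exit */ |
|     return result; |
| } |
| |
| /* With collection off at start, the loop overhead here is not counted; |
|  * only the events between the two toggles in checksum() are. */ |
| unsigned run_checksums(unsigned count) |
| { |
|     unsigned acc = 0; |
|     for (unsigned i = 0; i < count; i++) |
|         acc ^= checksum(i); |
|     return acc; |
| } |
| ]]></programlisting> |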
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.toggle-collect" xreflabel="--toggle-collect"> |
| <term> |
| <option><![CDATA[--toggle-collect=<prefix> ]]></option> |
| </term> |
| <listitem> |
| <para>Toggle collection on entry/exit of a function whose name |
| starts with &lt;prefix&gt;.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.collect-jumps" xreflabel="--collect-jumps"> |
| <term> |
| <option><![CDATA[--collect-jumps=<no|yes> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>This specifies whether information for (conditional) jumps |
| should be collected. As above, <command>callgrind_annotate</command> |
| currently cannot show you this data. You have to use KCachegrind to get |
| jump arrows in the annotated code.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| </sect2> |
| |
| <sect2 id="cl-manual.options.separation" |
| xreflabel="Cost entity separation options"> |
| <title>Cost entity separation options</title> |
| |
| <para> |
| These options specify how event counts should be attributed to execution |
| contexts. |
| For example, they specify whether the recursion level or the |
| call chain leading to a function should be taken into account, |
| and whether the thread ID should be considered. |
| Also see <xref linkend="cl-manual.cycles"/>.</para> |
| |
| <variablelist id="cmd-options.separation"> |
| |
| <varlistentry id="opt.separate-threads" xreflabel="--separate-threads"> |
| <term> |
| <option><![CDATA[--separate-threads=<no|yes> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>This option specifies whether profile data should be generated |
| separately for every thread. If yes, the file names get "-threadID" |
| appended.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-recursion" xreflabel="--fn-recursion"> |
| <term> |
| <option><![CDATA[--fn-recursion=<level> [default: 2] ]]></option> |
| </term> |
| <listitem> |
| <para>Separate function recursions by at most &lt;level&gt; levels. |
| See <xref linkend="cl-manual.cycles"/>.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-caller" xreflabel="--fn-caller"> |
| <term> |
| <option><![CDATA[--fn-caller=<callers> [default: 0] ]]></option> |
| </term> |
| <listitem> |
| <para>Separate contexts by at most &lt;callers&gt; functions in the |
| call chain. See <xref linkend="cl-manual.cycles"/>.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.skip-plt" xreflabel="--skip-plt"> |
| <term> |
| <option><![CDATA[--skip-plt=<no|yes> [default: yes] ]]></option> |
| </term> |
| <listitem> |
| <para>Ignore calls to/from PLT sections.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-skip" xreflabel="--fn-skip"> |
| <term> |
| <option><![CDATA[--fn-skip=<function> ]]></option> |
| </term> |
| <listitem> |
| <para>Ignore calls to/from a given function. E.g. if you have a |
| call chain A > B > C, and you specify function B to be |
| ignored, you will only see A > C.</para> |
| <para>This is very convenient for skipping functions that handle callback |
| behaviour. For example, with the signal/slot mechanism in the |
| Qt graphics library, you only want |
| to see the function emitting a signal as the caller of the slots connected |
| to that signal. First determine the real call chain to find out which |
| functions need to be skipped, then use this option.</para> |
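| <para>To make the A > B > C example concrete, the C sketch below defines |
| three hypothetical functions, <computeroutput>func_a</computeroutput>, |
| <computeroutput>func_b</computeroutput> and |
| <computeroutput>func_c</computeroutput>, forming that call chain. Compiled |
| with debug information and run under |
| <computeroutput>valgrind --tool=callgrind --fn-skip=func_b</computeroutput>, |
| the profile should, as described above, show |
| <computeroutput>func_a</computeroutput> calling |
| <computeroutput>func_c</computeroutput> directly. The |
| <computeroutput>noinline</computeroutput> attribute is a GCC/Clang extension |
| used here only to keep the compiler from merging the functions.</para> |
| <programlisting><![CDATA[ |
| /* Sketch of the call chain A > B > C. The noinline attribute |
|  * (a GCC/Clang extension) keeps each function visible as a |
|  * separate function in the profile. */ |
| #if defined(__GNUC__) |
| #  define NOINLINE __attribute__((noinline)) |
| #else |
| #  define NOINLINE |
| #endif |
| |
| NOINLINE int func_c(int x) { return x * x; }         /* C: does the work */ |
| NOINLINE int func_b(int x) { return func_c(x) + 1; } /* B: thin forwarder */ |
| NOINLINE int func_a(int x) { return func_b(x) * 2; } /* A: caller of interest */ |
| ]]></programlisting> |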
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-group"> |
| <term> |
| <option><![CDATA[--fn-group<number>=<function> ]]></option> |
| </term> |
| <listitem> |
| <para>Put a function into a separate group. This influences the |
| context name used for cycle avoidance. The context name resembles the |
| call chain leading to a context, and all functions inside such a group |
| are treated as the same function when building it. By specifying function |
| groups with this option, you can shorten context names, as functions |
| of the same group will not appear in sequence in the name.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-recursion-num" xreflabel="--fn-recursion10"> |
| <term> |
| <option><![CDATA[--fn-recursion<number>=<function> ]]></option> |
| </term> |
| <listitem> |
| <para>Separate &lt;number&gt; recursions for &lt;function&gt;. |
| See <xref linkend="cl-manual.cycles"/>.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.fn-caller-num" xreflabel="--fn-caller2"> |
| <term> |
| <option><![CDATA[--fn-caller<number>=<function> ]]></option> |
| </term> |
| <listitem> |
| <para>Separate &lt;number&gt; callers for &lt;function&gt;. |
| See <xref linkend="cl-manual.cycles"/>.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| </sect2> |
| |
| <sect2 id="cl-manual.options.simulation" |
| xreflabel="Cache simulation options"> |
| <title>Cache simulation options</title> |
| |
| <variablelist id="cl.opts.list.simulation"> |
| |
| <varlistentry id="opt.simulate-cache" xreflabel="--simulate-cache"> |
| <term> |
| <option><![CDATA[--simulate-cache=<yes|no> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>Specify whether you want to do full cache simulation. By default, |
| only instruction read accesses are profiled.</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry id="opt.simulate-hwpref" xreflabel="--simulate-hwpref"> |
| <term> |
| <option><![CDATA[--simulate-hwpref=<yes|no> [default: no] ]]></option> |
| </term> |
| <listitem> |
| <para>Specify whether to add simulation of a hardware prefetcher |
| that detects stream access in the second-level cache by comparing |
| accesses separately for each page. |
| As the simulation cannot model the timing of prefetches, |
| it assumes that every triggered hardware prefetch completes before the |
| real access is done. Thus, this gives a best-case scenario by covering |
| all possible stream accesses.</para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |