 <?xml version="1.0"?> <!-- -*- sgml -*- -->
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
   "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
 [ <!ENTITY % cl-entities SYSTEM "cl-entities.xml"> %cl-entities; ]>

 <chapter id="cl-manual" xreflabel="Callgrind Manual">
 <title>Callgrind: a call-graph generating cache profiler</title>


 <para>To use this tool, you must specify
 <computeroutput>--tool=callgrind</computeroutput> on the
 Valgrind command line.</para>

 <sect1 id="cl-manual.use" xreflabel="Overview">
 <title>Overview</title>

 <para>Callgrind is a profiling tool that can
 construct a call graph for a program's run.
 By default, the collected data consists of
 the number of instructions executed, their relationship
 to source lines, the caller/callee relationship between functions,
 and the numbers of such calls.
 Optionally, a cache simulator (similar to Cachegrind) can produce
 further information about the memory access behavior of the application.
 </para>

 <para>The profile data is written out to a file at program
 termination. For presentation of the data, and interactive control
 of the profiling, two command line tools are provided:</para>
 <variablelist>
   <varlistentry>
   <term><command>callgrind_annotate</command></term>
   <listitem>
     <para>This command reads in the profile data, and prints a
     sorted list of functions, optionally with source annotation.</para>
 <!--
     <para>You can read the manpage here: <xref
 	      linkend="callgrind-annotate"/>.</para>
 -->
     <para>For graphical visualization of the data, try
     <ulink url="&cl-gui;">KCachegrind</ulink>, which is a KDE/Qt based
     GUI that makes it easy to navigate the large amount of data that
     Callgrind produces.</para>

   </listitem>
   </varlistentry>

   <varlistentry>
   <term><command>callgrind_control</command></term>
   <listitem>
     <para>This command enables you to interactively observe and control
     the status of currently running applications, without stopping
     the application.  You can
     get statistics information as well as the current stack trace, and
     you can request zeroing of counters or dumping of profile data.</para>
 <!--
     <para>You can read the manpage here: <xref linkend="callgrind-control"/>.</para>
 -->
   </listitem>
   </varlistentry>
 </variablelist>

 <para>To use Callgrind, you must specify
 <computeroutput>--tool=callgrind</computeroutput> on the Valgrind
 command line.</para>

   <sect2 id="cl-manual.functionality" xreflabel="Functionality">
   <title>Functionality</title>

 <para>Cachegrind collects flat profile data: event counts (data reads,
 cache misses, etc.) are attributed directly to the function they
 occurred in.  This cost attribution mechanism is
 called <emphasis>self</emphasis> or <emphasis>exclusive</emphasis>
 attribution.</para>

 <para>Callgrind extends this functionality by propagating costs
 across function call boundaries.  If function <code>foo</code> calls
 <code>bar</code>, the costs from <code>bar</code> are added into
 <code>foo</code>'s costs.  When applied to the program as a whole,
 this builds up a picture of so-called <emphasis>inclusive</emphasis>
 costs, that is, where the cost of each function includes the costs of
 all functions it called, directly or indirectly.</para>

 <para>As an example, the inclusive cost of
 <computeroutput>main</computeroutput> should be almost 100 percent
 of the total program cost.  Because of costs arising before
 <computeroutput>main</computeroutput> is run, such as
 initialization of the run time linker and construction of global C++
 objects, the inclusive cost of <computeroutput>main</computeroutput>
 is not exactly 100 percent of the total program cost.</para>
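
 <para>The relationship between self and inclusive cost can be
 sketched in a few lines of C. This is only an illustration of the
 definition, not Callgrind's implementation: the
 <computeroutput>struct func</computeroutput> type and
 <computeroutput>inclusive_cost()</computeroutput> are hypothetical
 names, and the sketch assumes an acyclic call graph in which every
 function is called from a single call site.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical node of an acyclic call graph (not a Callgrind type). */
struct func {
    const char *name;
    unsigned long self_cost;                  /* exclusive cost, e.g. Ir */
    const struct func *callees[MAX_CALLEES];  /* functions called from here */
    size_t n_callees;
};

/* Inclusive cost = self cost + inclusive cost of every callee.
   The recursion terminates only if the call graph has no cycles. */
unsigned long inclusive_cost(const struct func *f)
{
    assert(f != NULL && f->n_callees <= MAX_CALLEES);
    unsigned long total = f->self_cost;
    for (size_t i = 0; i < f->n_callees; i++)
        total += inclusive_cost(f->callees[i]);
    return total;
}
```
]]></programlisting>
 <para>With <computeroutput>main</computeroutput> calling
 <computeroutput>foo</computeroutput>, which in turn calls
 <computeroutput>bar</computeroutput>, the inclusive cost of
 <computeroutput>main</computeroutput> is the sum of all three
 self costs.</para>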

 <para>Together with the call graph, this allows you to find the
 specific call chains starting from
 <computeroutput>main</computeroutput> in which the majority of the
 program's costs occur.  Caller/callee cost attribution is also useful
 for profiling functions called from multiple call sites, and where
 optimization opportunities depend on changing code in the callers, in
 particular by reducing the call count.</para>

 <para>Callgrind's cache simulation is based on the
 <ulink url="&cg-tool-url;">Cachegrind tool</ulink>. Read
 <ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first.
 The material below describes the features supported in addition to
 Cachegrind's features.</para>

 <para>Callgrind's ability to detect function calls and returns depends
 on the instruction set of the platform it is run on.  It works best
 on x86 and amd64, and unfortunately currently does not work so well
 on PowerPC code.  This is because there are no explicit call or return
 instructions in the PowerPC instruction set, so Callgrind has to rely
 on heuristics to detect calls and returns.</para>

   </sect2>

   <sect2 id="cl-manual.basics" xreflabel="Basic Usage">
   <title>Basic Usage</title>

   <para>As with Cachegrind, you probably want to compile with debugging info
   (the -g flag), but with optimization turned on.</para>

   <para>To start a profile run for a program, execute:
   <screen>valgrind --tool=callgrind [callgrind options] your-program [program options]</screen>
   </para>

   <para>While the simulation is running, you can observe execution with
   <screen>callgrind_control -b</screen>
   This will print out the current backtrace. To annotate the backtrace with
   event counts, run
   <screen>callgrind_control -e -b</screen>
   </para>

   <para>After program termination, a profile data file named
   <computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>
   is generated, where <emphasis>pid</emphasis> is the process ID
   of the program being profiled.
   The data file contains information about the calls made in the
   program among the functions executed, together with events of type
   <command>Instruction Read Accesses</command> (Ir).</para>

   <para>To generate a function-by-function summary from the profile
   data file, use
   <screen>callgrind_annotate [options] callgrind.out.&lt;pid&gt;</screen>
   This summary is similar to the output you get from a Cachegrind
   run with <computeroutput>cg_annotate</computeroutput>: the list
   of functions is ordered by exclusive cost, and the costs shown
   are exclusive costs as well.
   Two options are important for the additional features of Callgrind:</para>

   <itemizedlist>
     <listitem>
       <para><option>--inclusive=yes</option>: Instead of using
       exclusive cost of functions as sorting order, use and show
       inclusive cost.</para>
     </listitem>

     <listitem>
       <para><option>--tree=both</option>: Interleave into the
       top level list of functions, information on the callers and the callees
       of each function. In these lines, which represent executed
       calls, the cost gives the number of events spent in the call.
       Indented, above each function, there is the list of callers,
       and below, the list of callees. The sum of events in calls to
       a given function (caller lines), as well as the sum of events in
       calls from the function (callee lines) together with the self
       cost, gives the total inclusive cost of the function.</para>
      </listitem>
   </itemizedlist>

   <para>Use <option>--auto=yes</option> to get annotated source code
   for all relevant functions for which the source can be found. In
   addition to source annotation as produced by
   <computeroutput>cg_annotate</computeroutput>, you will see the
   annotated call sites with call counts. For all other options,
   consult the (Cachegrind) documentation for
   <computeroutput>cg_annotate</computeroutput>.
   </para>

   <para>For a better call graph browsing experience, it is highly recommended
   to use <ulink url="&cl-gui;">KCachegrind</ulink>.
   If your code
   has a significant fraction of its cost in <emphasis>cycles</emphasis> (sets
   of functions calling each other in a recursive manner), you have to
   use KCachegrind, as <computeroutput>callgrind_annotate</computeroutput>
   currently does not do any cycle detection, which is important to get correct
   results in this case.</para>

   <para>If you are additionally interested in measuring the
   cache behavior of your
   program, use Callgrind with the option
   <option><xref linkend="opt.simulate-cache"/>=yes</option>.
   However, expect a further slowdown by a factor of approximately 2.</para>

   <para>If the program section you want to profile is somewhere in the
   middle of the run, it is beneficial to
   <emphasis>fast forward</emphasis> to this section without any
   profiling, and then switch on profiling.  This is achieved by using
   the command line option
   <option><xref linkend="opt.instr-atstart"/>=no</option>
   and running, in a shell,
   <computeroutput>callgrind_control -i on</computeroutput> just before the
   interesting code section is executed. To exactly specify
   the code position where profiling should start, use the client request
   <computeroutput><xref linkend="cr.start-instr"/></computeroutput>.</para>
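
   <para>A minimal sketch of such a fast forward from inside the program
   follows. It assumes that the client requests referenced above are the
   macros <computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>
   and <computeroutput>CALLGRIND_STOP_INSTRUMENTATION</computeroutput> from
   <computeroutput>&lt;valgrind/callgrind.h&gt;</computeroutput>. The
   functions <computeroutput>load_input()</computeroutput>,
   <computeroutput>hot_section()</computeroutput> and
   <computeroutput>run_with_fast_forward()</computeroutput> are hypothetical
   stand-ins for your own code. The macros fall back to no-ops when the
   header is not installed, so the file still builds.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Use the real client requests when the Valgrind headers are available;
   otherwise define no-op fallbacks (an assumption of this sketch). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION  ((void)0)
#endif

/* Hypothetical set-up phase that should not be profiled. */
static unsigned long load_input(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i % 7;
    return sum;
}

/* Hypothetical code section of interest: sum of squares 1..limit. */
static unsigned long hot_section(unsigned long limit)
{
    unsigned long sum = 0;
    for (unsigned long i = 1; i <= limit; i++)
        sum += i * i;
    return sum;
}

unsigned long run_with_fast_forward(unsigned long n)
{
    assert(n > 0);
    /* Runs without instrumentation when started with --instr-atstart=no. */
    unsigned long limit = load_input(n);

    CALLGRIND_START_INSTRUMENTATION;   /* profiling starts exactly here */
    unsigned long result = hot_section(limit);
    CALLGRIND_STOP_INSTRUMENTATION;    /* back to fast, uninstrumented mode */

    return result;
}
```
]]></programlisting>
   <para>A program built around this sketch would be started with
   <computeroutput>valgrind --tool=callgrind --instr-atstart=no your-program</computeroutput>,
   so that only <computeroutput>hot_section()</computeroutput> is
   instrumented.</para>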

   <para>If you want to be able to see assembly code level annotation, specify
   <option><xref linkend="opt.dump-instr"/>=yes</option>. This will produce
   profile data at instruction granularity. Note that the resulting profile
   data
   can only be viewed with KCachegrind. For assembly annotation, it is also
   interesting to see more details of the control flow inside functions,
   i.e. (conditional) jumps. These are collected by additionally specifying
   <option><xref linkend="opt.collect-jumps"/>=yes</option>.</para>

   </sect2>

 </sect1>

 <sect1 id="cl-manual.usage" xreflabel="Advanced Usage">
 <title>Advanced Usage</title>

   <sect2 id="cl-manual.dumps"
          xreflabel="Multiple dumps from one program run">
   <title>Multiple profiling dumps from one program run</title>

   <para>Sometimes you are not interested in characteristics of a full
   program run, but only of a small part of it, for example execution of one
   algorithm.  If there are multiple algorithms, or one algorithm
   running with different input data, it may even be useful to get different
   profile information for different parts of a single program run.</para>

   <para>Profile data files have names of the form
 <screen>
 callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threadID</emphasis>
 </screen>
   </para>
   <para>where <emphasis>pid</emphasis> is the PID of the running
   program, <emphasis>part</emphasis> is a number incremented on each
   dump (".part" is skipped for the dump at program termination), and
   <emphasis>threadID</emphasis> is a thread identification
   ("-threadID" is only used if you request dumps of individual
   threads with <option><xref linkend="opt.separate-threads"/>=yes</option>).</para>

   <para>There are different ways to generate multiple profile dumps
   while a program is running under Callgrind's supervision.  Nevertheless,
   all methods trigger the same action, which is "dump all profile
   information since the last dump or program start, and zero cost
   counters afterwards".  To allow for zeroing cost counters without
   dumping, there is a second action "zero all cost counters now".
   The different methods are:</para>
   <itemizedlist>

     <listitem>
       <para><command>Dump on program termination.</command>
       This method is the standard way and doesn't need any special
       action on your part.</para>
     </listitem>

     <listitem>
       <para><command>Spontaneous, interactive dumping.</command> Use
       <screen>callgrind_control -d [hint [PID/Name]]</screen> to
       request the dumping of profile information of the supervised
       application with PID or Name.  <emphasis>hint</emphasis> is an
       arbitrary string you can optionally specify to later be able to
       distinguish profile dumps.  The control program will not terminate
       before the dump is completely written.  Note that the application
       must be actively running for detection of the dump command. So,
       for a GUI application, resize the window, or for a server, send a
       request.</para>
       <para>If you are using <ulink url="&cl-gui;">KCachegrind</ulink>
       for browsing of profile information, you can use the toolbar
       button <command>Force dump</command>. This will request a dump
       and trigger a reload after the dump is written.</para>
     </listitem>

     <listitem>
       <para><command>Periodic dumping after execution of a specified
       number of basic blocks</command>. For this, use the command line
       option <option><xref linkend="opt.dump-every-bb"/>=count</option>.
       </para>
     </listitem>

     <listitem>
       <para><command>Dumping at enter/leave of specified functions.</command>
       Use the
       option <option><xref linkend="opt.dump-before"/>=function</option>
       and <option><xref linkend="opt.dump-after"/>=function</option>.
       To zero cost counters before entering a function, use
       <option><xref linkend="opt.zero-before"/>=function</option>.</para>
       <para>You can specify these options multiple times for different
       functions. Function specifications support wildcards: e.g. use
       <option><xref linkend="opt.dump-before"/>='foo*'</option> to
       generate dumps before entering any function starting with
       <emphasis>foo</emphasis>.</para>
     </listitem>

     <listitem>
       <para><command>Program controlled dumping.</command>
       Insert
       <computeroutput><xref linkend="cr.dump-stats"/>;</computeroutput>
       at the position in your code where you want a profile dump to happen. Use
       <computeroutput><xref linkend="cr.zero-stats"/>;</computeroutput> to only
       zero profile counters.
       See <xref linkend="cl-manual.clientrequests"/> for more information on
       Callgrind specific client requests.</para>
     </listitem>
   </itemizedlist>
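
   <para>Program controlled dumping can be sketched as follows, assuming the
   client requests named above are the macros
   <computeroutput>CALLGRIND_DUMP_STATS</computeroutput> and
   <computeroutput>CALLGRIND_ZERO_STATS</computeroutput> from
   <computeroutput>&lt;valgrind/callgrind.h&gt;</computeroutput>. The two
   workloads and <computeroutput>profile_two_phases()</computeroutput> are
   hypothetical; the no-op fallbacks only exist so that the sketch builds
   without the Valgrind headers.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_DUMP_STATS
#  define CALLGRIND_DUMP_STATS ((void)0)   /* fallback: no-op */
#  define CALLGRIND_ZERO_STATS ((void)0)   /* fallback: no-op */
#endif

/* Two hypothetical algorithms to be profiled separately. */
static long sum_forward(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

static long sum_backward(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = n; i > 0; i--)
        s += v[i - 1];
    return s;
}

long profile_two_phases(const int *v, size_t n)
{
    assert(v != NULL);

    CALLGRIND_ZERO_STATS;          /* discard costs gathered so far */
    long a = sum_forward(v, n);
    CALLGRIND_DUMP_STATS;          /* first dump; counters are zeroed afterwards */

    long b = sum_backward(v, n);
    CALLGRIND_DUMP_STATS;          /* second dump: costs since the first dump */

    return a + b;
}
```
]]></programlisting>
   <para>Under Callgrind, each dump request produces its own profile data
   file, so the two algorithms can be compared side by side.</para>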

   <para>If you are running a multi-threaded application and specify the
   command line option <option><xref linkend="opt.separate-threads"/>=yes</option>,
   every thread will be profiled on its own and will create its own
   profile dump. Thus, the last two methods will only generate one dump
   of the currently running thread. With the other methods, you will get
   multiple dumps (one for each thread) on a dump request.</para>

   </sect2>


   <sect2 id="cl-manual.limits"
          xreflabel="Limiting range of event collection">
   <title>Limiting the range of collected events</title>

   <para>For aggregating events (function enter/leave,
   instruction execution, memory access) into event numbers,
   first, the events must be recognizable by Callgrind, and second,
   the collection state must be switched on.</para>

   <para>Event collection is only possible if <emphasis>instrumentation</emphasis>
   for program code is switched on. This is the default, but for faster
   execution (identical to <computeroutput>valgrind --tool=none</computeroutput>),
   it can be switched off until the program reaches a state in which
   you want to start collecting profiling data.
   Callgrind can start without instrumentation
   by specifying option <option><xref linkend="opt.instr-atstart"/>=no</option>.
   Instrumentation can be switched on interactively
   with <screen>callgrind_control -i on</screen>
   and off by specifying "off" instead of "on".
   Furthermore, instrumentation state can be programmatically changed with
   the macros <computeroutput><xref linkend="cr.start-instr"/>;</computeroutput>
   and <computeroutput><xref linkend="cr.stop-instr"/>;</computeroutput>.
   </para>

   <para>In addition to enabling instrumentation, you must also enable
   event collection for the parts of your program you are interested in.
   By default, event collection is enabled everywhere.
   You can limit collection to a specific function
   by using
   <option><xref linkend="opt.toggle-collect"/>=function</option>.
   This will toggle the collection state on entering and leaving
   the specified functions.
   When this option is in effect, the default collection state
   at program start is "off".  Only events happening while running
   inside of the given function will be collected. Recursive
   calls of the given function do not trigger any action.</para>

   <para>It is important to note that with instrumentation switched off, the
   cache simulator cannot see any memory access events, and thus, any
   simulated cache state will be frozen and wrong without instrumentation.
   Therefore, to get useful cache events (hits/misses) after switching on
   instrumentation, the cache first must warm up,
   probably leading to many <emphasis>cold misses</emphasis>
   which would not have happened in reality. If you do not want to see these,
   start event collection a few million instructions after you have switched
   on instrumentation.</para>


   </sect2>


   <sect2 id="cl-manual.cycles" xreflabel="Avoiding cycles">
   <title>Avoiding cycles</title>

   <para>Informally speaking, a cycle is a group of functions which
   call each other in a recursive way.</para>

   <para>Formally speaking, a cycle is a nonempty set S of functions,
   such that for every pair of functions F and G in S, it is possible
   to call from F to G (possibly via intermediate functions) and also
   from G to F.  Furthermore, S must be maximal -- that is, be the
   largest set of functions satisfying this property.  For example, if
   a third function H is called from inside S and calls back into S,
   then H is also part of the cycle and should be included in S.</para>
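
   <para>The formal definition corresponds to mutual reachability in the
   call graph. The following sketch illustrates it on a tiny, hypothetical
   graph using a transitive closure (Warshall's algorithm). It is not how
   KCachegrind detects cycles; <computeroutput>call_closure()</computeroutput>
   and <computeroutput>same_cycle()</computeroutput> are names invented for
   this example.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

#define NFUNCS 4   /* hypothetical functions: 0 = main, 1 = F, 2 = G, 3 = H */

/* calls[f][g] is true if f calls g directly.  On return, reach[f][g] is
   true if g can be reached from f through one or more calls. */
void call_closure(bool calls[NFUNCS][NFUNCS], bool reach[NFUNCS][NFUNCS])
{
    for (int i = 0; i < NFUNCS; i++)
        for (int j = 0; j < NFUNCS; j++)
            reach[i][j] = calls[i][j];

    for (int k = 0; k < NFUNCS; k++)
        for (int i = 0; i < NFUNCS; i++)
            for (int j = 0; j < NFUNCS; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;
}

/* f and g belong to the same cycle if each can be reached from the other.
   For f == g this requires f to be (directly or indirectly) recursive. */
bool same_cycle(bool reach[NFUNCS][NFUNCS], int f, int g)
{
    assert(0 <= f && f < NFUNCS && 0 <= g && g < NFUNCS);
    return reach[f][g] && reach[g][f];
}
```
]]></programlisting>
   <para>For the calls main &gt; F, F &gt; G, G &gt; H and H &gt; F, the
   functions F, G and H form one cycle, while
   <computeroutput>main</computeroutput> is not part of it.</para>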

   <para>Recursion is quite usual in programs, and therefore, cycles
   sometimes appear in the call graph output of Callgrind. However,
   the title of this section should raise two questions: What is bad
   about cycles which makes you want to avoid them? And: How can
   cycles be avoided without changing program code?</para>

   <para>Cycles are not bad in themselves, but they tend to make performance
   analysis of your code harder. This is because inclusive costs
   for calls inside of a cycle are meaningless. The definition of
   inclusive cost, i.e. self cost of a function plus inclusive cost
   of its callees, needs a topological order among functions. For
   cycles, this does not hold true: callees of a function in a cycle include
   the function itself. Therefore, KCachegrind does cycle detection
   and skips visualization of any inclusive cost for calls inside
   of cycles. Further, all functions in a cycle are collapsed into artificial
   functions with names like <computeroutput>Cycle 1</computeroutput>.</para>

   <para>Now, when a program exposes really big cycles (as is
   true for some GUI code, or in general code using an event or callback based
   programming style), you lose the ability to pinpoint
   the bottlenecks by following call chains from
   <computeroutput>main()</computeroutput>, guided by
   inclusive cost. In addition, KCachegrind loses its ability to show
   interesting parts of the call graph, as it uses inclusive costs to
   cut off uninteresting areas.</para>

   <para>Despite the meaninglessness of inclusive costs in cycles, the big
   drawback for visualization motivates the possibility to temporarily
   switch off cycle detection in KCachegrind, which can lead to
   misleading visualization. However, cycles often appear because of
   an unlucky superposition of independent call chains that
   the profile result sees as a cycle. Neglecting uninteresting
   calls with very small measured inclusive cost would break these
   cycles. In such cases, incorrect handling of cycles by not detecting
   them still gives meaningful profiling visualization.</para>

   <para>Note that currently, <command>callgrind_annotate</command>
   does not do any cycle detection at all. For program executions with function
   recursion, it can, for example, print nonsensical inclusive costs far above 100%.</para>

   <para>After describing why cycles are bad for profiling, it is worth
   talking about cycle avoidance. The key insight here is that symbols in
   the profile data do not have to exactly match the symbols found in the
   program. Instead, the symbol name could encode additional information
   from the current execution context such as recursion level of the
   current function, or even some part of the call chain leading to the
   function. While encoding of additional information into symbols is
   quite capable of avoiding cycles, it has to be used carefully to not cause
   symbol explosion. The latter imposes large memory requirements on Callgrind,
   with possible out-of-memory conditions, and produces big profile data files.</para>

   <para>A further possibility to avoid cycles in Callgrind's profile data
   output is to simply leave out given functions in the call graph. Of course, this
   also skips any call information from and to an ignored function, and thus can
   break a cycle. Candidates for this typically are dispatcher functions in event
   driven code. The option to ignore calls to a function is
   <option><xref linkend="opt.fn-skip"/>=function</option>. Aside from
   possibly breaking cycles, this is used in Callgrind to skip
   trampoline functions in the PLT sections
   for calls to functions in shared libraries. You can see the difference
   if you profile with <option><xref linkend="opt.skip-plt"/>=no</option>.
   If a call is ignored, its cost events will be propagated to the
   enclosing function.</para>

   <para>If you have a recursive function, you can distinguish the first
   10 recursion levels by specifying
   <option><xref linkend="opt.separate-recs-num"/>=function</option>.
   To do this for all functions, use
   <option><xref linkend="opt.separate-recs"/>=10</option>, but this will
   give you much bigger profile data files.  In the profile data, you will see
   the recursion levels of "func" as the different functions with names
   "func", "func'2", "func'3" and so on.</para>

   <para>If you have call chains "A &gt; B &gt; C" and "A &gt; C &gt; B"
   in your program, you usually get a "false" cycle "B &lt;&gt; C". Use
   <option><xref linkend="opt.separate-callers-num"/>=B</option>
   <option><xref linkend="opt.separate-callers-num"/>=C</option>,
   and functions "B" and "C" will be treated as different functions
   depending on the direct caller. Using the apostrophe for appending
   this "context" to the function name, you get "A &gt; B'A &gt; C'B"
   and "A &gt; C'A &gt; B'C", and there will be no cycle. Use
   <option><xref linkend="opt.separate-callers"/>=2</option> to get a 2-caller
   dependency for all functions.  Note that doing this will increase
   the size of profile data files.</para>
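
   <para>The naming scheme for a one-caller context can be sketched as
   follows. This only reproduces the names given in the example above
   ("B'A", "C'B"); <computeroutput>context_name()</computeroutput> is a
   hypothetical helper, not a Callgrind function, and longer contexts are
   deliberately left out.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdio.h>

/* Writes the context-dependent name of chain[index] into buf, using the
   direct caller chain[index - 1] as context.  The first function in the
   chain has no caller and keeps its plain name.
   Returns the name length, or -1 if buf is too small. */
int context_name(const char *const chain[], size_t index,
                 char *buf, size_t size)
{
    assert(chain != NULL && buf != NULL && size > 0);
    int len = (index == 0)
        ? snprintf(buf, size, "%s", chain[0])
        : snprintf(buf, size, "%s'%s", chain[index], chain[index - 1]);
    return (len >= 0 && (size_t)len < size) ? len : -1;
}
```
]]></programlisting>
   <para>For the chain "A &gt; B &gt; C", the function at index 2 is named
   "C'B"; for "A &gt; C &gt; B", the function at index 2 is named "B'C". The
   two chains therefore no longer share any node below "A".</para>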

   </sect2>

   <sect2 id="cl-manual.forkingprograms" xreflabel="Forking Programs">
   <title>Forking Programs</title>

   <para>If your program forks, the child will inherit all the profiling
   data that has been gathered for the parent. To start with empty profile
   counter values in the child, the client request
   <computeroutput><xref linkend="cr.zero-stats"/>;</computeroutput>
   can be inserted into code to be executed by the child, directly after
   <computeroutput>fork()</computeroutput>.</para>

   <para>However, you will have to make sure that the output file format string
   (controlled by <option>--callgrind-out-file</option>) does contain
   <option>%p</option> (which is true by default). Otherwise, the
   outputs from the parent and child will overwrite each other or will be
   intermingled, which almost certainly is not what you want.</para>
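
   <para>A minimal sketch of zeroing the inherited counters in the child,
   assuming the client request above is the macro
   <computeroutput>CALLGRIND_ZERO_STATS</computeroutput> from
   <computeroutput>&lt;valgrind/callgrind.h&gt;</computeroutput> and a POSIX
   system. <computeroutput>run_in_child()</computeroutput> and
   <computeroutput>sample_work()</computeroutput> are hypothetical names;
   the no-op fallback only keeps the sketch buildable without the Valgrind
   headers.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_ZERO_STATS
#  define CALLGRIND_ZERO_STATS ((void)0)   /* fallback: no-op */
#endif

/* Hypothetical workload executed by the child process. */
int sample_work(void)
{
    return 7;
}

/* Forks, runs work() in the child with fresh profile counters, and
   returns the child's exit status (or -1 on failure). */
int run_in_child(int (*work)(void))
{
    assert(work != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        CALLGRIND_ZERO_STATS;      /* drop counters inherited from the parent */
        _exit(work() & 0xff);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```
]]></programlisting>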

   <para>You will be able to control the new child independently from
   the parent via <computeroutput>callgrind_control</computeroutput>.</para>

   </sect2>

 </sect1>


 <sect1 id="cl-manual.options" xreflabel="Command line option reference">
 <title>Command line option reference</title>

 <para>
 In the following, options are grouped into classes, in the same order as
 the output of <computeroutput>valgrind --tool=callgrind --help</computeroutput>.
 </para>
 <para>
 Some options allow the specification of a function/symbol name, such as
 <option><xref linkend="opt.dump-before"/>=function</option>, or
 <option><xref linkend="opt.fn-skip"/>=function</option>. All these options
 can be specified multiple times for different functions.
 In addition, the function specifications are actually patterns: they support
 the wildcards '*' (zero or more arbitrary characters) and '?'
 (exactly one arbitrary character), similar to file name globbing in the
 shell. This feature is especially important for C++, as without wildcards
 the function would have to be specified in full, including its
 parameter signature. </para>

 <sect2 id="cl-manual.options.misc"
        xreflabel="Miscellaneous options">
 <title>Miscellaneous options</title>

 <variablelist id="cl.opts.list.misc">

   <varlistentry>
     <term><option>--help</option></term>
     <listitem>
       <para>Show summary of options. This is a short version of this
       manual section.</para>
     </listitem>
   </varlistentry>

   <varlistentry>
     <term><option>--version</option></term>
     <listitem>
       <para>Show version of callgrind.</para>
     </listitem>
   </varlistentry>

 </variablelist>
 </sect2>

 <sect2 id="cl-manual.options.creation"
        xreflabel="Dump creation options">
 <title>Dump creation options</title>

 <para>
 These options influence the name and format of the profile data files.
 </para>

 <variablelist id="cl.opts.list.creation">

   <varlistentry id="opt.callgrind-out-file" xreflabel="--callgrind-out-file">
     <term>
       <option><![CDATA[--callgrind-out-file=<file> ]]></option>
     </term>
     <listitem>
       <para>Write the profile data to
             <computeroutput>file</computeroutput> rather than to the default
             output file,
             <computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>.  The
             <option>%p</option> and <option>%q</option> format specifiers
             can be used to embed the process ID and/or the contents of an
             environment variable in the name, as is the case for the core
             option <option>--log-file</option>.  See <link
             linkend="manual-core.basicopts">here</link> for details.
             When multiple dumps are made, the file name
             is modified further; see below.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.dump-instr" xreflabel="--dump-instr">
     <term>
       <option><![CDATA[--dump-instr=<no|yes> [default: no] ]]></option>
     </term>
     <listitem>
       <para>This specifies that event counting should be performed at
       per-instruction granularity.
       This allows for assembly code
       annotation.  Currently the results can only be
       displayed by KCachegrind.</para>
   </listitem>
   </varlistentry>

   <varlistentry id="opt.dump-line" xreflabel="--dump-line">
     <term>
       <option><![CDATA[--dump-line=<no|yes> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>This specifies that event counting should be performed at
       source line granularity. This allows source
       annotation for sources which are compiled with debug information ("-g").</para>
   </listitem>
   </varlistentry>

   <varlistentry id="opt.compress-strings" xreflabel="--compress-strings">
     <term>
       <option><![CDATA[--compress-strings=<no|yes> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>This option influences the output format of the profile data.
       It specifies whether strings (file and function names) should be
       identified by numbers. This shrinks the file,
       but makes it more difficult
       for humans to read (which is not recommended in any case).</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.compress-pos" xreflabel="--compress-pos">
     <term>
       <option><![CDATA[--compress-pos=<no|yes> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>This option influences the output format of the profile data.
       It specifies whether numerical positions are always specified as absolute
       values or are allowed to be relative to previous numbers.
       This shrinks the file size.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.combine-dumps" xreflabel="--combine-dumps">
     <term>
       <option><![CDATA[--combine-dumps=<no|yes> [default: no] ]]></option>
     </term>
     <listitem>
       <para>When multiple profile data parts are to be generated, these
       parts are appended to the same output file if this option is set to
       "yes". Not recommended.</para>
   </listitem>
   </varlistentry>

 </variablelist>
 </sect2>

 <sect2 id="cl-manual.options.activity"
        xreflabel="Activity options">
 <title>Activity options</title>

 <para>
 These options specify when actions relating to event counts are to
 be executed. For interactive control use
 <computeroutput>callgrind_control</computeroutput>.
 </para>

 <variablelist id="cl.opts.list.activity">

   <varlistentry id="opt.dump-every-bb" xreflabel="--dump-every-bb">
     <term>
       <option><![CDATA[--dump-every-bb=<count> [default: 0, never] ]]></option>
     </term>
     <listitem>
       <para>Dump profile data every &lt;count&gt; basic blocks.
       Whether a dump is needed is only checked when Valgrind's internal
       scheduler is run. Therefore, the minimum useful setting is about 100000.
       The count is a 64-bit value to make long dump periods possible.
       </para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.dump-before" xreflabel="--dump-before">
     <term>
       <option><![CDATA[--dump-before=<function> ]]></option>
     </term>
     <listitem>
       <para>Dump when entering &lt;function&gt;</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.zero-before" xreflabel="--zero-before">
     <term>
       <option><![CDATA[--zero-before=<function> ]]></option>
     </term>
     <listitem>
       <para>Zero all costs when entering &lt;function&gt;</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.dump-after" xreflabel="--dump-after">
     <term>
       <option><![CDATA[--dump-after=<function> ]]></option>
     </term>
     <listitem>
       <para>Dump when leaving &lt;function&gt;</para>
     </listitem>
   </varlistentry>

 </variablelist>
 </sect2>

 <sect2 id="cl-manual.options.collection"
        xreflabel="Data collection options">
 <title>Data collection options</title>

 <para>
 These options specify when events are to be aggregated into event counts.
 Also see <xref linkend="cl-manual.limits"/>.</para>

 <variablelist id="cl.opts.list.collection">

   <varlistentry id="opt.instr-atstart" xreflabel="--instr-atstart">
     <term>
       <option><![CDATA[--instr-atstart=<yes|no> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>Specify if you want Callgrind to start simulation and
       profiling from the beginning of the program.
       When set to <computeroutput>no</computeroutput>,
       Callgrind will not be able
       to collect any information, including calls, but it will have at
       most a slowdown of around 4, which is the minimum Valgrind
       overhead.  Instrumentation can be interactively switched on via
       <computeroutput>callgrind_control -i on</computeroutput>.</para>
       <para>Note that the resulting call graph will most probably not
       contain <computeroutput>main</computeroutput>, but will contain all the
       functions executed after instrumentation was switched on.
       Instrumentation can also be switched on/off programmatically. See the
       Callgrind include file
       <computeroutput>&lt;callgrind.h&gt;</computeroutput> for the macro
       you have to use in your source code.</para>
       <para>For cache
       simulation, results will be less accurate when switching on
       instrumentation later in the program run, as the simulator starts
       with an empty cache at that moment.  Switch on event collection
       later to cope with this error.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.collect-atstart" xreflabel="--collect-atstart">
     <term>
       <option><![CDATA[--collect-atstart=<yes|no> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>Specify whether event collection is switched on at the beginning
       of the profile run.</para>
       <para>To only look at parts of your program, you have two
       possibilities:</para>
       <orderedlist>
       <listitem>
         <para>Zero event counters before entering the program part you
         want to profile, and dump the event counters to a file after
         leaving that program part.</para>
         </listitem>
         <listitem>
           <para>Switch on/off collection state as needed to only see
           event counters happening while inside of the program part you
           want to profile.</para>
         </listitem>
       </orderedlist>
       <para>The second option can be used if the program part you want to
       profile is called many times. Option 1, i.e. creating a lot of
       dumps, is not practical in that case.</para>
       <para>Collection state can be
       toggled at entry and exit of a given function with the
       option <xref linkend="opt.toggle-collect"/>.  If you use this option,
       collection
       state should be switched off at the beginning.  Note that the
       specification of <computeroutput>--toggle-collect</computeroutput>
       implicitly sets
       <computeroutput>--collect-atstart=no</computeroutput>.</para>
       <para>Collection state can also be toggled by inserting the client request
       <computeroutput><xref linkend="cr.toggle-collect"/>;</computeroutput>
       at the needed code positions.</para>
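       <para>As a minimal sketch of the client request approach, a region
       can be bracketed by two toggle requests. The function names
       <computeroutput>sum_squares</computeroutput> and
       <computeroutput>profiled_sum_squares</computeroutput> are hypothetical,
       and the no-op fallback definition is an assumption of this sketch,
       added only so that the file also builds where the Callgrind header
       is not installed:</para>
<screen><![CDATA[
/* Use the real macro when the Callgrind header is available.
   __has_include is a compiler feature (e.g. recent GCC and Clang);
   the no-op fallback below is an assumption of this sketch and is
   not part of Callgrind. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_TOGGLE_COLLECT
#  define CALLGRIND_TOGGLE_COLLECT ((void)0)
#endif

/* Hypothetical program part we want to see in the profile. */
static long sum_squares(int n)
{
    long total = 0;
    for (int i = 1; i <= n; i++)
        total += (long)i * i;
    return total;
}

/* Hypothetical wrapper: assuming collection is off on entry, only the
   events between the two toggles are added to the counters. */
long profiled_sum_squares(int n)
{
    long result;

    CALLGRIND_TOGGLE_COLLECT;   /* collection: off -> on */
    result = sum_squares(n);
    CALLGRIND_TOGGLE_COLLECT;   /* collection: on -> off */
    return result;
}
]]></screen>
       <para>For the toggles to have the intended effect, the program
       would be started with
       <computeroutput>--collect-atstart=no</computeroutput>, so that
       collection is off when the first toggle is reached.</para>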
     </listitem>
   </varlistentry>

   <varlistentry id="opt.toggle-collect" xreflabel="--toggle-collect">
     <term>
       <option><![CDATA[--toggle-collect=<function> ]]></option>
     </term>
     <listitem>
       <para>Toggle collection on entry/exit of &lt;function&gt;.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.collect-jumps" xreflabel="--collect-jumps">
     <term>
       <option><![CDATA[--collect-jumps=<no|yes> [default: no] ]]></option>
     </term>
     <listitem>
       <para>This specifies whether information for (conditional) jumps
       should be collected.  As above, callgrind_annotate currently is not
       able to show you the data.  You have to use KCachegrind to get jump
       arrows in the annotated code.</para>
     </listitem>
   </varlistentry>

 </variablelist>
 </sect2>

 <sect2 id="cl-manual.options.separation"
        xreflabel="Cost entity separation options">
 <title>Cost entity separation options</title>

 <para>
 These options specify how event counts should be attributed to execution
 contexts.
 For example, they specify whether the recursion level or the
 call chain leading to a function should be taken into account,
 and whether the thread ID should be considered.
 Also see <xref linkend="cl-manual.cycles"/>.</para>

 <variablelist id="cmd-options.separation">

   <varlistentry id="opt.separate-threads" xreflabel="--separate-threads">
     <term>
       <option><![CDATA[--separate-threads=<no|yes> [default: no] ]]></option>
     </term>
     <listitem>
       <para>This option specifies whether profile data should be generated
       separately for every thread. If yes, the file names get "-threadID"
       appended.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.separate-recs" xreflabel="--separate-recs">
     <term>
       <option><![CDATA[--separate-recs=<level> [default: 2] ]]></option>
     </term>
     <listitem>
       <para>Separate function recursions by at most &lt;level&gt; levels.
       See <xref linkend="cl-manual.cycles"/>.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.separate-callers" xreflabel="--separate-callers">
     <term>
       <option><![CDATA[--separate-callers=<callers> [default: 0] ]]></option>
     </term>
     <listitem>
       <para>Separate contexts by at most &lt;callers&gt; functions in the
       call chain. See <xref linkend="cl-manual.cycles"/>.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.skip-plt" xreflabel="--skip-plt">
     <term>
       <option><![CDATA[--skip-plt=<no|yes> [default: yes] ]]></option>
     </term>
     <listitem>
       <para>Ignore calls to/from PLT sections.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.fn-skip" xreflabel="--fn-skip">
     <term>
       <option><![CDATA[--fn-skip=<function> ]]></option>
     </term>
     <listitem>
       <para>Ignore calls to/from a given function.  E.g. if you have a
       call chain A &gt; B &gt; C, and you specify function B to be
       ignored, you will only see A &gt; C.</para>
       <para>This is very convenient to skip functions handling callback
       behaviour.  For example, with the signal/slot mechanism in the
       Qt graphics library, you only want
       to see the function emitting a signal calling the slots connected
       to that signal. First, determine the real call chain to see which
       functions need to be skipped, then use this option.</para>
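       <para>The A &gt; B &gt; C situation can be sketched with a small
       callback dispatcher. All three function names are hypothetical and
       only serve as illustration:</para>
<screen><![CDATA[
/* C: the slot doing the real work (hypothetical). */
static int on_value_changed(int value)
{
    return value * 2;
}

/* B: a generic dispatcher that only forwards the call (hypothetical).
   This is the kind of function one may want to skip. */
static int dispatch(int (*slot)(int), int value)
{
    return slot(value);
}

/* A: the emitter, which reaches the slot only through the dispatcher. */
int emit_value_changed(int value)
{
    return dispatch(on_value_changed, value);
}
]]></screen>
       <para>With <computeroutput>--fn-skip=dispatch</computeroutput>, the
       call graph would be expected to show
       <computeroutput>emit_value_changed</computeroutput> calling
       <computeroutput>on_value_changed</computeroutput> directly. Note that
       an optimizing compiler may inline such small functions, in which case
       there is no call left to skip.</para>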
     </listitem>
   </varlistentry>

   <varlistentry id="opt.fn-group">
     <term>
       <option><![CDATA[--fn-group<number>=<function> ]]></option>
     </term>
     <listitem>
       <para>Put a function into a separate group. This influences the
       context name used for cycle avoidance. All functions inside such a
       group are treated as being the same when building the context name,
       which reflects the call chain leading to a context. By specifying function
       groups with this option, you can shorten the context name, as functions
       in the same group will not appear in sequence in the name.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.separate-recs-num" xreflabel="--separate-recs10">
     <term>
       <option><![CDATA[--separate-recs<number>=<function> ]]></option>
     </term>
     <listitem>
       <para>Separate &lt;number&gt; recursions for &lt;function&gt;.
       See <xref linkend="cl-manual.cycles"/>.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.separate-callers-num" xreflabel="--separate-callers2">
     <term>
       <option><![CDATA[--separate-callers<number>=<function> ]]></option>
     </term>
     <listitem>
       <para>Separate &lt;number&gt; callers for &lt;function&gt;.
       See <xref linkend="cl-manual.cycles"/>.</para>
     </listitem>
   </varlistentry>

 </variablelist>
 </sect2>

 <sect2 id="cl-manual.options.simulation"
        xreflabel="Cache simulation options">
 <title>Cache simulation options</title>

 <variablelist id="cl.opts.list.simulation">

   <varlistentry id="opt.simulate-cache" xreflabel="--simulate-cache">
     <term>
       <option><![CDATA[--simulate-cache=<yes|no> [default: no] ]]></option>
     </term>
     <listitem>
       <para>Specify if you want to do full cache simulation.  By default,
       only instruction read accesses will be profiled.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="opt.simulate-hwpref" xreflabel="--simulate-hwpref">
     <term>
       <option><![CDATA[--simulate-hwpref=<yes|no> [default: no] ]]></option>
     </term>
     <listitem>
       <para>Specify whether to add simulation of a hardware prefetcher
       that detects stream accesses in the second-level cache
       by comparing accesses separately for each page.
       As the simulation cannot decide about any timing issues of prefetching,
       it is assumed that any hardware prefetch triggered succeeds before a
       real access is done. Thus, this gives a best-case scenario by covering
       all possible stream accesses.</para>
     </listitem>
   </varlistentry>

 </variablelist>

 </sect2>

 </sect1>

 <sect1 id="cl-manual.clientrequests" xreflabel="Client request reference">
 <title>Callgrind specific client requests</title>

 <para>In Valgrind terminology, a client request is a C macro which
 can be inserted into your code to request specific functionality when
 run under Valgrind. For this, special instruction patterns are used
 which result in NOPs but can be detected by Valgrind.</para>

 <para>Callgrind provides the following specific client requests.
 To use them, add the line
 <screen><![CDATA[#include <valgrind/callgrind.h>]]></screen>
 to your code to get the macro definitions.</para>

 <variablelist id="cl.clientrequests.list">

   <varlistentry id="cr.dump-stats" xreflabel="CALLGRIND_DUMP_STATS">
     <term>
       <computeroutput>CALLGRIND_DUMP_STATS</computeroutput>
     </term>
     <listitem>
       <para>Force generation of a profile dump at the specified position
       in the code, for the current thread only. Written counters will be reset
       to zero.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="cr.dump-stats-at" xreflabel="CALLGRIND_DUMP_STATS_AT">
     <term>
       <computeroutput>CALLGRIND_DUMP_STATS_AT(string)</computeroutput>
     </term>
     <listitem>
       <para>Same as CALLGRIND_DUMP_STATS, but lets you specify a string
       by which profile dumps can be distinguished.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="cr.zero-stats" xreflabel="CALLGRIND_ZERO_STATS">
     <term>
       <computeroutput>CALLGRIND_ZERO_STATS</computeroutput>
     </term>
     <listitem>
       <para>Reset the profile counters for the current thread to zero.</para>
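       <para>Together with CALLGRIND_DUMP_STATS_AT, this request supports
       the zero-then-dump way of profiling a single program part. The
       following is a minimal sketch only: the function names
       <computeroutput>count_bits</computeroutput> and
       <computeroutput>profile_one_phase</computeroutput> and the label
       string are hypothetical, and the no-op fallbacks are an assumption
       added so that the file also builds without the Callgrind header:</para>
<screen><![CDATA[
/* Use the real macros when the Callgrind header is available.
   The no-op fallbacks are an assumption of this sketch and are
   not part of Callgrind. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_ZERO_STATS
#  define CALLGRIND_ZERO_STATS ((void)0)
#endif
#ifndef CALLGRIND_DUMP_STATS_AT
#  define CALLGRIND_DUMP_STATS_AT(pos_str) ((void)0)
#endif

/* Hypothetical workload: count the bits set in x. */
static int count_bits(unsigned x)
{
    int n = 0;
    while (x != 0) {
        n += (int)(x & 1u);
        x >>= 1;
    }
    return n;
}

/* Hypothetical wrapper: discard the costs gathered so far, run the
   part of interest, then force a dump labelled with a string. */
int profile_one_phase(unsigned input)
{
    int bits;

    CALLGRIND_ZERO_STATS;                        /* counters back to zero */
    bits = count_bits(input);
    CALLGRIND_DUMP_STATS_AT("count_bits_done");  /* labelled profile dump */
    return bits;
}
]]></screen>
       <para>Each call of the wrapper produces one dump, so this pattern
       suits program parts that run only a few times.</para>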
     </listitem>
   </varlistentry>

   <varlistentry id="cr.toggle-collect" xreflabel="CALLGRIND_TOGGLE_COLLECT">
     <term>
       <computeroutput>CALLGRIND_TOGGLE_COLLECT</computeroutput>
     </term>
     <listitem>
       <para>Toggle the collection state. This lets you exclude events
       from the profile counters. See also options
       <xref linkend="opt.collect-atstart"/> and
       <xref linkend="opt.toggle-collect"/>.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="cr.start-instr" xreflabel="CALLGRIND_START_INSTRUMENTATION">
     <term>
       <computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>
     </term>
     <listitem>
       <para>Start full Callgrind instrumentation if not already switched on.
       When cache simulation is done, this will flush the simulated cache
       and lead to an artificial cache warmup phase afterwards with
       cache misses which would not have happened in reality.
       See also option <xref linkend="opt.instr-atstart"/>.</para>
     </listitem>
   </varlistentry>

   <varlistentry id="cr.stop-instr" xreflabel="CALLGRIND_STOP_INSTRUMENTATION">
     <term>
       <computeroutput>CALLGRIND_STOP_INSTRUMENTATION</computeroutput>
     </term>
     <listitem>
       <para>Stop full Callgrind instrumentation if not already switched off.
       This flushes Valgrind's translation cache, and does no additional
       instrumentation afterwards: it effectively will run at the same
       speed as the "none" tool, i.e. at minimal slowdown. Use this to
       speed up the Callgrind run for uninteresting code parts. Use
       <xref linkend="cr.start-instr"/> to switch on instrumentation again.
       See also option <xref linkend="opt.instr-atstart"/>.</para>
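       <para>A minimal sketch of restricting instrumentation to one code
       part follows. The function names
       <computeroutput>fib</computeroutput> and
       <computeroutput>run_instrumented</computeroutput> are hypothetical,
       and the no-op fallbacks are an assumption added so that the file
       also builds without the Callgrind header:</para>
<screen><![CDATA[
/* Use the real macros when the Callgrind header is available.
   The no-op fallbacks are an assumption of this sketch and are
   not part of Callgrind. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#endif
#ifndef CALLGRIND_STOP_INSTRUMENTATION
#  define CALLGRIND_STOP_INSTRUMENTATION ((void)0)
#endif

/* Hypothetical code part of interest: iterative Fibonacci. */
static long fib(int n)
{
    long a = 0, b = 1;
    for (int i = 0; i < n; i++) {
        long next = a + b;
        a = b;
        b = next;
    }
    return a;
}

/* Hypothetical wrapper: everything outside the two requests runs
   without Callgrind instrumentation. */
long run_instrumented(int n)
{
    long result;

    CALLGRIND_START_INSTRUMENTATION;  /* switch instrumentation on */
    result = fib(n);
    CALLGRIND_STOP_INSTRUMENTATION;   /* switch it off again */
    return result;
}
]]></screen>
       <para>This is intended to be combined with
       <computeroutput>--instr-atstart=no</computeroutput>, so that the
       program runs uninstrumented until the first request is reached.
       With cache simulation, keep in mind the artificial warmup
       phase described for
       <computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>.</para>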
     </listitem>
   </varlistentry>

 </variablelist>

 </sect1>

 </chapter>