========================
ftrace - Function Tracer
========================
				4
				5	Copyright 2008 Red Hat Inc.
				6
				7	:Author: Steven Rostedt <srostedt@redhat.com>
				8	:License: The GNU Free Documentation License, Version 1.2
				9	(dual licensed under the GPL v2)
				10	:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
				11	John Kacur, and David Teigland.
				12
				13	- Written for: 2.6.28-rc2
				14	- Updated for: 3.10
				15	- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
				16	- Converted to rst format - Changbin Du <changbin.du@intel.com>
				17
				18	Introduction
				19	------------
				20
				21	Ftrace is an internal tracer designed to help out developers and
				22	designers of systems to find what is going on inside the kernel.
				23	It can be used for debugging or analyzing latencies and
				24	performance issues that take place outside of user-space.
				25
Although ftrace is typically considered the function tracer, it
is really a framework of several assorted tracing utilities.
There's latency tracing to examine what occurs between interrupts
being disabled and enabled, as well as for preemption, and from the
time a task is woken to the time it is actually scheduled in.
				31
One of the most common uses of ftrace is event tracing.
Throughout the kernel are hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
				36
				37	See events.txt for more information.
				38
				39
				40	Implementation Details
				41	----------------------
				42
				43	See :doc:`ftrace-design` for details for arch porters and such.
				44
				45
				46	The File System
				47	---------------
				48
				49	Ftrace uses the tracefs file system to hold the control files as
				50	well as the files to display output.
				51
				52	When tracefs is configured into the kernel (which selecting any ftrace
				53	option will do) the directory /sys/kernel/tracing will be created. To mount
				54	this directory, you can add to your /etc/fstab file::
				55
				56	tracefs /sys/kernel/tracing tracefs defaults 0 0
				57
				58	Or you can mount it at run time with::
				59
				60	mount -t tracefs nodev /sys/kernel/tracing
				61
				62	For quicker access to that directory you may want to make a soft link to
				63	it::
				64
				65	ln -s /sys/kernel/tracing /tracing
				66
				67	.. attention::
				68
				69	Before 4.1, all ftrace tracing control files were within the debugfs
				70	file system, which is typically located at /sys/kernel/debug/tracing.
				71	For backward compatibility, when mounting the debugfs file system,
				72	the tracefs file system will be automatically mounted at:
				73
				74	/sys/kernel/debug/tracing
				75
				76	All files located in the tracefs file system will be located in that
				77	debugfs file system directory as well.
				78
				79	.. attention::
				80
				81	Any selected ftrace option will also create the tracefs file system.
				82	The rest of the document will assume that you are in the ftrace directory
				83	(cd /sys/kernel/tracing) and will only concentrate on the files within that
				84	directory and not distract from the content with the extended
				85	"/sys/kernel/tracing" path name.
				86
				87	That's it! (assuming that you have ftrace configured into your kernel)
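To check which of the mount points above is active on a given system, the
mount table can be filtered by file system type. The following is a minimal
sketch, not part of ftrace itself: the helper name "tracefs_mounts" is made
up for illustration, and it assumes the usual /proc/mounts layout (device,
mount point, type, options).

```shell
# Print the mount point of every tracefs instance found in a mounts table.
# $1 is the table to read; it defaults to /proc/mounts.
# "tracefs_mounts" is a hypothetical helper name.
tracefs_mounts() {
	# Fields: device, mount point, file system type, options, ...
	awk '$3 == "tracefs" { print $2 }' "${1:-/proc/mounts}"
}
```

Calling "tracefs_mounts" with no argument reads /proc/mounts. An empty result
suggests that tracefs has not been mounted yet.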
				88
				89	After mounting tracefs you will have access to the control and output files
				90	of ftrace. Here is a list of some of the key files:
				91
				92
				93	Note: all time values are in microseconds.
				94
				95	current_tracer:
				96
				97	This is used to set or display the current tracer
				98	that is configured.
				99
				100	available_tracers:
				101
				102	This holds the different types of tracers that
				103	have been compiled into the kernel. The
				104	tracers listed here can be configured by
				105	echoing their name into current_tracer.
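Taken together, current_tracer and available_tracers allow a script to select
a tracer only after confirming that the kernel lists it. The sketch below
shows one way to do that. The helper name "set_tracer" and the TRACEFS
variable are assumptions made for this example; TRACEFS defaults to
/sys/kernel/tracing.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Select a tracer, but only if it appears in available_tracers.
# "set_tracer" is a hypothetical helper name.
set_tracer() {
	# available_tracers holds space-separated tracer names.
	for t in $(cat "$TRACEFS/available_tracers"); do
		if [ "$t" = "$1" ]; then
			echo "$1" > "$TRACEFS/current_tracer"
			return 0
		fi
	done
	echo "tracer '$1' is not listed in available_tracers" >&2
	return 1
}
```

For example, "set_tracer function" would select the function tracer if it is
compiled in, and "set_tracer nop" would go back to tracing nothing.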
				106
				107	tracing_on:
				108
				109	This sets or displays whether writing to the trace
				110	ring buffer is enabled. Echo 0 into this file to disable
				111	the tracer or 1 to enable it. Note, this only disables
				112	writing to the ring buffer, the tracing overhead may
				113	still be occurring.
				114
				115	The kernel function tracing_off() can be used within the
				116	kernel to disable writing to the ring buffer, which will
				117	set this file to "0". User space can re-enable tracing by
				118	echoing "1" into the file.
				119
Note, the function and event trigger "traceoff" will also
set this file to zero and stop tracing. Tracing can then
be re-enabled by user space using this file.
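A common pattern with tracing_on is to record only while a workload of
interest runs. The following sketch illustrates it under assumptions: the
helper name "trace_window" and the TRACEFS variable are hypothetical, and the
tracer is expected to have been chosen beforehand.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Record into the ring buffer only while the given command runs.
# "trace_window" is a hypothetical helper name.
trace_window() {
	echo 1 > "$TRACEFS/tracing_on"   # allow writes to the ring buffer
	"$@"                             # run the workload
	status=$?
	echo 0 > "$TRACEFS/tracing_on"   # stop writes; tracer overhead may remain
	return "$status"
}
```

A call such as "trace_window ls /tmp" returns the exit status of the command,
and the recorded data can be read from "trace" afterwards.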
				123
				124	trace:
				125
				126	This file holds the output of the trace in a human
				127	readable format (described below). Note, tracing is temporarily
				128	disabled while this file is being read (opened).
				129
				130	trace_pipe:
				131
				132	The output is the same as the "trace" file but this
				133	file is meant to be streamed with live tracing.
				134	Reads from this file will block until new data is
				135	retrieved. Unlike the "trace" file, this file is a
				136	consumer. This means reading from this file causes
				137	sequential reads to display more current data. Once
				138	data is read from this file, it is consumed, and
				139	will not be read again with a sequential read. The
				140	"trace" file is static, and if the tracer is not
				141	adding more data, it will display the same
				142	information every time it is read. This file will not
				143	disable tracing while being read.
				144
				145	trace_options:
				146
				147	This file lets the user control the amount of data
				148	that is displayed in one of the above output
				149	files. Options also exist to modify how a tracer
				150	or events work (stack traces, timestamps, etc).
				151
				152	options:
				153
				154	This is a directory that has a file for every available
				155	trace option (also in trace_options). Options may also be set
				156	or cleared by writing a "1" or "0" respectively into the
				157	corresponding file with the option name.
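Because every option has its own file, a script can verify that an option
exists before changing it. This is a small sketch of that idea; the helper
name "set_trace_option" and the TRACEFS variable are assumptions for
illustration only.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Set (1) or clear (0) a single trace option through options/<name>.
# "set_trace_option" is a hypothetical helper name.
set_trace_option() {
	opt_file="$TRACEFS/options/$1"
	if [ ! -e "$opt_file" ]; then
		echo "unknown trace option: $1" >&2
		return 1
	fi
	case "$2" in
	0|1) echo "$2" > "$opt_file" ;;
	*)   echo "value must be 0 or 1" >&2; return 1 ;;
	esac
}
```

For instance, "set_trace_option record-cmd 0" would clear the "record-cmd"
option.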
				158
				159	tracing_max_latency:
				160
				161	Some of the tracers record the max latency.
				162	For example, the maximum time that interrupts are disabled.
				163	The maximum time is saved in this file. The max trace will also be
				164	stored, and displayed by "trace". A new max trace will only be
				165	recorded if the latency is greater than the value in this file
				166	(in microseconds).
				167
				168	By echoing in a time into this file, no latency will be recorded
				169	unless it is greater than the time in this file.
				170
				171	tracing_thresh:
				172
				173	Some latency tracers will record a trace whenever the
				174	latency is greater than the number in this file.
				175	Only active when the file contains a number greater than 0.
				176	(in microseconds)
				177
				178	buffer_size_kb:
				179
				180	This sets or displays the number of kilobytes each CPU
				181	buffer holds. By default, the trace buffers are the same size
				182	for each CPU. The displayed number is the size of the
				183	CPU buffer and not total size of all buffers. The
				184	trace buffers are allocated in pages (blocks of memory
				185	that the kernel uses for allocation, usually 4 KB in size).
				186	If the last page allocated has room for more bytes
				187	than requested, the rest of the page will be used,
				188	making the actual allocation bigger than requested or shown.
				189	( Note, the size may not be a multiple of the page size
				190	due to buffer management meta-data. )
				191
				192	Buffer sizes for individual CPUs may vary
				193	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
				194	this file will show "X".
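A script that reads buffer_size_kb has to cope with the "X" case described
above. The sketch below is one possible way to handle it; the helper name
"percpu_buffer_kb" and the TRACEFS variable are made up for this example.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Print the per-CPU buffer size in KB, or fail if the CPUs differ ("X").
# "percpu_buffer_kb" is a hypothetical helper name.
percpu_buffer_kb() {
	# Keep only the first word in case extra text follows the number.
	read -r size_kb _rest < "$TRACEFS/buffer_size_kb"
	if [ "$size_kb" = "X" ]; then
		echo "buffer sizes differ, see per_cpu/cpu*/buffer_size_kb" >&2
		return 1
	fi
	echo "$size_kb"
}
```

Resizing is the reverse operation: echoing a number of kilobytes into
buffer_size_kb sets the size of every per-CPU buffer.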
				195
				196	buffer_total_size_kb:
				197
				198	This displays the total combined size of all the trace buffers.
				199
				200	free_buffer:
				201
If a process is performing tracing, and the ring buffer should be
shrunk ("freed") when the process is finished, even if it were to be
killed by a signal, this file can be used for that purpose. On close
of this file, the ring buffer will be resized to its minimum size.
If the tracing process also opens this file, then when the process
exits, its file descriptor for this file will be closed, and in doing so,
the ring buffer will be "freed".
				209
				210	It may also stop tracing if disable_on_free option is set.
				211
				212	tracing_cpumask:
				213
				214	This is a mask that lets the user only trace on specified CPUs.
				215	The format is a hex string representing the CPUs.
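As a worked example of the mask format: bit N of the mask stands for CPU N,
so tracing only CPUs 0 and 2 means binary 101, which is hex 5. The sketch
below computes such a mask from a list of CPU numbers. The helper name
"cpus_to_mask" is hypothetical, and shell arithmetic restricts this sketch to
small CPU numbers.

```shell
# Build the hex mask for a list of CPU numbers (bit N set = trace CPU N).
# "cpus_to_mask" is a hypothetical helper name. Shell integer width limits
# this sketch to CPU numbers below 63.
cpus_to_mask() {
	mask=0
	for cpu in "$@"; do
		mask=$(( mask | (1 << cpu) ))
	done
	printf '%x\n' "$mask"
}
```

The result can be written straight into the file, for example
"cpus_to_mask 0 2 > tracing_cpumask".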
				216
				217	set_ftrace_filter:
				218
				219	When dynamic ftrace is configured in (see the
				220	section below "dynamic ftrace"), the code is dynamically
				221	modified (code text rewrite) to disable calling of the
				222	function profiler (mcount). This lets tracing be configured
				223	in with practically no overhead in performance. This also
				224	has a side effect of enabling or disabling specific functions
				225	to be traced. Echoing names of functions into this file
				226	will limit the trace to only those functions.
				227
				228	The functions listed in "available_filter_functions" are what
				229	can be written into this file.
				230
				231	This interface also allows for commands to be used. See the
				232	"Filter commands" section for more details.
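Since only names from "available_filter_functions" are accepted, a script can
check a name first and report a clear error otherwise. The following sketch
does that. The helper name "add_ftrace_filter" and the TRACEFS variable are
assumptions for illustration; the sketch appends with ">>" so that an
existing filter is kept.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Append one function to set_ftrace_filter if ftrace is able to trace it.
# "add_ftrace_filter" is a hypothetical helper name.
add_ftrace_filter() {
	# Compare against the first word of each line, since a line may
	# carry extra text after the function name.
	if awk -v fn="$1" '$1 == fn { found = 1 } END { exit !found }' \
		"$TRACEFS/available_filter_functions"; then
		echo "$1" >> "$TRACEFS/set_ftrace_filter"
	else
		echo "$1 is not in available_filter_functions" >&2
		return 1
	fi
}
```

For example, "add_ftrace_filter vfs_read" would limit the function tracer to
vfs_read plus whatever was already in the filter.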
				233
				234	set_ftrace_notrace:
				235
				236	This has an effect opposite to that of
				237	set_ftrace_filter. Any function that is added here will not
				238	be traced. If a function exists in both set_ftrace_filter
				239	and set_ftrace_notrace, the function will _not_ be traced.
				240
				241	set_ftrace_pid:
				242
Have the function tracer only trace the threads whose PIDs are
listed in this file.
				245
				246	If the "function-fork" option is set, then when a task whose
				247	PID is listed in this file forks, the child's PID will
				248	automatically be added to this file, and the child will be
				249	traced by the function tracer as well. This option will also
				250	cause PIDs of tasks that exit to be removed from the file.
				251
				252	set_event_pid:
				253
Have the events only trace a task with a PID listed in this file.
Note, the sched_switch and sched_wakeup events will also be traced
when they involve a task whose PID is listed in this file.
				257
				258	To have the PIDs of children of tasks with their PID in this file
				259	added on fork, enable the "event-fork" option. That option will also
				260	cause the PIDs of tasks to be removed from this file when the task
				261	exits.
				262
				263	set_graph_function:
				264
				265	Functions listed in this file will cause the function graph
				266	tracer to only trace these functions and the functions that
				267	they call. (See the section "dynamic ftrace" for more details).
				268
				269	set_graph_notrace:
				270
				271	Similar to set_graph_function, but will disable function graph
				272	tracing when the function is hit until it exits the function.
				273	This makes it possible to ignore tracing functions that are called
				274	by a specific function.
				275
				276	available_filter_functions:
				277
				278	This lists the functions that ftrace has processed and can trace.
				279	These are the function names that you can pass to
				280	"set_ftrace_filter" or "set_ftrace_notrace".
				281	(See the section "dynamic ftrace" below for more details.)
				282
				283	dyn_ftrace_total_info:
				284
This file is for debugging purposes. It shows the number of functions
that have been converted to nops and are available to be traced.
				287
				288	enabled_functions:
				289
				290	This file is more for debugging ftrace, but can also be useful
				291	in seeing if any function has a callback attached to it.
				292	Not only does the trace infrastructure use ftrace function
				293	trace utility, but other subsystems might too. This file
				294	displays all functions that have a callback attached to them
				295	as well as the number of callbacks that have been attached.
				296	Note, a callback may also call multiple functions which will
				297	not be listed in this count.
				298
If a callback with the "save regs" attribute is registered to trace
a function (thus even more overhead), an 'R' will be displayed on
the same line as the function that is returning registers.

If a callback with the "ip modify" attribute is registered to trace
a function (thus the regs->ip can be changed), an 'I' will be
displayed on the same line as the function that can be overridden.
				308
				309	If the architecture supports it, it will also show what callback
				310	is being directly called by the function. If the count is greater
				311	than 1 it most likely will be ftrace_ops_list_func().
				312
If the callback of the function jumps to a trampoline that is
specific to the callback and not the standard trampoline,
its address will be printed as well as the function that the
trampoline calls.
				317
				318	function_profile_enabled:
				319
				320	When set it will enable all functions with either the function
				321	tracer, or if configured, the function graph tracer. It will
				322	keep a histogram of the number of functions that were called
				323	and if the function graph tracer was configured, it will also keep
				324	track of the time spent in those functions. The histogram
				325	content can be displayed in the files:
				326
trace_stat/function<cpu> ( function0, function1, etc).

trace_stat:
				330
				331	A directory that holds different tracing stats.
				332
				333	kprobe_events:
				334
				335	Enable dynamic trace points. See kprobetrace.txt.
				336
				337	kprobe_profile:
				338
				339	Dynamic trace points stats. See kprobetrace.txt.
				340
				341	max_graph_depth:
				342
				343	Used with the function graph tracer. This is the max depth
				344	it will trace into a function. Setting this to a value of
				345	one will show only the first kernel function that is called
				346	from user space.
				347
				348	printk_formats:
				349
				350	This is for tools that read the raw format files. If an event in
				351	the ring buffer references a string, only a pointer to the string
				352	is recorded into the buffer and not the string itself. This prevents
				353	tools from knowing what that string was. This file displays the string
				354	and address for the string allowing tools to map the pointers to what
				355	the strings were.
				356
				357	saved_cmdlines:
				358
				359	Only the pid of the task is recorded in a trace event unless
				360	the event specifically saves the task comm as well. Ftrace
				361	makes a cache of pid mappings to comms to try to display
				362	comms for events. If a pid for a comm is not listed, then
				363	"<...>" is displayed in the output.
				364
				365	If the option "record-cmd" is set to "0", then comms of tasks
				366	will not be saved during recording. By default, it is enabled.
				367
				368	saved_cmdlines_size:
				369
By default, 128 comms are saved (see "saved_cmdlines" above). To
increase or decrease the number of comms that are cached, echo
the number of comms to cache into this file.
				373
				374	saved_tgids:
				375
If the option "record-tgid" is set, on each scheduling context switch
the Thread Group ID of a task is saved in a table mapping the PID of
the thread to its TGID. By default, the "record-tgid" option is
disabled.
				380
				381	snapshot:
				382
				383	This displays the "snapshot" buffer and also lets the user
				384	take a snapshot of the current running trace.
				385	See the "Snapshot" section below for more details.
				386
				387	stack_max_size:
				388
				389	When the stack tracer is activated, this will display the
				390	maximum stack size it has encountered.
				391	See the "Stack Trace" section below.
				392
				393	stack_trace:
				394
				395	This displays the stack back trace of the largest stack
				396	that was encountered when the stack tracer is activated.
				397	See the "Stack Trace" section below.
				398
				399	stack_trace_filter:
				400
				401	This is similar to "set_ftrace_filter" but it limits what
				402	functions the stack tracer will check.
				403
				404	trace_clock:
				405
				406	Whenever an event is recorded into the ring buffer, a
				407	"timestamp" is added. This stamp comes from a specified
				408	clock. By default, ftrace uses the "local" clock. This
				409	clock is very fast and strictly per cpu, but on some
				410	systems it may not be monotonic with respect to other
				411	CPUs. In other words, the local clocks may not be in sync
				412	with local clocks on other CPUs.
				413
				414	Usual clocks for tracing::
				415
				416	# cat trace_clock
				417	[local] global counter x86-tsc
				418
				419	The clock with the square brackets around it is the one in effect.
				420
				421	local:
				422	Default clock, but may not be in sync across CPUs
				423
				424	global:
				425	This clock is in sync with all CPUs but may
				426	be a bit slower than the local clock.
				427
				428	counter:
				429	This is not a clock at all, but literally an atomic
				430	counter. It counts up one by one, but is in sync
				431	with all CPUs. This is useful when you need to
				432	know exactly the order events occurred with respect to
				433	each other on different CPUs.
				434
				435	uptime:
				436	This uses the jiffies counter and the time stamp
				437	is relative to the time since boot up.
				438
				439	perf:
				440	This makes ftrace use the same clock that perf uses.
				441	Eventually perf will be able to read ftrace buffers
				442	and this will help out in interleaving the data.
				443
				444	x86-tsc:
				445	Architectures may define their own clocks. For
				446	example, x86 uses its own TSC cycle clock here.
				447
				448	ppc-tb:
				449	This uses the powerpc timebase register value.
				450	This is in sync across CPUs and can also be used
				451	to correlate events across hypervisor/guest if
				452	tb_offset is known.
				453
				454	mono:
				455	This uses the fast monotonic clock (CLOCK_MONOTONIC)
				456	which is monotonic and is subject to NTP rate adjustments.
				457
				458	mono_raw:
This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
which is monotonic but is not subject to any rate adjustments
and ticks at the same rate as the hardware clocksource.
				462
				463	boot:
				464	This is the boot clock (CLOCK_BOOTTIME) and is based on the
				465	fast monotonic clock, but also accounts for time spent in
suspend. Since the clock access is designed for use in
tracing in the suspend path, some side effects are possible
if the clock is accessed after the suspend time is accounted for
but before the fast mono clock is updated. In this case, the clock
update appears to happen slightly sooner than it normally would have.
				471	Also on 32-bit systems, it's possible that the 64-bit boot offset
				472	sees a partial update. These effects are rare and post
				473	processing should be able to handle them. See comments in the
				474	ktime_get_boot_fast_ns() function for more information.
				475
				476	To set a clock, simply echo the clock name into this file::
				477
				478	echo global > trace_clock
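To find out from a script which clock is in effect, the bracketed name can be
extracted from the file. This is a minimal sketch; the helper name
"current_trace_clock" and the TRACEFS variable are hypothetical.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Print the clock in effect, that is, the name shown in square brackets.
# "current_trace_clock" is a hypothetical helper name.
current_trace_clock() {
	sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$TRACEFS/trace_clock"
}
```

With the example listing above, the function would print "local".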
				479
				480	trace_marker:
				481
This is a very useful file for synchronizing user space
with events happening in the kernel. Strings written into
this file will be written into the ftrace buffer.
				485
				486	It is useful in applications to open this file at the start
				487	of the application and just reference the file descriptor
				488	for the file::
				489
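  #include <fcntl.h>
  #include <stdarg.h>
  #include <stdio.h>
  #include <unistd.h>

  static int trace_fd = -1;

  void trace_write(const char *fmt, ...)
  {
          va_list ap;
          char buf[256];
          int n;

          if (trace_fd < 0)
                  return;

          va_start(ap, fmt);
          n = vsnprintf(buf, sizeof(buf), fmt, ap);
          va_end(ap);

          /* vsnprintf() returns the untruncated length; clamp to buf */
          if (n < 0)
                  return;
          if (n >= (int)sizeof(buf))
                  n = sizeof(buf) - 1;

          write(trace_fd, buf, n);
  }

start::

  trace_fd = open("trace_marker", O_WRONLY);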
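The same file can be written from a shell, which is handy for marking phases
of a test script. The sketch below assumes a hypothetical helper name
"trace_mark" and a TRACEFS variable that points at the tracefs mount.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Write all arguments as one line into trace_marker.
# "trace_mark" is a hypothetical helper name.
trace_mark() {
	printf '%s\n' "$*" > "$TRACEFS/trace_marker"
}
```

A script could call "trace_mark starting phase 2" right before the step it
wants to locate in the trace output.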
				509
				510	trace_marker_raw:
				511
This is similar to trace_marker above, but is meant for binary data
to be written to it, where a tool can be used to parse the data
from trace_pipe_raw.
				515
				516	uprobe_events:
				517
				518	Add dynamic tracepoints in programs.
				519	See uprobetracer.txt
				520
				521	uprobe_profile:
				522
Uprobe statistics. See uprobetracer.txt
				524
				525	instances:
				526
				527	This is a way to make multiple trace buffers where different
				528	events can be recorded in different buffers.
				529	See "Instances" section below.
				530
				531	events:
				532
				533	This is the trace event directory. It holds event tracepoints
				534	(also known as static tracepoints) that have been compiled
				535	into the kernel. It shows what event tracepoints exist
				536	and how they are grouped by system. There are "enable"
				537	files at various levels that can enable the tracepoints
				538	when a "1" is written to them.
				539
				540	See events.txt for more information.
				541
				542	set_event:
				543
Echoing an event name into this file enables that event.
				545
				546	See events.txt for more information.
				547
				548	available_events:
				549
				550	A list of events that can be enabled in tracing.
				551
				552	See events.txt for more information.
				553
				554	hwlat_detector:
				555
				556	Directory for the Hardware Latency Detector.
				557	See "Hardware Latency Detector" section below.
				558
				559	per_cpu:
				560
				561	This is a directory that contains the trace per_cpu information.
				562
				563	per_cpu/cpu0/buffer_size_kb:
				564
The ftrace buffer is defined per_cpu. That is, there's a separate
buffer for each CPU to allow writes to be done atomically,
and free from cache bouncing. These buffers may have different
sizes. This file is similar to the buffer_size_kb
file, but it only displays or sets the buffer size for the
specific CPU (here cpu0).
				571
				572	per_cpu/cpu0/trace:
				573
				574	This is similar to the "trace" file, but it will only display
				575	the data specific for the CPU. If written to, it only clears
				576	the specific CPU buffer.
				577
per_cpu/cpu0/trace_pipe:
				579
				580	This is similar to the "trace_pipe" file, and is a consuming
				581	read, but it will only display (and consume) the data specific
				582	for the CPU.
				583
per_cpu/cpu0/trace_pipe_raw:
				585
				586	For tools that can parse the ftrace ring buffer binary format,
				587	the trace_pipe_raw file can be used to extract the data
				588	from the ring buffer directly. With the use of the splice()
				589	system call, the buffer data can be quickly transferred to
				590	a file or to the network where a server is collecting the
				591	data.
				592
				593	Like trace_pipe, this is a consuming reader, where multiple
				594	reads will always produce different data.
				595
				596	per_cpu/cpu0/snapshot:
				597
				598	This is similar to the main "snapshot" file, but will only
				599	snapshot the current CPU (if supported). It only displays
				600	the content of the snapshot for a given CPU, and if
				601	written to, only clears this CPU buffer.
				602
				603	per_cpu/cpu0/snapshot_raw:
				604
				605	Similar to the trace_pipe_raw, but will read the binary format
				606	from the snapshot buffer for the given CPU.
				607
				608	per_cpu/cpu0/stats:
				609
				610	This displays certain stats about the ring buffer:
				611
				612	entries:
				613	The number of events that are still in the buffer.
				614
				615	overrun:
				616	The number of lost events due to overwriting when
				617	the buffer was full.
				618
				619	commit overrun:
				620	Should always be zero.
				621	This gets set if so many events happened within a nested
				622	event (ring buffer is re-entrant), that it fills the
				623	buffer and starts dropping events.
				624
				625	bytes:
				626	Bytes actually read (not overwritten).
				627
				628	oldest event ts:
				629	The oldest timestamp in the buffer
				630
				631	now ts:
				632	The current timestamp
				633
				634	dropped events:
				635	Events lost due to overwrite option being off.
				636
				637	read events:
				638	The number of events read.
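The per-CPU stats files lend themselves to quick checks, such as whether any
events were overwritten during a run. The sketch below sums the "overrun"
counter across all CPUs. The helper name "total_overrun" and the TRACEFS
variable are assumptions, as is the "name: value" line layout of the stats
file.

```shell
# Directory where tracefs is mounted (assumed default, override if needed).
: "${TRACEFS:=/sys/kernel/tracing}"

# Sum the "overrun" counter over every per_cpu/cpuN/stats file.
# "total_overrun" is a hypothetical helper name.
total_overrun() {
	# Only lines whose first word is "overrun:" are counted, which
	# leaves out the separate "commit overrun:" lines.
	awk '$1 == "overrun:" { sum += $2 } END { print sum + 0 }' \
		"$TRACEFS"/per_cpu/cpu*/stats
}
```

A non-zero total suggests that the buffers were too small for the workload
or that the trace was not read quickly enough.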
				639
				640	The Tracers
				641	-----------
				642
				643	Here is the list of current tracers that may be configured.
				644
				645	"function"
				646
				647	Function call tracer to trace all kernel functions.
				648
				649	"function_graph"
				650
				651	Similar to the function tracer except that the
				652	function tracer probes the functions on their entry
				653	whereas the function graph tracer traces on both entry
				654	and exit of the functions. It then provides the ability
				655	to draw a graph of function calls similar to C code
				656	source.
				657
				658	"blk"
				659
				660	The block tracer. The tracer used by the blktrace user
				661	application.
				662
				663	"hwlat"
				664
				665	The Hardware Latency tracer is used to detect if the hardware
				666	produces any latency. See "Hardware Latency Detector" section
				667	below.
				668
				669	"irqsoff"
				670
				671	Traces the areas that disable interrupts and saves
				672	the trace with the longest max latency.
				673	See tracing_max_latency. When a new max is recorded,
				674	it replaces the old trace. It is best to view this
				675	trace with the latency-format option enabled, which
				676	happens automatically when the tracer is selected.
				677
				678	"preemptoff"
				679
				680	Similar to irqsoff but traces and records the amount of
				681	time for which preemption is disabled.
				682
				683	"preemptirqsoff"
				684
				685	Similar to irqsoff and preemptoff, but traces and
				686	records the largest time for which irqs and/or preemption
				687	is disabled.
				688
				689	"wakeup"
				690
				691	Traces and records the max latency that it takes for
				692	the highest priority task to get scheduled after
				693	it has been woken up.
				694	Traces all tasks as an average developer would expect.
				695
				696	"wakeup_rt"
				697
				698	Traces and records the max latency that it takes for just
				699	RT tasks (as the current "wakeup" does). This is useful
				700	for those interested in wake up timings of RT tasks.
				701
				702	"wakeup_dl"
				703
Traces and records the max latency that it takes for
a SCHED_DEADLINE task to be woken (as the "wakeup" and
"wakeup_rt" tracers do).
				707
				708	"mmiotrace"
				709
A special tracer that is used to trace binary modules.
It will trace all the calls that a module makes to the
hardware, as well as everything it writes to and reads
from I/O.
				714
				715	"branch"
				716
This tracer can be configured when tracing likely/unlikely
calls within the kernel. It will trace when a likely or
unlikely branch is hit and whether its prediction was
correct.
				721
				722	"nop"
				723
				724	This is the "trace nothing" tracer. To remove all
				725	tracers from tracing simply echo "nop" into
				726	current_tracer.
				727
				728
				729	Examples of using the tracer
				730	----------------------------
				731
				732	Here are typical examples of using the tracers when controlling
				733	them only with the tracefs interface (without using any
				734	user-land utilities).
				735
				736	Output format:
				737	--------------
				738
				739	Here is an example of the output format of the file "trace"::
				740
  # tracer: function
  #
  # entries-in-buffer/entries-written: 140080/250280   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
              ....
				763
				764	A header is printed with the tracer name that is represented by
				765	the trace. In this case the tracer is "function". Then it shows the
				766	number of events in the buffer as well as the total number of entries
				767	that were written. The difference is the number of entries that were
				768	lost due to the buffer filling up (250280 - 140080 = 110200 events
				769	lost).
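The subtraction above can be automated by parsing the header line. The
following sketch reads a trace header on standard input and prints the number
of lost entries. The helper name "lost_entries" is hypothetical, and the
sketch assumes the header layout shown in the example output.

```shell
# Print entries-written minus entries-in-buffer from a "trace" header
# read on standard input. "lost_entries" is a hypothetical helper name.
lost_entries() {
	awk '/entries-in-buffer\/entries-written:/ {
		# The counts follow the label as "<in-buffer>/<written>".
		for (i = 1; i < NF; i++)
			if ($i == "entries-in-buffer/entries-written:") {
				split($(i + 1), n, "/")
				print n[2] - n[1]
				exit
			}
	}'
}
```

For the header shown earlier this yields 110200; on a live system it could be
fed with "head trace | lost_entries".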
				770
				771	The header explains the content of the events. Task name "bash", the task
				772	PID "1977", the CPU that it was running on "000", the latency format
				773	(explained below), the timestamp in <secs>.<usecs> format, the
				774	function name that was traced "sys_close" and the parent function that
				775	called this function "system_call_fastpath". The timestamp is the time
				776	at which the function was entered.
				777
				778	Latency trace format
				779	--------------------
				780
				781	When the latency-format option is enabled or when one of the latency
				782	tracers is set, the trace file gives somewhat more information to see
				783	why a latency happened. Here is a typical trace::
				784
  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: __lock_task_sighand
  #  => ended at:   _raw_spin_unlock_irqrestore
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  306us : <stack trace>
   => trace_hardirqs_on_caller
   => trace_hardirqs_on
   => _raw_spin_unlock_irqrestore
   => do_task_stat
   => proc_tgid_stat
   => proc_single_show
   => seq_read
   => vfs_read
   => sys_read
   => system_call_fastpath
				819
				820
This shows that the current tracer is "irqsoff", which traces the time
for which interrupts were disabled. It gives the trace version (which
never changes) and the version of the kernel on which this was executed
(3.8). Then it displays the max latency in microseconds (259 us),
followed by the number of trace entries displayed and the total number
(both are four: #4/4). VP, KP, SP, and HP are always zero and are
reserved for later use. #P is the number of online CPUs (#P:4).
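
The summary line described above can be picked apart mechanically. The
following is a minimal sketch, assuming the line has exactly the layout
shown in the sample header. The function name parse_latency_header and
the dictionary keys are illustrative and are not part of ftrace. The M:
field is captured verbatim because this text does not describe it.

```python
import re

# Pattern for the "# latency: ..." summary line of a latency trace header.
# Whitespace is matched loosely because the real output is column aligned.
HEADER_RE = re.compile(
    r"#\s*latency:\s*(?P<latency_us>\d+)\s*us,\s*"
    r"#(?P<shown>\d+)/(?P<total>\d+),\s*"
    r"CPU#(?P<cpu>\d+)\s*\|\s*"
    r"\(M:(?P<m_field>\S+)\s.*#P:(?P<online_cpus>\d+)\)"
)


def parse_latency_header(line):
    """Return the numeric fields of the summary line, or None if no match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Every captured field except the M: value is a decimal number.
    return {
        key: (value if key == "m_field" else int(value))
        for key, value in fields.items()
    }


sample = "# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)"
print(parse_latency_header(sample))
```

The VP, KP, SP and HP values are skipped on purpose, since they are
always zero.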
				828
				829	The task is the process that was running when the latency
				830	occurred. (ps pid: 6143).
				831
The start and stop entries show the functions in which interrupts were
disabled and enabled, respectively, which bound the latency:
				834
				835	- __lock_task_sighand is where the interrupts were disabled.
				836	- _raw_spin_unlock_irqrestore is where they were enabled again.
				837
				838	The next lines after the header are the trace itself. The header
				839	explains which is which.
				840
				841	cmd: The name of the process in the trace.
				842
				843	pid: The PID of that process.
				844
				845	CPU#: The CPU which the process was running on.
				846
				847	irqs-off: 'd' interrupts are disabled. '.' otherwise.
				848	.. caution:: If the architecture does not support a way to
				849	read the irq flags variable, an 'X' will always
				850	be printed here.
				851
				852	need-resched:
- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
				854	- 'n' only TIF_NEED_RESCHED is set,
				855	- 'p' only PREEMPT_NEED_RESCHED is set,
				856	- '.' otherwise.
				857
				858	hardirq/softirq:
				859	- 'Z' - NMI occurred inside a hardirq
				860	- 'z' - NMI is running
				861	- 'H' - hard irq occurred inside a softirq.
				862	- 'h' - hard irq is running
				863	- 's' - soft irq is running
				864	- '.' - normal context.
				865
				866	preempt-depth: The level of preempt_disabled
				867
				868	The above is mostly meaningful for kernel developers.
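
As an illustration of the field descriptions above, the combined CPU and
flags column (for example "2d..1") can be decoded with a small lookup.
This is a sketch only: the function name decode_latency_flags and the
wording of the returned strings are made up for this example, and it
assumes that a '.' in the preempt-depth position means a depth of zero
and that any other character there is a hexadecimal digit.

```python
IRQS_OFF = {
    "d": "interrupts disabled",
    ".": "interrupts enabled",
    "X": "irq flags not readable on this architecture",
}

NEED_RESCHED = {
    "N": "TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED set",
    "n": "only TIF_NEED_RESCHED set",
    "p": "only PREEMPT_NEED_RESCHED set",
    ".": "neither flag set",
}

CONTEXT = {
    "Z": "NMI inside a hardirq",
    "z": "NMI running",
    "H": "hard irq inside a softirq",
    "h": "hard irq running",
    "s": "soft irq running",
    ".": "normal context",
}


def decode_latency_flags(field):
    """Decode a column such as '2d..1' into its five components."""
    # The last four characters are the flags; everything before is the CPU.
    cpu, flags = field[:-4], field[-4:]
    irqs, resched, context, depth = flags
    return {
        "cpu": int(cpu),
        "irqs_off": IRQS_OFF[irqs],
        "need_resched": NEED_RESCHED[resched],
        "context": CONTEXT[context],
        # Assumption: '.' stands for depth 0, otherwise a hex digit.
        "preempt_depth": 0 if depth == "." else int(depth, 16),
    }


print(decode_latency_flags("2d..1"))
```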
				869
				870	time:
				871	When the latency-format option is enabled, the trace file
				872	output includes a timestamp relative to the start of the
				873	trace. This differs from the output when latency-format
				874	is disabled, which includes an absolute timestamp.
				875
delay:
This is just to help catch your eye a bit better. (It still
needs to be fixed to be relative only to the same CPU.)
The marks are determined by the difference between the
current trace entry and the next one.

- '$' - greater than 1 second
- '@' - greater than 100 milliseconds
- '*' - greater than 10 milliseconds
- '#' - greater than 1000 microseconds
- '!' - greater than 100 microseconds
- '+' - greater than 10 microseconds
- ' ' - less than or equal to 10 microseconds.
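
The thresholds in this list can be written as a small lookup. The
following sketch takes the difference between two consecutive timestamps
in microseconds and returns the mark from the list; the name delay_marker
is illustrative and is not an ftrace interface.

```python
# Thresholds in microseconds, checked from largest to smallest.
DELAY_MARKS = [
    (1_000_000, "$"),  # greater than 1 second
    (100_000, "@"),    # greater than 100 milliseconds
    (10_000, "*"),     # greater than 10 milliseconds
    (1_000, "#"),      # greater than 1000 microseconds
    (100, "!"),        # greater than 100 microseconds
    (10, "+"),         # greater than 10 microseconds
]


def delay_marker(delta_us):
    """Return the delay mark for a gap of delta_us microseconds."""
    for threshold_us, mark in DELAY_MARKS:
        if delta_us > threshold_us:
            return mark
    return " "  # less than or equal to 10 microseconds


# The first sample entry is at 0us and the next one at 259us.
print(repr(delay_marker(259 - 0)))
```

For the first entry of the sample trace the gap is 259 microseconds,
which yields '!' as printed there. Marks computed from the rounded
microsecond values will not always match the printed ones.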
				889
				890	The rest is the same as the 'trace' file.
				891
				892	Note, the latency tracers will usually end with a back trace
				893	to easily find where the latency occurred.
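
A single entry of the latency format can be split into the columns named
in the header with a regular expression. This is a rough sketch, assuming
entries look like the ones in the sample trace; parse_latency_entry and
its dictionary keys are hypothetical names chosen for this example. Lines
of the trailing back trace ("=> function") do not match and yield None.

```python
import re

# cmd-pid, CPU plus four flag characters, time in us with a delay mark,
# then the traced function and (optionally) its caller.
ENTRY_RE = re.compile(
    r"^\s*(?P<cmd>.+?)-(?P<pid>\d+)\s+"
    r"(?P<cpu>\d+)(?P<flags>\S{4})\s+"
    r"(?P<time_us>\d+)us(?P<delay>[ $@*#!+]):\s+"
    r"(?P<function>.+?)(?:\s+<-(?P<caller>\S+))?\s*$"
)


def parse_latency_entry(line):
    """Split one latency-format entry into its columns, or return None."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the numeric columns; the rest stay as strings.
    for key in ("pid", "cpu", "time_us"):
        entry[key] = int(entry[key])
    return entry


line = "ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore"
print(parse_latency_entry(line))
```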
				894
				895	trace_options
				896	-------------
				897
				898	The trace_options file (or the options directory) is used to control
				899	what gets printed in the trace output, or manipulate the tracers.
				900	To see what is available, simply cat the file::
				901
				902	cat trace_options
				903	print-parent
				904	nosym-offset
				905	nosym-addr
				906	noverbose
				907	noraw
				908	nohex
				909	nobin
				910	noblock
				911	trace_printk
				912	annotate
				913	nouserstacktrace
				914	nosym-userobj
				915	noprintk-msg-only
				916	context-info
				917	nolatency-format
				918	record-cmd
				919	norecord-tgid
				920	overwrite
				921	nodisable_on_free
				922	irq-info
				923	markers
				924	noevent-fork
				925	function-trace
				926	nofunction-fork
				927	nodisplay-graph
				928	nostacktrace
				929	nobranch
				930
				931	To disable one of the options, echo in the option prepended with
				932	"no"::
				933
				934	echo noprint-parent > trace_options
				935
				936	To enable an option, leave off the "no"::
				937
				938	echo sym-offset > trace_options
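
The "no" prefix convention can also be handled from a script. The sketch
below turns the text read from trace_options into a dictionary and builds
the string to write back. Both function names are illustrative, and the
code assumes that no option has a name of its own that begins with "no",
which holds for the list shown above.

```python
def parse_trace_options(text):
    """Map each option name to True (enabled) or False (disabled)."""
    options = {}
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue
        # Assumption: a leading "no" always means "disabled".
        if name.startswith("no"):
            options[name[2:]] = False
        else:
            options[name] = True
    return options


def option_token(name, enable):
    """Return the string to write to trace_options for this setting."""
    return name if enable else "no" + name


listing = "print-parent\nnosym-offset\nnolatency-format\nirq-info\n"
print(parse_trace_options(listing))
print(option_token("print-parent", False))
```

Writing the returned token to trace_options has the same effect as the
echo commands shown above.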
				939
				940	Here are the available options:
				941
				942	print-parent
				943	On function traces, display the calling (parent)
				944	function as well as the function being traced.
				945	::
				946
				947	print-parent:
				948	bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul
				949
				950	noprint-parent:
				951	bash-4000 [01] 1477.606694: simple_strtoul
				952
				953
				954	sym-offset
				955	Display not only the function name, but also the
				956	offset in the function. For example, instead of
				957	seeing just "ktime_get", you will see
				958	"ktime_get+0xb/0x20".
				959	::
				960
				961	sym-offset:
				962	bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
				963
				964	sym-addr
				965	This will also display the function address as well
				966	as the function name.
				967	::
				968
				969	sym-addr:
				970	bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
				971
				972	verbose
				973	This deals with the trace file when the
				974	latency-format option is enabled.
				975	::
				976
				977	bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
				978	(+0.000ms): simple_strtoul (kstrtoul)
				979
				980	raw
				981	This will display raw numbers. This option is best for
				982	use with user applications that can translate the raw
				983	numbers better than having it done in the kernel.
				984
				985	hex
				986	Similar to raw, but the numbers will be in a hexadecimal format.
				987
				988	bin
				989	This will print out the formats in raw binary.
				990
				991	block
				992	When set, reading trace_pipe will not block when polled.
				993
				994	trace_printk
				995	Can disable trace_printk() from writing into the buffer.
				996
				997	annotate
				998	It is sometimes confusing when the CPU buffers are full
				999	and one CPU buffer had a lot of events recently, thus
a shorter time frame, where another CPU may have only had
				1001	a few events, which lets it have older events. When
				1002	the trace is reported, it shows the oldest events first,
				1003	and it may look like only one CPU ran (the one with the
				1004	oldest events). When the annotate option is set, it will
				1005	display when a new CPU buffer started::
				1006
				1007	<idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
				1008	<idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
				1009	<idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
				1010	##### CPU 2 buffer started ####
				1011	<idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
				1012	<idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
				1013	<idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
				1014
				1015	userstacktrace
				1016	This option changes the trace. It records a
				1017	stacktrace of the current user space thread after
				1018	each trace event.
				1019
				1020	sym-userobj
When user stack traces are enabled, look up which
object the address belongs to, and print a
relative address. This is especially useful when
ASLR is on, otherwise you don't get a chance to
resolve the address to object/file/line after
the app is no longer running.

The lookup is performed when you read
trace or trace_pipe. Example::
				1030
  a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
				1033
				1034
				1035	printk-msg-only
				1036	When set, trace_printk()s will only show the format
				1037	and not their parameters (if trace_bprintk() or
				1038	trace_bputs() was used to save the trace_printk()).
				1039
				1040	context-info
Shows the context information of each event: the comm, PID,
timestamp, CPU, and other useful data. When disabled, only
the event data is shown.
				1043
				1044	latency-format
				1045	This option changes the trace output. When it is enabled,
				1046	the trace displays additional information about the
				1047	latency, as described in "Latency trace format".
				1048
				1049	record-cmd
				1050	When any event or tracer is enabled, a hook is enabled
				1051	in the sched_switch trace point to fill comm cache
				1052	with mapped pids and comms. But this may cause some
				1053	overhead, and if you only care about pids, and not the
				1054	name of the task, disabling this option can lower the
				1055	impact of tracing. See "saved_cmdlines".
				1056
				1057	record-tgid
				1058	When any event or tracer is enabled, a hook is enabled
				1059	in the sched_switch trace point to fill the cache of
				1060	mapped Thread Group IDs (TGID) mapping to pids. See
				1061	"saved_tgids".
				1062
				1063	overwrite
				1064	This controls what happens when the trace buffer is
				1065	full. If "1" (default), the oldest events are
				1066	discarded and overwritten. If "0", then the newest
				1067	events are discarded.
				1068	(see per_cpu/cpu0/stats for overrun and dropped)
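
The two behaviours of the overwrite option can be illustrated with a toy
buffer. This is a simplified model, not the kernel ring buffer (which is
per CPU and considerably more involved); record_events and its "lost"
counter are hypothetical names used only here.

```python
from collections import deque


def record_events(events, capacity, overwrite=True):
    """Store events in a buffer of fixed capacity.

    Returns the buffer contents and the number of events that were lost.
    """
    buffer = deque()
    lost = 0
    for event in events:
        if len(buffer) < capacity:
            buffer.append(event)
        elif overwrite:
            # Buffer full: discard the oldest event to make room.
            buffer.popleft()
            buffer.append(event)
            lost += 1
        else:
            # Buffer full: discard the newest event instead.
            lost += 1
    return list(buffer), lost


print(record_events(range(5), 3, overwrite=True))
print(record_events(range(5), 3, overwrite=False))
```

With overwrite enabled the buffer ends up holding the most recent events;
with it disabled the buffer keeps the earliest events and later ones are
dropped.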
				1069
				1070	disable_on_free
				1071	When the free_buffer is closed, tracing will
				1072	stop (tracing_on set to 0).
				1073
				1074	irq-info
				1075	Shows the interrupt, preempt count, need resched data.
				1076	When disabled, the trace looks like::
				1077
  # tracer: function
  #
  # entries-in-buffer/entries-written: 144405/9452052   #P:4
  #
  #           TASK-PID   CPU#      TIMESTAMP  FUNCTION
  #              | |       |          |         |
            <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
            <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
            <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
				1087
				1088
				1089	markers
				1090	When set, the trace_marker is writable (only by root).
				1091	When disabled, the trace_marker will error with EINVAL
				1092	on write.
				1093
				1094	event-fork
				1095	When set, tasks with PIDs listed in set_event_pid will have
				1096	the PIDs of their children added to set_event_pid when those
				1097	tasks fork. Also, when tasks with PIDs in set_event_pid exit,
				1098	their PIDs will be removed from the file.
				1099
				1100	function-trace
				1101	The latency tracers will enable function tracing
if this option is enabled (which it is by default). When
				1103	it is disabled, the latency tracers do not trace
				1104	functions. This keeps the overhead of the tracer down
				1105	when performing latency tests.
				1106
				1107	function-fork
				1108	When set, tasks with PIDs listed in set_ftrace_pid will
				1109	have the PIDs of their children added to set_ftrace_pid
				1110	when those tasks fork. Also, when tasks with PIDs in
				1111	set_ftrace_pid exit, their PIDs will be removed from the
				1112	file.
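
The bookkeeping described for function-fork (and, in the same way, for
event-fork with set_event_pid) can be modelled with a set of PIDs. The
class below is a toy model written for this text, not kernel code; the
names PidFilter, on_fork and on_exit are invented, and it assumes that
both the fork and the exit handling depend on the option being set.

```python
class PidFilter:
    """Toy model of a PID list that optionally follows forks."""

    def __init__(self, pids, follow_fork):
        self.pids = set(pids)
        self.follow_fork = follow_fork

    def on_fork(self, parent, child):
        # A child is added only if its parent is already being traced.
        if self.follow_fork and parent in self.pids:
            self.pids.add(child)

    def on_exit(self, pid):
        # Exiting tasks are removed from the list.
        if self.follow_fork:
            self.pids.discard(pid)


traced = PidFilter({100}, follow_fork=True)
traced.on_fork(100, 101)   # child of a traced task: added
traced.on_fork(200, 201)   # parent not traced: ignored
traced.on_exit(100)        # traced task exits: removed
print(sorted(traced.pids))
```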
				1113
				1114	display-graph
				1115	When set, the latency tracers (irqsoff, wakeup, etc) will
				1116	use function graph tracing instead of function tracing.
				1117
				1118	stacktrace
				1119	When set, a stack trace is recorded after any trace event
				1120	is recorded.
				1121
				1122	branch
				1123	Enable branch tracing with the tracer. This enables branch
				1124	tracer along with the currently set tracer. Enabling this
				1125	with the "nop" tracer is the same as just enabling the
				1126	"branch" tracer.
				1127
				1128	.. tip:: Some tracers have their own options. They only appear in this
				1129	file when the tracer is active. They always appear in the
				1130	options directory.
				1131
				1132
				1133	Here are the per tracer options:
				1134
				1135	Options for function tracer:
				1136
				1137	func_stack_trace
When set, a stack trace is recorded after every
function that is recorded. NOTE! Before enabling this,
limit the functions that are recorded with
"set_ftrace_filter"; otherwise the system performance
will be critically degraded. Remember to disable
this option before clearing the function filter.
				1144
				1145	Options for function_graph tracer:
				1146
				1147	Since the function_graph tracer has a slightly different output
				1148	it has its own options to control what is displayed.
				1149
				1150	funcgraph-overrun
When set, the "overrun" of the graph stack is
displayed after each function traced. An
overrun occurs when the stack depth of the calls
is greater than what is reserved for each task.
Each task has a fixed array of functions to
trace in the call graph. If the depth of the
calls exceeds that, the function is not traced.
The overrun is the number of functions missed
due to exceeding this array.
				1160
				1161	funcgraph-cpu
				1162	When set, the CPU number of the CPU where the trace
				1163	occurred is displayed.
				1164
				1165	funcgraph-overhead
When set, if the function takes longer than
a certain amount, then a delay marker is
displayed. See "delay" above, under the
header description.
				1170
				1171	funcgraph-proc
Unlike other tracers, the process' command line
is not displayed by default, but instead only
when a task is traced in and out during a context
switch. Enabling this option displays the command
of each process on every line.
				1177
				1178	funcgraph-duration
At the end of each function (the return),
the amount of time spent in the
function is displayed in microseconds.
				1182
				1183	funcgraph-abstime
				1184	When set, the timestamp is displayed at each line.
				1185
				1186	funcgraph-irqs
				1187	When disabled, functions that happen inside an
				1188	interrupt will not be traced.
				1189
				1190	funcgraph-tail
				1191	When set, the return event will include the function
				1192	that it represents. By default this is off, and
				1193	only a closing curly bracket "}" is displayed for
				1194	the return of a function.
				1195
				1196	sleep-time
When running the function graph tracer, this option
controls whether the time a task is scheduled out is
included in its function time. When enabled, the time
the task has been scheduled out is accounted as part
of the function call.
				1201
				1202	graph-time
When running the function profiler with the function graph
tracer, this option controls whether the time spent in nested
function calls is included. When this is not set, the time
reported for the function will only include the time the
function itself executed for, not the time for functions
that it called.
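
The difference made by graph-time can be shown with a small calculation.
The sketch below models a call as a tuple of name, entry time, return
time and nested calls; the function reported_times, the tuple layout and
the timestamps are all made up for illustration and do not reflect how
the profiler stores its data.

```python
def reported_times(call, graph_time, out=None):
    """Return {function name: reported time in us} for a call tree.

    call is (name, start_us, end_us, children), where children is a list
    of calls made directly by this function.
    """
    if out is None:
        out = {}
    name, start_us, end_us, children = call
    total = end_us - start_us
    nested = 0
    for child in children:
        reported_times(child, graph_time, out)
        nested += child[2] - child[1]
    # With graph-time set, nested calls count toward the caller;
    # without it, only the time spent in the function itself is kept.
    out[name] = out.get(name, 0) + (total if graph_time else total - nested)
    return out


tree = ("outer", 0, 10, [("inner_a", 2, 5, []), ("inner_b", 6, 8, [])])
print(reported_times(tree, graph_time=True))
print(reported_times(tree, graph_time=False))
```

In this example "outer" is reported with 10 microseconds when nested
calls are included and with 5 microseconds when they are not.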
				1208
				1209	Options for blk tracer:
				1210
				1211	blk_classic
				1212	Shows a more minimalistic output.
				1213
				1214
				1215	irqsoff
				1216	-------
				1217
When interrupts are disabled, the CPU cannot react to any other
external event (besides NMIs and SMIs). This prevents the timer
interrupt from triggering or the mouse interrupt from letting
the kernel know of a new mouse event. The result is added
latency in the reaction time.
				1223
The irqsoff tracer tracks the time for which interrupts are
disabled. When a new maximum latency is hit, the tracer saves
the trace leading up to that latency point. Every time a
new maximum is reached, the old saved trace is discarded and the
new trace is saved.
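
The "keep only the worst case" behaviour can be sketched in a few lines.
The class below is a toy model for illustration, not the tracer's
implementation; the names MaxLatencyTracker, report and reset are
invented, and it assumes that resetting the maximum leaves the previously
saved trace in place until a new maximum is recorded.

```python
class MaxLatencyTracker:
    """Keep the trace that belongs to the largest latency seen so far."""

    def __init__(self):
        self.max_latency_us = 0
        self.saved_trace = None

    def report(self, latency_us, trace):
        """Record a finished interrupts-off section.

        Returns True if it set a new maximum and its trace was saved.
        """
        if latency_us > self.max_latency_us:
            # New maximum: the old saved trace is replaced.
            self.max_latency_us = latency_us
            self.saved_trace = trace
            return True
        return False

    def reset(self):
        # Comparable to writing 0 to tracing_max_latency.
        self.max_latency_us = 0


tracker = MaxLatencyTracker()
tracker.report(16, ["section A"])
tracker.report(10, ["section B"])   # shorter than the maximum: ignored
tracker.report(71, ["section C"])   # new maximum: replaces section A
print(tracker.max_latency_us, tracker.saved_trace)
```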
				1229
				1230	To reset the maximum, echo 0 into tracing_max_latency. Here is
				1231	an example::
				1232
				1233	# echo 0 > options/function-trace
				1234	# echo irqsoff > current_tracer
				1235	# echo 1 > tracing_on
				1236	# echo 0 > tracing_max_latency
				1237	# ls -ltr
				1238	[...]
				1239	# echo 0 > tracing_on
				1240	# cat trace
  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: run_timer_softirq
  #  => ended at:   run_timer_softirq
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1261	<idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
				1262	<idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
				1263	<idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
				1264	<idle>-0 0dNs3 25us : <stack trace>
				1265	=> _raw_spin_unlock_irq
				1266	=> run_timer_softirq
				1267	=> __do_softirq
				1268	=> call_softirq
				1269	=> do_softirq
				1270	=> irq_exit
				1271	=> smp_apic_timer_interrupt
				1272	=> apic_timer_interrupt
				1273	=> rcu_idle_exit
				1274	=> cpu_idle
				1275	=> rest_init
				1276	=> start_kernel
				1277	=> x86_64_start_reservations
				1278	=> x86_64_start_kernel
				1279
Here we see that we had a latency of 16 microseconds (which is
				1281	very good). The _raw_spin_lock_irq in run_timer_softirq disabled
				1282	interrupts. The difference between the 16 and the displayed
				1283	timestamp 25us occurred because the clock was incremented
				1284	between the time of recording the max latency and the time of
				1285	recording the function that had that latency.
				1286
				1287	Note the above example had function-trace not set. If we set
				1288	function-trace, we get a much larger output::
				1289
				1290	with echo 1 > options/function-trace
				1291
  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1312	bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
				1313	bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
				1314	bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
				1315	bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
				1316	bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
				1317	bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
				1318	bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
				1319	bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
				1320	bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
				1321	[...]
				1322	bash-2042 3d..1 67us : delay_tsc <-__delay
				1323	bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
				1324	bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
				1325	bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
				1326	bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
				1327	bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
				1328	bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1329	bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1330	bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
				1331	bash-2042 3d..1 120us : <stack trace>
				1332	=> _raw_spin_unlock_irqrestore
				1333	=> ata_scsi_queuecmd
				1334	=> scsi_dispatch_cmd
				1335	=> scsi_request_fn
				1336	=> __blk_run_queue_uncond
				1337	=> __blk_run_queue
				1338	=> blk_queue_bio
				1339	=> generic_make_request
				1340	=> submit_bio
				1341	=> submit_bh
				1342	=> __ext3_get_inode_loc
				1343	=> ext3_iget
				1344	=> ext3_lookup
				1345	=> lookup_real
				1346	=> __lookup_hash
				1347	=> walk_component
				1348	=> lookup_last
				1349	=> path_lookupat
				1350	=> filename_lookup
				1351	=> user_path_at_empty
				1352	=> user_path_at
				1353	=> vfs_fstatat
				1354	=> vfs_stat
				1355	=> sys_newstat
				1356	=> system_call_fastpath
				1357
				1358
				1359	Here we traced a 71 microsecond latency. But we also see all the
				1360	functions that were called during that time. Note that by
				1361	enabling function tracing, we incur an added overhead. This
				1362	overhead may extend the latency times. But nevertheless, this
				1363	trace has provided some very helpful debugging information.
				1364
				1365
				1366	preemptoff
				1367	----------
				1368
				1369	When preemption is disabled, we may be able to receive
				1370	interrupts but the task cannot be preempted and a higher
				1371	priority task must wait for preemption to be enabled again
				1372	before it can preempt a lower priority task.
				1373
				1374	The preemptoff tracer traces the places that disable preemption.
				1375	Like the irqsoff tracer, it records the maximum latency for
				1376	which preemption was disabled. The control of preemptoff tracer
				1377	is much like the irqsoff tracer.
				1378	::
				1379
				1380	# echo 0 > options/function-trace
				1381	# echo preemptoff > current_tracer
				1382	# echo 1 > tracing_on
				1383	# echo 0 > tracing_max_latency
				1384	# ls -ltr
				1385	[...]
				1386	# echo 0 > tracing_on
				1387	# cat trace
  # tracer: preemptoff
  #
  # preemptoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: do_IRQ
  #  => ended at:   do_IRQ
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1408	sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
				1409	sshd-1991 1d..1 46us : irq_exit <-do_IRQ
				1410	sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
				1411	sshd-1991 1d..1 52us : <stack trace>
				1412	=> sub_preempt_count
				1413	=> irq_exit
				1414	=> do_IRQ
				1415	=> ret_from_intr
				1416
				1417
				1418	This has some more changes. Preemption was disabled when an
				1419	interrupt came in (notice the 'h'), and was enabled on exit.
				1420	But we also see that interrupts have been disabled when entering
				1421	the preempt off section and leaving it (the 'd'). We do not know if
interrupts were enabled in the meantime or shortly after this
				1423	was over.
				1424	::
				1425
  # tracer: preemptoff
  #
  # preemptoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: wake_up_new_task
  #  => ended at:   task_rq_unlock
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1446	bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
				1447	bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
				1448	bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
				1449	bash-1994 1d..1 1us : source_load <-select_task_rq_fair
				1450	bash-1994 1d..1 1us : source_load <-select_task_rq_fair
				1451	[...]
				1452	bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
				1453	bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
				1454	bash-1994 1d..1 13us : add_preempt_count <-irq_enter
				1455	bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
				1456	bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
				1457	bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
				1458	bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
				1459	bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
				1460	[...]
				1461	bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
				1462	bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
				1463	bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
				1464	bash-1994 1d..2 36us : do_softirq <-irq_exit
				1465	bash-1994 1d..2 36us : __do_softirq <-call_softirq
				1466	bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
				1467	bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
				1468	bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
				1469	bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
				1470	bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
				1471	[...]
				1472	bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
				1473	bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
				1474	bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
				1475	bash-1994 1dN.2 82us : idle_cpu <-irq_exit
				1476	bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
				1477	bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
				1478	bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
				1479	bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
				1480	bash-1994 1.N.1 104us : <stack trace>
				1481	=> sub_preempt_count
				1482	=> _raw_spin_unlock_irqrestore
				1483	=> task_rq_unlock
				1484	=> wake_up_new_task
				1485	=> do_fork
				1486	=> sys_clone
				1487	=> stub_clone
				1488
				1489
				1490	The above is an example of the preemptoff trace with
				1491	function-trace set. Here we see that interrupts were not disabled
				1492	the entire time. The irq_enter code lets us know that we entered
				1493	an interrupt 'h'. Before that, the functions being traced still
				1494	show that it is not in an interrupt, but we can see from the
				1495	functions themselves that this is not the case.
				1496
				1497	preemptirqsoff
				1498	--------------
				1499
Knowing the locations that have interrupts disabled or
preemption disabled for the longest times is helpful. But
sometimes we would like to know when preemption, interrupts,
or both are disabled.
				1504
				1505	Consider the following code::
				1506
				1507	local_irq_disable();
				1508	call_function_with_irqs_off();
				1509	preempt_disable();
				1510	call_function_with_irqs_and_preemption_off();
				1511	local_irq_enable();
				1512	call_function_with_preemption_off();
				1513	preempt_enable();
				1514
				1515	The irqsoff tracer will record the total length of
				1516	call_function_with_irqs_off() and
				1517	call_function_with_irqs_and_preemption_off().
				1518
				1519	The preemptoff tracer will record the total length of
				1520	call_function_with_irqs_and_preemption_off() and
				1521	call_function_with_preemption_off().
				1522
But neither will trace the total time during which interrupts
and/or preemption are disabled. This total time is the time during
which we cannot schedule. To record this time, use the
preemptirqsoff tracer.
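
The three measurements can be compared on a made-up timeline for the code
sequence above. The sketch below is a toy calculation, not tracer code:
the function max_off_times, the event names and the timestamps are all
hypothetical, and nesting of disable calls is ignored for simplicity.

```python
def max_off_times(events):
    """Return the longest section seen by each of the three tracers.

    events is a list of (time_us, event) pairs, where event is one of
    "irq_disable", "irq_enable", "preempt_disable", "preempt_enable".
    """
    irqs_off = False
    preempt_off = False
    start = {"irqsoff": None, "preemptoff": None, "preemptirqsoff": None}
    longest = {name: 0 for name in start}
    for time_us, event in events:
        if event == "irq_disable":
            irqs_off = True
        elif event == "irq_enable":
            irqs_off = False
        elif event == "preempt_disable":
            preempt_off = True
        elif event == "preempt_enable":
            preempt_off = False
        else:
            raise ValueError("unknown event: " + event)
        active = {
            "irqsoff": irqs_off,
            "preemptoff": preempt_off,
            # Either condition is enough to prevent scheduling.
            "preemptirqsoff": irqs_off or preempt_off,
        }
        for name, is_on in active.items():
            if is_on and start[name] is None:
                start[name] = time_us          # section begins
            elif not is_on and start[name] is not None:
                longest[name] = max(longest[name], time_us - start[name])
                start[name] = None             # section ends
    return longest


timeline = [
    (0, "irq_disable"),
    (10, "preempt_disable"),
    (30, "irq_enable"),
    (45, "preempt_enable"),
]
print(max_off_times(timeline))
```

On this timeline irqsoff sees 30 microseconds and preemptoff sees 35,
while preemptirqsoff covers the whole 45 microsecond span.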
				1527
				1528	Again, using this trace is much like the irqsoff and preemptoff
				1529	tracers.
				1530	::
				1531
				1532	# echo 0 > options/function-trace
				1533	# echo preemptirqsoff > current_tracer
				1534	# echo 1 > tracing_on
				1535	# echo 0 > tracing_max_latency
				1536	# ls -ltr
				1537	[...]
				1538	# echo 0 > tracing_on
				1539	# cat trace
  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1560	ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
				1561	ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1562	ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
				1563	ls-2230 3...1 111us : <stack trace>
				1564	=> sub_preempt_count
				1565	=> _raw_spin_unlock_irqrestore
				1566	=> ata_scsi_queuecmd
				1567	=> scsi_dispatch_cmd
				1568	=> scsi_request_fn
				1569	=> __blk_run_queue_uncond
				1570	=> __blk_run_queue
				1571	=> blk_queue_bio
				1572	=> generic_make_request
				1573	=> submit_bio
				1574	=> submit_bh
				1575	=> ext3_bread
				1576	=> ext3_dir_bread
				1577	=> htree_dirblock_to_tree
				1578	=> ext3_htree_fill_tree
				1579	=> ext3_readdir
				1580	=> vfs_readdir
				1581	=> sys_getdents
				1582	=> system_call_fastpath
				1583
				1584
				1585	The trace_hardirqs_off_thunk is called from assembly on x86 when
				1586	interrupts are disabled in the assembly code. Without the
				1587	function tracing, we do not know if interrupts were enabled
				1588	within the preemption points. We do see that it started with
				1589	preemption enabled.
				1590
				1591	Here is a trace with function-trace set::
				1592
  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: schedule
  #  => ended at:   mutex_unlock
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1613	kworker/-59 3...1 0us : __schedule <-schedule
				1614	kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
				1615	kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
				1616	kworker/-59 3d..2 1us : deactivate_task <-__schedule
				1617	kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
				1618	kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
				1619	kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
				1620	kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
				1621	kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
				1622	kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
				1623	kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
				1624	kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
				1625	kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
				1626	kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
				1627	kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
				1628	kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
				1629	kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
				1630	kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
				1631	kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
				1632	kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
				1633	kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
				1634	kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
				1635	kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
				1636	kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
				1637	kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
				1638	ls-2269 3d..2 7us : finish_task_switch <-__schedule
				1639	ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
				1640	ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
				1641	ls-2269 3d..2 8us : irq_enter <-do_IRQ
				1642	ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
				1643	ls-2269 3d..2 9us : add_preempt_count <-irq_enter
				1644	ls-2269 3d.h2 9us : exit_idle <-do_IRQ
				1645	[...]
				1646	ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
				1647	ls-2269 3d.h2 20us : irq_exit <-do_IRQ
				1648	ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
				1649	ls-2269 3d..3 21us : do_softirq <-irq_exit
				1650	ls-2269 3d..3 21us : __do_softirq <-call_softirq
				1651	ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
				1652	ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
				1653	ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
				1654	ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
				1655	ls-2269 3d.s5 31us : irq_enter <-do_IRQ
				1656	ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
				1657	[...]
				1658	ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
				1659	ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
				1660	ls-2269 3d.H5 32us : exit_idle <-do_IRQ
				1661	ls-2269 3d.H5 32us : handle_irq <-do_IRQ
				1662	ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
				1663	ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
				1664	[...]
				1665	ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
				1666	ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
				1667	ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
				1668	ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
				1669	ls-2269 3d..3 159us : idle_cpu <-irq_exit
				1670	ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
				1671	ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
				1672	ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
				1673	ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
				1674	ls-2269 3d... 186us : <stack trace>
				1675	=> __mutex_unlock_slowpath
				1676	=> mutex_unlock
				1677	=> process_output
				1678	=> n_tty_write
				1679	=> tty_write
				1680	=> vfs_write
				1681	=> sys_write
				1682	=> system_call_fastpath
				1683
				1684	This is an interesting trace. It started with kworker running and
				1685	scheduling out and ls taking over. But as soon as ls released the
				1686	rq lock and enabled interrupts (but not preemption) an interrupt
				1687	triggered. When the interrupt finished, it started running softirqs.
				1688	But while the softirq was running, another interrupt triggered.
				1689	When an interrupt is running inside a softirq, the annotation is 'H'.
				1690
				1691
				1692	wakeup
				1693	------
				1694
One common case that people are interested in tracing is the
time it takes for a task that is woken to actually start running.
For non-Real-Time tasks this time can be arbitrary, but tracing
it can nonetheless be interesting.
				1699
				1700	Without function tracing::
				1701
				1702	# echo 0 > options/function-trace
				1703	# echo wakeup > current_tracer
				1704	# echo 1 > tracing_on
				1705	# echo 0 > tracing_max_latency
				1706	# chrt -f 5 sleep 1
				1707	# echo 0 > tracing_on
				1708	# cat trace
				1709	# tracer: wakeup
				1710	#
				1711	# wakeup latency trace v1.1.5 on 3.8.0-test+
				1712	# --------------------------------------------------------------------
  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1726	<idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
				1727	<idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
				1728	<idle>-0 3d..3 15us : __schedule <-schedule
				1729	<idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
				1730
The tracer only traces the highest priority task in the system
to avoid tracing the normal circumstances. Here we see that
the kworker, with a nice priority of -20 (not very nice), took
just 15 microseconds from the time it was woken to the time it
ran.
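
When many such traces are collected from a script, the reported latency
can be pulled out of the header. The following is only a minimal sketch
that assumes the header layout shown above; the function name
``trace_latency_us`` is made up for this example.

```shell
# Print the latency (in microseconds) found in the header of a
# latency-format trace read from stdin. Prints nothing if the
# "# latency:" header line is missing.
trace_latency_us() {
    awk '/^# latency: [0-9]+ us/ { print $3; exit }'
}
```

It would be used as ``trace_latency_us < trace`` from within the tracing
directory.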
				1736
				1737	Non Real-Time tasks are not that interesting. A more interesting
				1738	trace is to concentrate only on Real-Time tasks.
				1739
				1740	wakeup_rt
				1741	---------
				1742
In a Real-Time environment it is very important to know how long
it takes from the moment the highest priority task is woken up
to the moment it executes. This is also known as "schedule
latency". I stress the point that this is about RT tasks. It is
also important to know the scheduling latency of non-RT tasks,
but for non-RT tasks the average schedule latency is the better
measure. Tools like LatencyTop are more appropriate for such
measurements.
				1751
				1752	Real-Time environments are interested in the worst case latency.
				1753	That is the longest latency it takes for something to happen,
				1754	and not the average. We can have a very fast scheduler that may
				1755	only have a large latency once in a while, but that would not
				1756	work well with Real-Time tasks. The wakeup_rt tracer was designed
				1757	to record the worst case wakeups of RT tasks. Non-RT tasks are
				1758	not recorded because the tracer only records one worst case and
				1759	tracing non-RT tasks that are unpredictable will overwrite the
				1760	worst case latency of RT tasks (just run the normal wakeup
				1761	tracer for a while to see that effect).
				1762
				1763	Since this tracer only deals with RT tasks, we will run this
				1764	slightly differently than we did with the previous tracers.
				1765	Instead of performing an 'ls', we will run 'sleep 1' under
				1766	'chrt' which changes the priority of the task.
				1767	::
				1768
				1769	# echo 0 > options/function-trace
				1770	# echo wakeup_rt > current_tracer
				1771	# echo 1 > tracing_on
				1772	# echo 0 > tracing_max_latency
				1773	# chrt -f 5 sleep 1
				1774	# echo 0 > tracing_on
				1775	# cat trace
				1778	# tracer: wakeup_rt
				1779	#
				1780	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
				1781	# --------------------------------------------------------------------
  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1795	<idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
				1796	<idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
				1797	<idle>-0 3d..3 5us : __schedule <-schedule
				1798	<idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
				1799
				1800
				1801	Running this on an idle system, we see that it only took 5 microseconds
				1802	to perform the task switch. Note, since the trace point in the schedule
				1803	is before the actual "switch", we stop the tracing when the recorded task
				1804	is about to schedule in. This may change if we add a new marker at the
				1805	end of the scheduler.
				1806
				1807	Notice that the recorded task is 'sleep' with the PID of 2389
				1808	and it has an rt_prio of 5. This priority is user-space priority
				1809	and not the internal kernel priority. The policy is 1 for
				1810	SCHED_FIFO and 2 for SCHED_RR.
				1811
Note that the trace data shows the internal priority (99 - rtprio).
				1813	::
				1814
				1815	<idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
				1816
The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
and in the running state 'R'. The sleep task was scheduled in with
2389: 94:R. That is, its priority is the kernel rtprio (99 - 5 = 94),
and it too is in the running state.
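
The arithmetic above can be sketched as a small helper. This is an
illustration only: the function name ``trace_prio_to_user`` is made up,
and the split at 100 (values below it being Real-Time priorities) is an
assumption that matches the examples in this document.

```shell
# Convert the priority shown in the trace data into the user-space view.
# Assumption: values below 100 are Real-Time (rt_prio = 99 - prio) and
# all other values are normal tasks (nice = prio - 120).
trace_prio_to_user() {
    prio=$1
    if [ "$prio" -lt 100 ]; then
        echo "rt_prio $((99 - prio))"
    else
        echo "nice $((prio - 120))"
    fi
}
```

With the values from the traces above, 94 gives an rt_prio of 5, 120
gives a nice value of 0, and the kworker's 100 gives a nice value of -20.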
				1821
				1822	Doing the same with chrt -r 5 and function-trace set.
				1823	::
				1824
  # echo 1 > options/function-trace
				1826
				1827	# tracer: wakeup_rt
				1828	#
				1829	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
				1830	# --------------------------------------------------------------------
  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1844	<idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
				1845	<idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
				1846	<idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
				1847	<idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr
				1848	<idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
				1849	<idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
				1850	<idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
				1851	<idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
				1852	<idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
				1853	<idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1854	<idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
				1855	<idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
				1856	<idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
				1857	<idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
				1858	<idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
				1859	<idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
				1860	<idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
				1861	<idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
				1862	<idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
				1863	<idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
				1864	<idle>-0 3dN.2 9us : idle_cpu <-irq_exit
				1865	<idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
				1866	<idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
				1867	<idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
				1868	<idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
				1869	<idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
				1870	<idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
				1871	<idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
				1872	<idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
				1873	<idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
				1874	<idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit
				1875	<idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz
				1876	<idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
				1877	<idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz
				1878	<idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update
				1879	<idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz
				1880	<idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
				1881	<idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit
				1882	<idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
				1883	<idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
				1884	<idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
				1885	<idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
				1886	<idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
				1887	<idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
				1888	<idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
				1889	<idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
				1890	<idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
				1891	<idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
				1892	<idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
				1893	<idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
				1894	<idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
				1895	<idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1896	<idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
				1897	<idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
				1898	<idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
				1899	<idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
				1900	<idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
				1901	<idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
				1902	<idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
				1903	<idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
				1904	<idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
				1905	<idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
				1906	<idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
				1907	<idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
				1908	<idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
				1909	<idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
				1910	<idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
				1911	<idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1912	<idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
				1913	<idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
				1914	<idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
				1915	<idle>-0 3.N.. 25us : schedule <-cpu_idle
				1916	<idle>-0 3.N.. 25us : __schedule <-preempt_schedule
				1917	<idle>-0 3.N.. 26us : add_preempt_count <-__schedule
				1918	<idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
				1919	<idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
				1920	<idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
				1921	<idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
				1922	<idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
				1923	<idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
				1924	<idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
				1925	<idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
				1926	<idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
				1927	<idle>-0 3d..3 29us : __schedule <-preempt_schedule
				1928	<idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
				1929
				1930	This isn't that big of a trace, even with function tracing enabled,
				1931	so I included the entire trace.
				1932
The interrupt went off while the system was idle. Somewhere
before task_woken_rt() was called, the NEED_RESCHED flag was set.
This is indicated by the first occurrence of the 'N' flag.
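
In a longer trace, that first occurrence can be located with a filter
instead of by eye. A minimal sketch, assuming the latency format shown
above, where the second column starts with the CPU number followed by
the irqs-off and need-resched characters, and where the task name
contains no spaces. The function name ``first_need_resched`` is made up
for this example.

```shell
# Print the first trace entry (read from stdin) whose need-resched
# column is 'N'. Header lines starting with '#' are skipped.
first_need_resched() {
    awk '!/^#/ && $2 ~ /^[0-9]+[^0-9]N/ { print; exit }'
}
```

Run as ``first_need_resched < trace``, this would print the
task_woken_rt entry for the trace above.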
				1936
				1937	Latency tracing and events
				1938	--------------------------
Function tracing can induce a much larger latency, but without
seeing what happens within the latency it is hard to know what
caused it. There is a middle ground, and that is to enable
events.
				1943	::
				1944
				1945	# echo 0 > options/function-trace
				1946	# echo wakeup_rt > current_tracer
				1947	# echo 1 > events/enable
				1948	# echo 1 > tracing_on
				1949	# echo 0 > tracing_max_latency
				1950	# chrt -f 5 sleep 1
				1951	# echo 0 > tracing_on
				1952	# cat trace
				1953	# tracer: wakeup_rt
				1954	#
				1955	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
				1956	# --------------------------------------------------------------------
  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1970	<idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
				1971	<idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
				1972	<idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
				1973	<idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
				1974	<idle>-0 2.N.2 2us : power_end: cpu_id=2
				1975	<idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
				1976	<idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
				1977	<idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
				1978	<idle>-0 2.N.2 5us : rcu_utilization: Start context switch
				1979	<idle>-0 2.N.2 5us : rcu_utilization: End context switch
				1980	<idle>-0 2d..3 6us : __schedule <-schedule
				1981	<idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
				1982
				1983
				1984	Hardware Latency Detector
				1985	-------------------------
				1986
				1987	The hardware latency detector is executed by enabling the "hwlat" tracer.
				1988
				1989	NOTE, this tracer will affect the performance of the system as it will
				1990	periodically make a CPU constantly busy with interrupts disabled.
				1991	::
				1992
				1993	# echo hwlat > current_tracer
				1994	# sleep 100
				1995	# cat trace
				1996	# tracer: hwlat
				1997	#
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
				2005	<...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
				2006	<...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
				2007	<...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
				2008	<...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
				2009	<...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
				2010	<...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
				2011
				2012
The header of the above output is much the same as that of the other
tracers. All events will have interrupts disabled 'd'. Under the
FUNCTION title there is:
				2015
				2016	#1
				2017	This is the count of events recorded that were greater than the
				2018	tracing_threshold (See below).
				2019
				2020	inner/outer(us): 12/14
				2021
This shows two numbers as "inner latency" and "outer latency". The test
runs in a loop checking a timestamp twice. The latency detected between
the two timestamps is the "inner latency", and the latency detected
between the previous timestamp and the next timestamp in the loop is
the "outer latency".
				2027
				2028	ts:1499801089.066141940
				2029
The absolute timestamp at which the event happened.
				2031
				2032	nmi-total:4 nmi-count:1
				2033
				2034	On architectures that support it, if an NMI comes in during the
				2035	test, the time spent in NMI is reported in "nmi-total" (in
				2036	microseconds).
				2037
				2038	All architectures that have NMIs will show the "nmi-count" if an
				2039	NMI comes in during the test.
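
To summarize a longer run, the inner and outer values can be reduced to
their maxima. A minimal sketch, assuming the line layout shown above,
where the ``inner/outer(us):`` label is followed by a single
``inner/outer`` token. The function name ``hwlat_max`` is made up for
this example.

```shell
# Print the largest inner and outer latency (in microseconds) found in
# hwlat trace output read from stdin, as "inner outer".
hwlat_max() {
    awk '
        !/^#/ {
            for (i = 1; i < NF; i++) {
                if ($i == "inner/outer(us):") {
                    split($(i + 1), lat, "/")
                    if (lat[1] + 0 > max_inner) max_inner = lat[1] + 0
                    if (lat[2] + 0 > max_outer) max_outer = lat[2] + 0
                }
            }
        }
        END { print max_inner + 0, max_outer + 0 }
    '
}
```

Run as ``hwlat_max < trace``; for the six events listed above it would
print ``12 14``.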
				2040
				2041	hwlat files:
				2042
  tracing_threshold
        This gets automatically set to "10" to represent 10
        microseconds. This is the threshold of latency that
        needs to be detected before the trace will be recorded.

        Note, when the hwlat tracer is finished (another tracer is
        written into "current_tracer"), the original value for
        tracing_threshold is placed back into this file.

  hwlat_detector/width
        The length of time the test runs with interrupts disabled.

  hwlat_detector/window
        The length of time of the window in which the test
        runs. That is, the test will run for "width"
        microseconds per "window" microseconds.

  tracing_cpumask
        When the test is started, a kernel thread is created that
        runs the test. This thread will alternate between CPUs
        listed in the tracing_cpumask between each period
        (one "window"). To limit the test to specific CPUs,
        set the mask in this file to only the CPUs that the test
        should run on.
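
Since the test keeps a CPU busy for "width" out of every "window"
microseconds, the share of CPU time it takes follows directly from the
two values. A minimal sketch; the function name ``hwlat_duty_percent``
is made up, and the numbers in the usage note are example values, not
values read from a real system.

```shell
# Print the percentage of each window during which the hwlat test keeps
# a CPU busy. Both arguments are in microseconds.
hwlat_duty_percent() {
    width=$1
    window=$2
    if [ "$window" -le 0 ] || [ "$width" -gt "$window" ]; then
        echo "invalid width/window" >&2
        return 1
    fi
    echo $((width * 100 / window))
}
```

For example, ``hwlat_duty_percent 500000 1000000`` prints ``50``. From
the tracing directory the real values could be passed in with
``hwlat_duty_percent "$(cat hwlat_detector/width)" "$(cat hwlat_detector/window)"``.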
				2067
				2068	function
				2069	--------
				2070
This tracer is the function tracer. Enabling the function tracer
can be done from the tracefs file system. Make sure that
ftrace_enabled is set; otherwise this tracer is a nop.
See the "ftrace_enabled" section below.
				2075	::
				2076
				2077	# sysctl kernel.ftrace_enabled=1
				2078	# echo function > current_tracer
				2079	# echo 1 > tracing_on
				2080	# usleep 1
				2081	# echo 0 > tracing_on
				2082	# cat trace
				2083	# tracer: function
				2084	#
				2085	# entries-in-buffer/entries-written: 24799/24799 #P:4
				2086	#
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
				2094	bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
				2095	bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
				2096	bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
				2097	bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
				2098	bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
				2099	bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
				2100	bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
				2101	bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
				2102	[...]
				2103
				2104
Note: the function tracer uses ring buffers to store the above
entries. The newest data may overwrite the oldest data.
Sometimes using echo to stop the trace is not sufficient because
the tracing could have overwritten the data that you wanted to
record. For this reason, it is sometimes better to disable
tracing directly from a program. This allows you to stop the
tracing at the point that you hit the part that you are
interested in. To disable the tracing directly from a C program,
something like the following code snippet can be used::
				2114
  int trace_fd;
  [...]
  int main(int argc, char *argv[]) {
          [...]
          trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
          [...]
          if (condition_hit()) {
                  write(trace_fd, "0", 1);
          }
          [...]
  }
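
When the program itself cannot be changed, a coarser alternative is a
small watcher written in shell. This is only a sketch: the helper name
``stop_trace_if`` and the convention of passing the tracing directory as
the first argument are assumptions made for this example, and a script
like this reacts far more slowly than a write() from inside the program.

```shell
# Write "0" to tracing_on in the given tracing directory if the given
# command succeeds. The check runs once per call; the caller decides
# how often to call it.
stop_trace_if() {
    tracing_dir=$1
    shift
    if "$@"; then
        echo 0 > "$tracing_dir/tracing_on"
    fi
}
```

For example, assuming tracefs is mounted at /sys/kernel/tracing,
``stop_trace_if /sys/kernel/tracing test -e /tmp/stop`` would stop
tracing once the (hypothetical) file /tmp/stop exists.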
				2126
				2127
				2128	Single thread tracing
				2129	---------------------
				2130
				2131	By writing into set_ftrace_pid you can trace a
				2132	single thread. For example::
				2133
				2134	# cat set_ftrace_pid
				2135	no pid
				2136	# echo 3111 > set_ftrace_pid
				2137	# cat set_ftrace_pid
				2138	3111
				2139	# echo function > current_tracer
  # cat trace | head
  # tracer: function
  #
  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #              | |       |          |         |
				2145	yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
				2146	yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
				2147	yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
				2148	yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
				2149	yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
				2150	yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
				2151	# echo > set_ftrace_pid
  # cat trace | head
  # tracer: function
  #
  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #              | |       |          |         |
				2157	##### CPU 3 buffer started ####
				2158	yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
				2159	yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
				2160	yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
				2161	yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
				2162	yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
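
To check that only the expected thread shows up, or to pick one thread
out of an unfiltered trace after the fact, the TASK-PID column can be
filtered. A minimal sketch, assuming the default trace format shown
above and task names without spaces. The function name
``trace_entries_for_pid`` is made up for this example.

```shell
# Print only the trace entries (read from stdin) recorded for the
# given PID. The PID is taken to be whatever follows the last '-' in
# the first column; header lines starting with '#' are skipped.
trace_entries_for_pid() {
    awk -v pid="$1" '
        !/^#/ && NF {
            n = split($1, part, "-")
            if (part[n] == pid) print
        }
    '
}
```

It would be used as ``trace_entries_for_pid 3111 < trace``.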
				2163
If you want to trace only the functions called on behalf of a command
while it executes, you could use something like this simple program.
				2166	::
				2167
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <string.h>

  #define _STR(x) #x
  #define STR(x) _STR(x)
  #define MAX_PATH 256

  /* Return the tracefs mount point found in /proc/mounts, or NULL. */
  const char *find_tracefs(void)
  {
         static char tracefs[MAX_PATH+1];
         static int tracefs_found;
         char type[100] = "";
         FILE *fp;

         if (tracefs_found)
                 return tracefs;

         if ((fp = fopen("/proc/mounts","r")) == NULL) {
                 perror("/proc/mounts");
                 return NULL;
         }

         /* Fields: device, mount point, type, options, dump, pass */
         while (fscanf(fp, "%*s %"
                       STR(MAX_PATH)
                       "s %99s %*s %*d %*d\n",
                       tracefs, type) == 2) {
                 if (strcmp(type, "tracefs") == 0)
                         break;
         }
         fclose(fp);

         if (strcmp(type, "tracefs") != 0) {
                 fprintf(stderr, "tracefs not mounted\n");
                 return NULL;
         }

         tracefs_found = 1;

         return tracefs;
  }

  /* Build the path of a control file inside the tracefs mount point. */
  const char *tracing_file(const char *file_name)
  {
         static char trace_file[MAX_PATH+64];
         const char *tracefs = find_tracefs();

         if (!tracefs)
                 exit(-1);
         /* The control files sit directly in the tracefs mount point. */
         snprintf(trace_file, sizeof(trace_file), "%s/%s", tracefs, file_name);
         return trace_file;
  }

  int main (int argc, char **argv)
  {
          if (argc < 2)
                  exit(-1);

          /* The parent keeps its PID across execvp(); the child just exits. */
          if (fork() > 0) {
                  int fd, ffd;
                  char line[64];
                  int s;

                  ffd = open(tracing_file("current_tracer"), O_WRONLY);
                  if (ffd < 0)
                          exit(-1);
                  write(ffd, "nop", 3);

                  fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
                  if (fd < 0)
                          exit(-1);
                  s = sprintf(line, "%d\n", getpid());
                  write(fd, line, s);

                  write(ffd, "function", 8);

                  close(fd);
                  close(ffd);

                  execvp(argv[1], argv+1);
          }

          return 0;
  }
				2251
				2252	Or this simple script!
				2253	::
				2254
  #!/bin/bash

  # Use the first tracefs mount point listed in /proc/mounts.
  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts | head -n 1`
  echo nop > $tracefs/current_tracer
  echo 0 > $tracefs/tracing_on
  echo $$ > $tracefs/set_ftrace_pid
  echo function > $tracefs/current_tracer
  echo 1 > $tracefs/tracing_on
  exec "$@"
				2264
				2265
				2266	function graph tracer
				2267	---------------------------
				2268
This tracer is similar to the function tracer except that it
probes a function on its entry and its exit. This is done by
using a dynamically allocated stack of return addresses in each
task_struct. On function entry the tracer overwrites the return
address of each function traced to set a custom probe. The
original return address is therefore stored on the stack of
return addresses in the task_struct.
				2276
				2277	Probing on both ends of a function leads to special features
				2278	such as:
				2279
- measuring a function's execution time
- having a reliable call stack to draw a graph of function calls
				2282
				2283	This tracer is useful in several situations:
				2284
- you want to find the reason for a strange kernel behavior and
  need to see what happens in detail in any areas (or specific
  ones).

- you are experiencing weird latencies but it's difficult to
  find their origin.

- you want to quickly find which path is taken by a specific
  function.

- you just want to peek inside a working kernel and want to see
  what happens there.
				2297
				2298	::
				2299
  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |

   0)               |  sys_open() {
   0)               |    do_sys_open() {
   0)               |      getname() {
   0)               |        kmem_cache_alloc() {
   0)   1.382 us    |          __might_sleep();
   0)   2.478 us    |        }
   0)               |        strncpy_from_user() {
   0)               |          might_fault() {
   0)   1.389 us    |            __might_sleep();
   0)   2.553 us    |          }
   0)   3.807 us    |        }
   0)   7.876 us    |      }
   0)               |      alloc_fd() {
   0)   0.668 us    |        _spin_lock();
   0)   0.570 us    |        expand_files();
   0)   0.586 us    |        _spin_unlock();

				2321
				2322
				2323	There are several columns that can be dynamically
				2324	enabled/disabled. You can use every combination of options you
				2325	want, depending on your needs.
				2326
				2327	- The cpu number on which the function executed is default
				2328	enabled. It is sometimes better to only trace one cpu (see
				2329	tracing_cpu_mask file) or you might sometimes see unordered
				2330	function calls while cpu tracing switch.
				2331
				2332	- hide: echo nofuncgraph-cpu > trace_options
				2333	- show: echo funcgraph-cpu > trace_options
				2334
				2335	- The duration (function's time of execution) is displayed on
				2336	the closing bracket line of a function or on the same line
				2337	than the current function in case of a leaf one. It is default
				2338	enabled.
				2339
				2340	- hide: echo nofuncgraph-duration > trace_options
				2341	- show: echo funcgraph-duration > trace_options
				2342
- The overhead field precedes the duration field when a
duration threshold is reached.
				2345
				2346	- hide: echo nofuncgraph-overhead > trace_options
				2347	- show: echo funcgraph-overhead > trace_options
				2348	- depends on: funcgraph-duration
				2349
				2350	ie::
				2351
				2352	3) # 1837.709 us \| } /* __switch_to */
				2353	3) \| finish_task_switch() {
				2354	3) 0.313 us \| _raw_spin_unlock_irq();
				2355	3) 3.177 us \| }
				2356	3) # 1889.063 us \| } /* __schedule */
				2357	3) ! 140.417 us \| } /* __schedule */
				2358	3) # 2034.948 us \| } /* schedule */
				2359	3) * 33998.59 us \| } /* schedule_preempt_disabled */
				2360
				2361	[...]
				2362
				2363	1) 0.260 us \| msecs_to_jiffies();
				2364	1) 0.313 us \| __rcu_read_unlock();
				2365	1) + 61.770 us \| }
				2366	1) + 64.479 us \| }
				2367	1) 0.313 us \| rcu_bh_qs();
				2368	1) 0.313 us \| __local_bh_enable();
				2369	1) ! 217.240 us \| }
				2370	1) 0.365 us \| idle_cpu();
				2371	1) \| rcu_irq_exit() {
				2372	1) 0.417 us \| rcu_eqs_enter_common.isra.47();
				2373	1) 3.125 us \| }
				2374	1) ! 227.812 us \| }
				2375	1) ! 457.395 us \| }
				2376	1) @ 119760.2 us \| }
				2377
				2378	[...]
				2379
				2380	2) \| handle_IPI() {
				2381	1) 6.979 us \| }
				2382	2) 0.417 us \| scheduler_ipi();
				2383	1) 9.791 us \| }
				2384	1) + 12.917 us \| }
				2385	2) 3.490 us \| }
				2386	1) + 15.729 us \| }
				2387	1) + 18.542 us \| }
				2388	2) $ 3594274 us \| }
				2389
				2390	Flags::
				2391
				2392	+ means that the function exceeded 10 usecs.
				2393	! means that the function exceeded 100 usecs.
				2394	# means that the function exceeded 1000 usecs.
				2395	* means that the function exceeded 10 msecs.
				2396	@ means that the function exceeded 100 msecs.
				2397	$ means that the function exceeded 1 sec.
				2398
				2399
- The task/pid field displays the thread cmdline and pid which
executed the function. It is disabled by default.
				2402
				2403	- hide: echo nofuncgraph-proc > trace_options
				2404	- show: echo funcgraph-proc > trace_options
				2405
				2406	ie::
				2407
				2408	# tracer: function_graph
				2409	#
				2410	# CPU TASK/PID DURATION FUNCTION CALLS
				2411	# \| \| \| \| \| \| \| \| \|
				2412	0) sh-4802 \| \| d_free() {
				2413	0) sh-4802 \| \| call_rcu() {
				2414	0) sh-4802 \| \| __call_rcu() {
				2415	0) sh-4802 \| 0.616 us \| rcu_process_gp_end();
				2416	0) sh-4802 \| 0.586 us \| check_for_new_grace_period();
				2417	0) sh-4802 \| 2.899 us \| }
				2418	0) sh-4802 \| 4.040 us \| }
				2419	0) sh-4802 \| 5.151 us \| }
				2420	0) sh-4802 \| + 49.370 us \| }
				2421
				2422
- The absolute time field is an absolute timestamp given by the
system clock since it started. A snapshot of this time is
given on each entry/exit of functions.
				2426
				2427	- hide: echo nofuncgraph-abstime > trace_options
				2428	- show: echo funcgraph-abstime > trace_options
				2429
				2430	ie::
				2431
				2432	#
				2433	# TIME CPU DURATION FUNCTION CALLS
				2434	# \| \| \| \| \| \| \| \|
				2435	360.774522 \| 1) 0.541 us \| }
				2436	360.774522 \| 1) 4.663 us \| }
				2437	360.774523 \| 1) 0.541 us \| __wake_up_bit();
				2438	360.774524 \| 1) 6.796 us \| }
				2439	360.774524 \| 1) 7.952 us \| }
				2440	360.774525 \| 1) 9.063 us \| }
				2441	360.774525 \| 1) 0.615 us \| journal_mark_dirty();
				2442	360.774527 \| 1) 0.578 us \| __brelse();
				2443	360.774528 \| 1) \| reiserfs_prepare_for_journal() {
				2444	360.774528 \| 1) \| unlock_buffer() {
				2445	360.774529 \| 1) \| wake_up_bit() {
				2446	360.774529 \| 1) \| bit_waitqueue() {
				2447	360.774530 \| 1) 0.594 us \| __phys_addr();
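The overhead flags described for funcgraph-overhead above amount to a
threshold lookup on the duration. As a rough sketch (the name
overhead_flag is made up here for illustration and is not part of
ftrace; it assumes an awk on the PATH), the mapping can be written as:

```shell
# overhead_flag DURATION_US
# Prints the flag character that would precede a duration of
# DURATION_US microseconds, or an empty line below the 10 usec
# threshold. Hypothetical helper for illustration; not part of ftrace.
overhead_flag() {
    awk -v d="$1" 'BEGIN {
        if      (d > 1000000) f = "$"   # exceeded 1 sec
        else if (d > 100000)  f = "@"   # exceeded 100 msecs
        else if (d > 10000)   f = "*"   # exceeded 10 msecs
        else if (d > 1000)    f = "#"   # exceeded 1000 usecs
        else if (d > 100)     f = "!"   # exceeded 100 usecs
        else if (d > 10)      f = "+"   # exceeded 10 usecs
        else                  f = ""
        print f
    }'
}

# Two durations taken from the sample output above.
overhead_flag 61.770
overhead_flag 33998.59
```

The two calls print "+" and "*", the same flags that accompany those
durations in the sample trace.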
				2448
				2449
				2450	The function name is always displayed after the closing bracket
				2451	for a function if the start of that function is not in the
				2452	trace buffer.
				2453
				2454	Display of the function name after the closing bracket may be
				2455	enabled for functions whose start is in the trace buffer,
				2456	allowing easier searching with grep for function durations.
It is disabled by default.
				2458
				2459	- hide: echo nofuncgraph-tail > trace_options
				2460	- show: echo funcgraph-tail > trace_options
				2461
				2462	Example with nofuncgraph-tail (default)::
				2463
				2464	0) \| putname() {
				2465	0) \| kmem_cache_free() {
				2466	0) 0.518 us \| __phys_addr();
				2467	0) 1.757 us \| }
				2468	0) 2.861 us \| }
				2469
				2470	Example with funcgraph-tail::
				2471
				2472	0) \| putname() {
				2473	0) \| kmem_cache_free() {
				2474	0) 0.518 us \| __phys_addr();
				2475	0) 1.757 us \| } /* kmem_cache_free() */
				2476	0) 2.861 us \| } /* putname() */
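With funcgraph-tail set, the durations of one function can be pulled
out of the output with ordinary text tools. The following is a minimal
sketch, assuming the column layout shown in the examples above; the
helper name graph_durations is hypothetical and not part of ftrace:

```shell
# graph_durations FUNCTION
# Reads function_graph output (with funcgraph-tail set) on stdin and
# prints the duration, in microseconds, from each closing-bracket line
# whose tail comment names FUNCTION. Hypothetical helper, not part of
# ftrace. Accepts both "/* name */" and "/* name() */" tail comments.
graph_durations() {
    awk -v fn="$1" -F'|' '
        NF >= 2 && (index($NF, "/* " fn " */") ||
                    index($NF, "/* " fn "() */")) {
            # The column just left of the last "|" holds an optional
            # CPU number, an optional overhead flag, the duration, "us".
            n = split($(NF - 1), left, /[ \t]+/)
            for (i = 2; i <= n; i++)
                if (left[i] == "us")
                    print left[i - 1]
        }'
}

# Example on two lines of the sample output shown above:
printf '%s\n' \
    ' 0)   1.757 us    |    } /* kmem_cache_free() */' \
    ' 0)   2.861 us    |  } /* putname() */' |
    graph_durations putname
```

On a live system the input would come from the trace file instead, for
example: graph_durations putname < trace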
				2477
You can put comments on specific functions by using
trace_printk(). For example, if you want to put a comment inside
the __might_sleep() function, you just have to include
<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
				2482
trace_printk("I'm a comment!\n");
				2484
				2485	will produce::
				2486
				2487	1) \| __might_sleep() {
				2488	1) \| /* I'm a comment! */
				2489	1) 1.449 us \| }
				2490
				2491
				2492	You might find other useful features for this tracer in the
				2493	following "dynamic ftrace" section such as tracing only specific
				2494	functions or tasks.
				2495
				2496	dynamic ftrace
				2497	--------------
				2498
				2499	If CONFIG_DYNAMIC_FTRACE is set, the system will run with
				2500	virtually no overhead when function tracing is disabled. The way
				2501	this works is the mcount function call (placed at the start of
				2502	every kernel function, produced by the -pg switch in gcc),
starts off pointing to a simple return. (Enabling FTRACE will
				2504	include the -pg switch in the compiling of the kernel.)
				2505
				2506	At compile time every C file object is run through the
				2507	recordmcount program (located in the scripts directory). This
				2508	program will parse the ELF headers in the C object to find all
the locations in the .text section that call mcount. Starting
with gcc version 4.6, the -mfentry option has been added for x86,
which calls "__fentry__" instead of "mcount". The "__fentry__"
call is made before the creation of the stack frame.
				2513
Note, not all sections are traced. Tracing may be prevented by
a notrace annotation or blocked in another way, and inline functions
are never traced. Check the "available_filter_functions" file to see
which functions can be traced.
				2518
				2519	A section called "__mcount_loc" is created that holds
				2520	references to all the mcount/fentry call sites in the .text section.
				2521	The recordmcount program re-links this section back into the
				2522	original object. The final linking stage of the kernel will add all these
				2523	references into a single table.
				2524
				2525	On boot up, before SMP is initialized, the dynamic ftrace code
				2526	scans this table and updates all the locations into nops. It
				2527	also records the locations, which are added to the
				2528	available_filter_functions list. Modules are processed as they
				2529	are loaded and before they are executed. When a module is
				2530	unloaded, it also removes its functions from the ftrace function
				2531	list. This is automatic in the module unload code, and the
				2532	module author does not need to worry about it.
				2533
				2534	When tracing is enabled, the process of modifying the function
				2535	tracepoints is dependent on architecture. The old method is to use
				2536	kstop_machine to prevent races with the CPUs executing code being
				2537	modified (which can cause the CPU to do undesirable things, especially
				2538	if the modified code crosses cache (or page) boundaries), and the nops are
				2539	patched back to calls. But this time, they do not call mcount
				2540	(which is just a function stub). They now call into the ftrace
				2541	infrastructure.
				2542
The new method of modifying the function tracepoints is to place
a breakpoint at the location to be modified, sync all CPUs, and modify
the rest of the instruction not covered by the breakpoint. Then sync
all CPUs again, and replace the breakpoint with the finished
version of the ftrace call site.
				2548
				2549	Some archs do not even need to monkey around with the synchronization,
				2550	and can just slap the new code on top of the old without any
				2551	problems with other CPUs executing it at the same time.
				2552
				2553	One special side-effect to the recording of the functions being
				2554	traced is that we can now selectively choose which functions we
				2555	wish to trace and which ones we want the mcount calls to remain
				2556	as nops.
				2557
				2558	Two files are used, one for enabling and one for disabling the
				2559	tracing of specified functions. They are:
				2560
				2561	set_ftrace_filter
				2562
				2563	and
				2564
				2565	set_ftrace_notrace
				2566
				2567	A list of available functions that you can add to these files is
				2568	listed in:
				2569
				2570	available_filter_functions
				2571
				2572	::
				2573
				2574	# cat available_filter_functions
				2575	put_prev_task_idle
				2576	kmem_cache_create
				2577	pick_next_task_rt
				2578	get_online_cpus
				2579	pick_next_task_fair
				2580	mutex_lock
				2581	[...]
				2582
				2583	If I am only interested in sys_nanosleep and hrtimer_interrupt::
				2584
				2585	# echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
				2586	# echo function > current_tracer
				2587	# echo 1 > tracing_on
				2588	# usleep 1
				2589	# echo 0 > tracing_on
				2590	# cat trace
				2591	# tracer: function
				2592	#
				2593	# entries-in-buffer/entries-written: 5/5 #P:4
				2594	#
				2595	# _-----=> irqs-off
				2596	# / _----=> need-resched
				2597	# \| / _---=> hardirq/softirq
				2598	# \|\| / _--=> preempt-depth
				2599	# \|\|\| / delay
				2600	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2601	# \| \| \| \|\|\|\| \| \|
				2602	usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
				2603	<idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
				2604	usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
				2605	<idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
				2606	<idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
				2607
				2608	To see which functions are being traced, you can cat the file:
				2609	::
				2610
				2611	# cat set_ftrace_filter
				2612	hrtimer_interrupt
				2613	sys_nanosleep
				2614
				2615
				2616	Perhaps this is not enough. The filters also allow glob(7) matching.
				2617
				2618	<match>*
				2619	will match functions that begin with <match>
				2620	*<match>
				2621	will match functions that end with <match>
*<match>*
				2623	will match functions that have <match> in it
				2624	<match1>*<match2>
				2625	will match functions that begin with <match1> and end with <match2>
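Before writing a glob into a filter file, it can help to preview which
names it would select. The sketch below uses the shell's own pattern
matching as a stand-in for the kernel's matcher, which is an assumption:
the two agree for the simple forms listed above but may differ in corner
cases. The helper name preview_filter is hypothetical:

```shell
# preview_filter GLOB
# Reads function names (one per line) on stdin and prints those that
# GLOB matches under the shell's pattern rules. Hypothetical helper;
# it approximates, and does not call, the kernel's filter matching.
preview_filter() {
    while IFS= read -r fn; do
        # $1 is deliberately unquoted so that it is used as a pattern.
        case "$fn" in
            $1) printf '%s\n' "$fn" ;;
        esac
    done
}

# Small stand-in list; prints hrtimer_init and hrtimer_wakeup.
printf '%s\n' hrtimer_init sys_nanosleep hrtimer_wakeup |
    preview_filter 'hrtimer_*'
```

Against a real system the list would come from
available_filter_functions, for example:
preview_filter 'hrtimer_*' < available_filter_functions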
				2626
				2627	.. note::
				2628	It is better to use quotes to enclose the wild cards,
				2629	otherwise the shell may expand the parameters into names
				2630	of files in the local directory.
				2631
				2632	::
				2633
				2634	# echo 'hrtimer_*' > set_ftrace_filter
				2635
				2636	Produces::
				2637
				2638	# tracer: function
				2639	#
				2640	# entries-in-buffer/entries-written: 897/897 #P:4
				2641	#
				2642	# _-----=> irqs-off
				2643	# / _----=> need-resched
				2644	# \| / _---=> hardirq/softirq
				2645	# \|\| / _--=> preempt-depth
				2646	# \|\|\| / delay
				2647	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2648	# \| \| \| \|\|\|\| \| \|
				2649	<idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
				2650	<idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
				2651	<idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
				2652	<idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
				2653	<idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
				2654	<idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
				2655	<idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
				2656	<idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
				2657
				2658	Notice that we lost the sys_nanosleep.
				2659	::
				2660
				2661	# cat set_ftrace_filter
				2662	hrtimer_run_queues
				2663	hrtimer_run_pending
				2664	hrtimer_init
				2665	hrtimer_cancel
				2666	hrtimer_try_to_cancel
				2667	hrtimer_forward
				2668	hrtimer_start
				2669	hrtimer_reprogram
				2670	hrtimer_force_reprogram
				2671	hrtimer_get_next_event
				2672	hrtimer_interrupt
				2673	hrtimer_nanosleep
				2674	hrtimer_wakeup
				2675	hrtimer_get_remaining
				2676	hrtimer_get_res
				2677	hrtimer_init_sleeper
				2678
				2679
This is because the '>' and '>>' act just like they do in bash.
To rewrite the filters, use '>'.
To append to the filters, use '>>'.
				2683
				2684	To clear out a filter so that all functions will be recorded
				2685	again::
				2686
				2687	# echo > set_ftrace_filter
				2688	# cat set_ftrace_filter
				2689	#
				2690
				2691	Again, now we want to append.
				2692
				2693	::
				2694
				2695	# echo sys_nanosleep > set_ftrace_filter
				2696	# cat set_ftrace_filter
				2697	sys_nanosleep
				2698	# echo 'hrtimer_*' >> set_ftrace_filter
				2699	# cat set_ftrace_filter
				2700	hrtimer_run_queues
				2701	hrtimer_run_pending
				2702	hrtimer_init
				2703	hrtimer_cancel
				2704	hrtimer_try_to_cancel
				2705	hrtimer_forward
				2706	hrtimer_start
				2707	hrtimer_reprogram
				2708	hrtimer_force_reprogram
				2709	hrtimer_get_next_event
				2710	hrtimer_interrupt
				2711	sys_nanosleep
				2712	hrtimer_nanosleep
				2713	hrtimer_wakeup
				2714	hrtimer_get_remaining
				2715	hrtimer_get_res
				2716	hrtimer_init_sleeper
				2717
				2718
				2719	The set_ftrace_notrace prevents those functions from being
				2720	traced.
				2721	::
				2722
# echo '*preempt*' '*lock*' > set_ftrace_notrace
				2724
				2725	Produces::
				2726
				2727	# tracer: function
				2728	#
				2729	# entries-in-buffer/entries-written: 39608/39608 #P:4
				2730	#
				2731	# _-----=> irqs-off
				2732	# / _----=> need-resched
				2733	# \| / _---=> hardirq/softirq
				2734	# \|\| / _--=> preempt-depth
				2735	# \|\|\| / delay
				2736	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2737	# \| \| \| \|\|\|\| \| \|
				2738	bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
				2739	bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
				2740	bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
				2741	bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
				2742	bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
				2743	bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
				2744	bash-1994 [000] .... 4342.324899: do_truncate <-do_last
				2745	bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
				2746	bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
				2747	bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
				2748	bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
				2749	bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
				2750
				2751	We can see that there's no more lock or preempt tracing.
				2752
				2753
				2754	Dynamic ftrace with the function graph tracer
				2755	---------------------------------------------
				2756
				2757	Although what has been explained above concerns both the
				2758	function tracer and the function-graph-tracer, there are some
				2759	special features only available in the function-graph tracer.
				2760
				2761	If you want to trace only one function and all of its children,
				2762	you just have to echo its name into set_graph_function::
				2763
				2764	echo __do_fault > set_graph_function
				2765
				2766	will produce the following "expanded" trace of the __do_fault()
				2767	function::
				2768
				2769	0) \| __do_fault() {
				2770	0) \| filemap_fault() {
				2771	0) \| find_lock_page() {
				2772	0) 0.804 us \| find_get_page();
				2773	0) \| __might_sleep() {
				2774	0) 1.329 us \| }
				2775	0) 3.904 us \| }
				2776	0) 4.979 us \| }
				2777	0) 0.653 us \| _spin_lock();
				2778	0) 0.578 us \| page_add_file_rmap();
				2779	0) 0.525 us \| native_set_pte_at();
				2780	0) 0.585 us \| _spin_unlock();
				2781	0) \| unlock_page() {
				2782	0) 0.541 us \| page_waitqueue();
				2783	0) 0.639 us \| __wake_up_bit();
				2784	0) 2.786 us \| }
				2785	0) + 14.237 us \| }
				2786	0) \| __do_fault() {
				2787	0) \| filemap_fault() {
				2788	0) \| find_lock_page() {
				2789	0) 0.698 us \| find_get_page();
				2790	0) \| __might_sleep() {
				2791	0) 1.412 us \| }
				2792	0) 3.950 us \| }
				2793	0) 5.098 us \| }
				2794	0) 0.631 us \| _spin_lock();
				2795	0) 0.571 us \| page_add_file_rmap();
				2796	0) 0.526 us \| native_set_pte_at();
				2797	0) 0.586 us \| _spin_unlock();
				2798	0) \| unlock_page() {
				2799	0) 0.533 us \| page_waitqueue();
				2800	0) 0.638 us \| __wake_up_bit();
				2801	0) 2.793 us \| }
				2802	0) + 14.012 us \| }
				2803
				2804	You can also expand several functions at once::
				2805
				2806	echo sys_open > set_graph_function
				2807	echo sys_close >> set_graph_function
				2808
				2809	Now if you want to go back to trace all functions you can clear
				2810	this special filter via::
				2811
				2812	echo > set_graph_function
				2813
				2814
				2815	ftrace_enabled
				2816	--------------
				2817
Note, the proc sysctl ftrace_enabled is a big on/off switch for the
				2819	function tracer. By default it is enabled (when function tracing is
				2820	enabled in the kernel). If it is disabled, all function tracing is
				2821	disabled. This includes not only the function tracers for ftrace, but
				2822	also for any other uses (perf, kprobes, stack tracing, profiling, etc).
				2823
				2824	Please disable this with care.
				2825
This can be disabled (and enabled) with::
				2827
				2828	sysctl kernel.ftrace_enabled=0
				2829	sysctl kernel.ftrace_enabled=1
				2830
				2831	or
				2832
				2833	echo 0 > /proc/sys/kernel/ftrace_enabled
				2834	echo 1 > /proc/sys/kernel/ftrace_enabled
				2835
				2836
				2837	Filter commands
				2838	---------------
				2839
				2840	A few commands are supported by the set_ftrace_filter interface.
				2841	Trace commands have the following format::
				2842
				2843	<function>:<command>:<parameter>
				2844
				2845	The following commands are supported:
				2846
				2847	- mod:
				2848	This command enables function filtering per module. The
				2849	parameter defines the module. For example, if only the write*
				2850	functions in the ext3 module are desired, run:
				2851
				2852	echo 'write*:mod:ext3' > set_ftrace_filter
				2853
				2854	This command interacts with the filter in the same way as
				2855	filtering based on function names. Thus, adding more functions
				2856	in a different module is accomplished by appending (>>) to the
				2857	filter file. Remove specific module functions by prepending
				2858	'!'::
				2859
				2860	echo '!writeback*:mod:ext3' >> set_ftrace_filter
				2861
				2862	Mod command supports module globbing. Disable tracing for all
				2863	functions except a specific module::
				2864
				2865	echo '!*:mod:!ext3' >> set_ftrace_filter
				2866
				2867	Disable tracing for all modules, but still trace kernel::
				2868
echo '!*:mod:*' >> set_ftrace_filter
				2870
				2871	Enable filter only for kernel::
				2872
echo '*write*:mod:!*' >> set_ftrace_filter
				2874
				2875	Enable filter for module globbing::
				2876
echo '*write*:mod:*snd*' >> set_ftrace_filter
				2878
				2879	- traceon/traceoff:
				2880	These commands turn tracing on and off when the specified
				2881	functions are hit. The parameter determines how many times the
				2882	tracing system is turned on and off. If unspecified, there is
				2883	no limit. For example, to disable tracing when a schedule bug
				2884	is hit the first 5 times, run::
				2885
				2886	echo '__schedule_bug:traceoff:5' > set_ftrace_filter
				2887
				2888	To always disable tracing when __schedule_bug is hit::
				2889
				2890	echo '__schedule_bug:traceoff' > set_ftrace_filter
				2891
These commands are cumulative whether or not they are appended
to set_ftrace_filter. To remove a command, prepend '!' to it
and drop the parameter::
				2895
				2896	echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
				2897
The above removes the traceoff commands for __schedule_bug
that have a counter. To remove commands without counters::
				2900
				2901	echo '!__schedule_bug:traceoff' > set_ftrace_filter
				2902
				2903	- snapshot:
				2904	Will cause a snapshot to be triggered when the function is hit.
				2905	::
				2906
				2907	echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
				2908
				2909	To only snapshot once:
				2910	::
				2911
				2912	echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
				2913
				2914	To remove the above commands::
				2915
				2916	echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
				2917	echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
				2918
				2919	- enable_event/disable_event:
				2920	These commands can enable or disable a trace event. Note, because
				2921	function tracing callbacks are very sensitive, when these commands
				2922	are registered, the trace point is activated, but disabled in
				2923	a "soft" mode. That is, the tracepoint will be called, but
				2924	just will not be traced. The event tracepoint stays in this mode
				2925	as long as there's a command that triggers it.
				2926	::
				2927
				2928	echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
				2929	set_ftrace_filter
				2930
				2931	The format is::
				2932
				2933	<function>:enable_event:<system>:<event>[:count]
				2934	<function>:disable_event:<system>:<event>[:count]
				2935
				2936	To remove the events commands::
				2937
				2938	echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
				2939	set_ftrace_filter
				2940	echo '!schedule:disable_event:sched:sched_switch' > \
				2941	set_ftrace_filter
				2942
				2943	- dump:
				2944	When the function is hit, it will dump the contents of the ftrace
				2945	ring buffer to the console. This is useful if you need to debug
				2946	something, and want to dump the trace when a certain function
is hit. Perhaps it's a function that is called before a triple
fault happens and does not allow you to get a regular dump.
				2949
				2950	- cpudump:
				2951	When the function is hit, it will dump the contents of the ftrace
				2952	ring buffer for the current CPU to the console. Unlike the "dump"
				2953	command, it only prints out the contents of the ring buffer for the
				2954	CPU that executed the function that triggered the dump.
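All of the commands above share the <function>:<command>:<parameter>
layout, so the strings can be assembled mechanically. A small sketch
follows; the helper name filter_cmd is hypothetical, and it only formats
text without checking that the command is one the kernel supports:

```shell
# filter_cmd FUNCTION COMMAND [PARAMETER]
# Prints a string in the <function>:<command>[:<parameter>] form.
# Hypothetical helper for illustration; it performs no validation.
filter_cmd() {
    if [ -n "${3-}" ]; then
        printf '%s:%s:%s\n' "$1" "$2" "$3"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

# Rebuilds two of the command strings used in the examples above.
filter_cmd __schedule_bug traceoff 5
filter_cmd try_to_wake_up enable_event sched:sched_switch:2
```

The output would then be redirected into set_ftrace_filter in the same
way as the echo commands shown above.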
				2955
				2956	trace_pipe
				2957	----------
				2958
				2959	The trace_pipe outputs the same content as the trace file, but
				2960	the effect on the tracing is different. Every read from
				2961	trace_pipe is consumed. This means that subsequent reads will be
				2962	different. The trace is live.
				2963	::
				2964
				2965	# echo function > current_tracer
				2966	# cat trace_pipe > /tmp/trace.out &
				2967	[1] 4153
				2968	# echo 1 > tracing_on
				2969	# usleep 1
				2970	# echo 0 > tracing_on
				2971	# cat trace
				2972	# tracer: function
				2973	#
				2974	# entries-in-buffer/entries-written: 0/0 #P:4
				2975	#
				2976	# _-----=> irqs-off
				2977	# / _----=> need-resched
				2978	# \| / _---=> hardirq/softirq
				2979	# \|\| / _--=> preempt-depth
				2980	# \|\|\| / delay
				2981	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2982	# \| \| \| \|\|\|\| \| \|
				2983
				2984	#
				2985	# cat /tmp/trace.out
				2986	bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
				2987	bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
				2988	bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
				2989	bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
				2990	bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
				2991	bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
				2992	bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
				2993	bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
				2994	bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
				2995
				2996
				2997	Note, reading the trace_pipe file will block until more input is
				2998	added.
				2999
				3000	trace entries
				3001	-------------
				3002
				3003	Having too much or not enough data can be troublesome in
				3004	diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the size, in kilobytes, of the buffer for each
CPU. To know the full size, multiply the number of possible CPUs
by the per-CPU size.
				3009	::
				3010
				3011	# cat buffer_size_kb
				3012	1408 (units kilobytes)
				3013
				3014	Or simply read buffer_total_size_kb
				3015	::
				3016
				3017	# cat buffer_total_size_kb
				3018	5632
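The two numbers above are related by a simple multiplication. As a
worked sketch, assuming a system with four CPUs (as the "#P:4" in the
earlier trace headers suggests) whose buffers all have the same size;
the helper name buffer_total_kb is hypothetical:

```shell
# buffer_total_kb PER_CPU_KB NR_CPUS
# Hypothetical helper: total buffer size in kilobytes when every CPU
# buffer has the same size.
buffer_total_kb() {
    echo $(( $1 * $2 ))
}

# 1408 KB per CPU on 4 CPUs gives the 5632 reported above.
buffer_total_kb 1408 4
```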
				3019
To modify the buffer, simply echo in a number (in 1024 byte segments).
				3021	::
				3022
				3023	# echo 10000 > buffer_size_kb
				3024	# cat buffer_size_kb
				3025	10000 (units kilobytes)
				3026
				3027	It will try to allocate as much as possible. If you allocate too
				3028	much, it can cause Out-Of-Memory to trigger.
				3029	::
				3030
				3031	# echo 1000000000000 > buffer_size_kb
				3032	-bash: echo: write error: Cannot allocate memory
				3033	# cat buffer_size_kb
				3034	85
				3035
				3036	The per_cpu buffers can be changed individually as well:
				3037	::
				3038
				3039	# echo 10000 > per_cpu/cpu0/buffer_size_kb
				3040	# echo 100 > per_cpu/cpu1/buffer_size_kb
				3041
				3042	When the per_cpu buffers are not the same, the buffer_size_kb
				3043	at the top level will just show an X
				3044	::
				3045
				3046	# cat buffer_size_kb
				3047	X
				3048
				3049	This is where the buffer_total_size_kb is useful:
				3050	::
				3051
				3052	# cat buffer_total_size_kb
				3053	12916
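The total is the sum of the individual per_cpu sizes. The value 12916
above is consistent with a four-CPU system where cpu0 holds 10000,
cpu1 holds 100, and the other two stay at 1408 (that last part is an
inference from the earlier example, not something the output states).
A minimal sketch of the sum, with a hypothetical helper name and the
tracing directory passed in as an argument:

```shell
# sum_per_cpu_kb TRACING_DIR
# Adds up the first field of every per_cpu/cpuN/buffer_size_kb file
# under TRACING_DIR and prints the total. Hypothetical helper for
# illustration; unreadable or empty files are skipped.
sum_per_cpu_kb() {
    total=0
    for f in "$1"/per_cpu/cpu*/buffer_size_kb; do
        [ -r "$f" ] || continue
        # Keep only the first field; any trailing annotation is ignored.
        read -r kb _ < "$f" || [ -n "$kb" ] || continue
        total=$(( total + kb ))
    done
    echo "$total"
}

# Usage, from inside the tracing directory:
#   sum_per_cpu_kb .
```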
				3054
				3055	Writing to the top level buffer_size_kb will reset all the buffers
				3056	to be the same again.
				3057
				3058	Snapshot
				3059	--------
				3060	CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
				3061	available to all non latency tracers. (Latency tracers which
				3062	record max latency, such as "irqsoff" or "wakeup", can't use
				3063	this feature, since those are already using the snapshot
				3064	mechanism internally.)
				3065
				3066	Snapshot preserves a current trace buffer at a particular point
				3067	in time without stopping tracing. Ftrace swaps the current
				3068	buffer with a spare buffer, and tracing continues in the new
				3069	current (=previous spare) buffer.
				3070
				3071	The following tracefs files in "tracing" are related to this
				3072	feature:
				3073
				3074	snapshot:
				3075
				3076	This is used to take a snapshot and to read the output
				3077	of the snapshot. Echo 1 into this file to allocate a
				3078	spare buffer and to take a snapshot (swap), then read
				3079	the snapshot from this file in the same format as
				3080	"trace" (described above in the section "The File
System"). Reading the snapshot and tracing can proceed
in parallel. When the spare buffer is allocated, echoing
0 frees it, and echoing any other (positive) value clears the
snapshot contents.
				3085	More details are shown in the table below.
				3086
+--------------+------------+------------+------------+
|status\\input |     0      |     1      |    else    |
+==============+============+============+============+
|not allocated |(do nothing)| alloc+swap |(do nothing)|
+--------------+------------+------------+------------+
|allocated     |    free    |    swap    |   clear    |
+--------------+------------+------------+------------+
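The table can also be read as a small decision function. The sketch
below only restates the table; the helper name snapshot_action and the
state labels "allocated" and "unallocated" are made up for this
illustration, and nothing here touches the real snapshot file:

```shell
# snapshot_action STATE INPUT
# STATE is "allocated" or "unallocated" (labels chosen for this
# sketch); INPUT is the value echoed into the snapshot file.
# Prints the action listed in the table above for that combination.
snapshot_action() {
    case "$1:$2" in
        unallocated:1) echo "alloc+swap" ;;
        unallocated:*) echo "(do nothing)" ;;
        allocated:0)   echo "free" ;;
        allocated:1)   echo "swap" ;;
        allocated:*)   echo "clear" ;;
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
}

snapshot_action unallocated 1
snapshot_action allocated 2
```

The two calls print "alloc+swap" and "clear".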
				3094
				3095	Here is an example of using the snapshot feature.
				3096	::
				3097
				3098	# echo 1 > events/sched/enable
				3099	# echo 1 > snapshot
				3100	# cat snapshot
				3101	# tracer: nop
				3102	#
				3103	# entries-in-buffer/entries-written: 71/71 #P:8
				3104	#
				3105	# _-----=> irqs-off
				3106	# / _----=> need-resched
				3107	# \| / _---=> hardirq/softirq
				3108	# \|\| / _--=> preempt-depth
				3109	# \|\|\| / delay
				3110	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				3111	# \| \| \| \|\|\|\| \| \|
				3112	<idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
				3113	sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
				3114	[...]
				3115	<idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
				3116
				3117	# cat trace
				3118	# tracer: nop
				3119	#
				3120	# entries-in-buffer/entries-written: 77/77 #P:8
				3121	#
				3122	# _-----=> irqs-off
				3123	# / _----=> need-resched
				3124	# \| / _---=> hardirq/softirq
				3125	# \|\| / _--=> preempt-depth
				3126	# \|\|\| / delay
				3127	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				3128	# \| \| \| \|\|\|\| \| \|
				3129	<idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
				3130	snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
				3131	[...]
				3132
				3133
If you try to use this snapshot feature when the current tracer is
one of the latency tracers, you will get the following results.
				3136	::
				3137
				3138	# echo wakeup > current_tracer
				3139	# echo 1 > snapshot
				3140	bash: echo: write error: Device or resource busy
				3141	# cat snapshot
				3142	cat: snapshot: Device or resource busy
				3143
				3144
				3145	Instances
				3146	---------
In the tracefs tracing directory is a directory called "instances".
New directories can be created inside it with mkdir and removed
with rmdir. A directory created with mkdir here will already
contain files and other directories as soon as it is created.
				3152	::
				3153
				3154	# mkdir instances/foo
				3155	# ls instances/foo
				3156	buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
				3157	set_event snapshot trace trace_clock trace_marker trace_options
				3158	trace_pipe tracing_on
				3159
As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
events are independent of the main directory and of any other
instances that are created.
				3164
				3165	The files in the new directory work just like the files with the
				3166	same name in the tracing directory, except that they operate on a
				3167	separate, new buffer. They affect that buffer but do not affect
				3168	the main buffer, with the exception of trace_options. Currently,
				3169	trace_options affects all instances and the top level buffer
				3170	alike, but this may change in future releases. That is, options
				3171	may become specific to the instance they reside in.
				3172
				3173	Notice that none of the function tracer files are there, nor are
				3174	current_tracer and available_tracers. This is because the buffers
				3175	can currently only have events enabled for them.
				3176	::
				3177
				3178	# mkdir instances/foo
				3179	# mkdir instances/bar
				3180	# mkdir instances/zoot
				3181	# echo 100000 > buffer_size_kb
				3182	# echo 1000 > instances/foo/buffer_size_kb
				3183	# echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
				3184	# echo function > current_tracer
				3185	# echo 1 > instances/foo/events/sched/sched_wakeup/enable
				3186	# echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
				3187	# echo 1 > instances/foo/events/sched/sched_switch/enable
				3188	# echo 1 > instances/bar/events/irq/enable
				3189	# echo 1 > instances/zoot/events/syscalls/enable
				3190	# cat trace_pipe
				3191	CPU:2 [LOST 11745 EVENTS]
				3192	bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
				3193	bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
				3194	bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
				3195	bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
				3196	bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
				3197	bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
				3198	bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
				3199	bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
				3200	bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
				3201	bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
				3202	bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
				3203	[...]
				3204
				3205	# cat instances/foo/trace_pipe
				3206	bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
				3207	bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
				3208	<idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
				3209	<idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
				3210	rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
				3211	bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
				3212	bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
				3213	bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
				3214	kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
				3215	kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
				3216	[...]
				3217
				3218	# cat instances/bar/trace_pipe
				3219	migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
				3220	<idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
				3221	bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
				3222	bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
				3223	bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
				3224	bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
				3225	bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
				3226	bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
				3227	sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
				3228	sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
				3229	sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
				3230	sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
				3231	[...]
				3232
				3233	# cat instances/zoot/trace
				3234	# tracer: nop
				3235	#
				3236	# entries-in-buffer/entries-written: 18996/18996 #P:4
				3237	#
				3238	#                              _-----=> irqs-off
				3239	#                             / _----=> need-resched
				3240	#                            | / _---=> hardirq/softirq
				3241	#                            || / _--=> preempt-depth
				3242	#                            ||| /     delay
				3243	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
				3244	#              | |       |   ||||       |         |
				3245	            bash-1998  [000] d...   140.733501: sys_write -> 0x2
				3246	            bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
				3247	            bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
				3248	            bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
				3249	            bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
				3250	            bash-1998  [000] d...   140.733510: sys_close(fd: a)
				3251	            bash-1998  [000] d...   140.733510: sys_close -> 0x0
				3252	            bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
				3253	            bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
				3254	            bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
				3255	            bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
				3256
				3257	You can see that the topmost trace buffer shows only the function tracing.
				3258	The foo instance displays wakeups and task switches, the bar instance
				3259	interrupt and softirq events, and the zoot instance system calls.
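
The per-instance setup shown above can be wrapped in a small helper. This
is a sketch under stated assumptions, not an ftrace interface: the function
name enable_instance_events and its argument order are hypothetical, and
events are named as "system/event" paths relative to the instance's events
directory.
::

    # Create an instance (if needed) and enable the given events in it.
    # $1: tracing directory, $2: instance name, rest: "system/event" names
    enable_instance_events() {
        dir=$1
        name=$2
        shift 2
        inst="$dir/instances/$name"
        # On tracefs, mkdir populates the new instance directory.
        [ -d "$inst" ] || mkdir "$inst" || return 1
        for ev in "$@"; do
            echo 1 > "$inst/events/$ev/enable" || return 1
        done
    }

For example, "enable_instance_events /sys/kernel/tracing foo
sched/sched_wakeup sched/sched_switch" would enable two of the events that
the foo instance uses above.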
				3260
				3261	To remove the instances, simply delete their directories:
				3262	::
				3263
				3264	# rmdir instances/foo
				3265	# rmdir instances/bar
				3266	# rmdir instances/zoot
				3267
				3268	Note, if a process has a trace file open in one of the instance
				3269	directories, the rmdir will fail with EBUSY.
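
Because of this, a cleanup script may want to retry the rmdir for a short
while instead of giving up on the first failure. The helper below is a
minimal sketch: the name remove_instance, the default of five attempts and
the one second delay are all arbitrary assumptions, and it retries on any
rmdir failure, not only on EBUSY.
::

    # Try to remove an instance, retrying while it may still be in use.
    # $1: instance directory, $2: number of attempts (assumed default: 5)
    remove_instance() {
        inst=$1
        tries=${2:-5}
        while [ "$tries" -gt 0 ]; do
            rmdir "$inst" 2>/dev/null && return 0
            tries=$((tries - 1))
            # Wait before the next attempt, but not after the last one.
            [ "$tries" -gt 0 ] && sleep 1
        done
        echo "could not remove $inst (still in use?)" >&2
        return 1
    }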
				3270
				3271
				3272	Stack trace
				3273	-----------
				3274	Since the kernel has a fixed-size stack, it is important not to
				3275	waste it in functions. A kernel developer must be conscious of
				3276	what they allocate on the stack. If they add too much, the system
				3277	can be in danger of a stack overflow, and corruption will occur,
				3278	usually leading to a system panic.
				3279
				3280	There are some tools that check this, usually by sampling stack
				3281	usage periodically from interrupts. A check at every function call
				3282	is more thorough, and since ftrace provides a function tracer, it is
				3283	convenient to check the stack size at every function call. This is
				3284	enabled via the stack tracer.
				3285
				3286	CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
				3287	To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
				3288	::
				3289
				3290	# echo 1 > /proc/sys/kernel/stack_tracer_enabled
				3291
				3292	You can also enable it from the kernel command line to trace
				3293	the stack size of the kernel during boot up, by adding "stacktrace"
				3294	to the kernel command line parameters.
				3295
				3296	After running it for a few minutes, the output looks like:
				3297	::
				3298
				3299	# cat stack_max_size
				3300	2928
				3301
				3302	# cat stack_trace
				3303	Depth Size Location (18 entries)
				3304	----- ---- --------
				3305	0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
				3306	1) 2704 160 find_busiest_group+0x31/0x1f1
				3307	2) 2544 256 load_balance+0xd9/0x662
				3308	3) 2288 80 idle_balance+0xbb/0x130
				3309	4) 2208 128 __schedule+0x26e/0x5b9
				3310	5) 2080 16 schedule+0x64/0x66
				3311	6) 2064 128 schedule_timeout+0x34/0xe0
				3312	7) 1936 112 wait_for_common+0x97/0xf1
				3313	8) 1824 16 wait_for_completion+0x1d/0x1f
				3314	9) 1808 128 flush_work+0xfe/0x119
				3315	10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
				3316	11) 1664 48 input_available_p+0x1d/0x5c
				3317	12) 1616 48 n_tty_poll+0x6d/0x134
				3318	13) 1568 64 tty_poll+0x64/0x7f
				3319	14) 1504 880 do_select+0x31e/0x511
				3320	15) 624 400 core_sys_select+0x177/0x216
				3321	16) 224 96 sys_select+0x91/0xb9
				3322	17) 128 128 system_call_fastpath+0x16/0x1b
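
In this listing the Depth column is the stack in use at that function and
the Size column is the function's own frame, so each Depth equals the next
entry's Depth plus its own Size (2928 = 2704 + 224). To find the functions
that use the most stack, the listing can be sorted by the Size column. The
helper below is a minimal sketch; the name largest_frames and the default
of three entries are hypothetical.
::

    # Print the N largest frames from stack_trace-formatted input on stdin.
    # $1: how many frames to print (assumed default: 3)
    largest_frames() {
        # Keep only entry lines such as "14) 1504 880 do_select+0x31e/0x511"
        # and print "Size Location", largest first.
        awk '$1 ~ /^[0-9]+\)$/ { print $3, $4 }' | sort -rn | head -n "${1:-3}"
    }

Run as "largest_frames 3 < stack_trace" on the output above, it would list
do_select (880), core_sys_select (400) and load_balance (256).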
				3323
				3324	Note, if -mfentry is being used by gcc, functions get traced before
				3325	they set up the stack frame. This means that leaf level functions
				3326	are not tested by the stack tracer when -mfentry is used.
				3327
				3328	Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
				3329
				3330	More
				3331	----
				3332	More details can be found in the source code, in the `kernel/trace/*.c` files.