ftrace - Function Tracer
| 2 | ======================== |
| 3 | |
| 4 | Copyright 2008 Red Hat Inc. |
| 5 | Author: Steven Rostedt <srostedt@redhat.com> |
| 6 | |
| 7 | |
| 8 | Introduction |
| 9 | ------------ |
| 10 | |
Ftrace is an internal tracer designed to help developers and system
designers find out what is going on inside the kernel. It can be used
for debugging or analyzing latencies and performance issues that take
place outside of user-space.
| 15 | |
Although ftrace is the function tracer, it also includes an
infrastructure that allows for other types of tracing. The tracers
currently included in ftrace can trace context switches, the time it
takes for a high priority task to run after it was woken up, the
length of time interrupts are disabled, and more.
| 22 | |
| 23 | |
| 24 | The File System |
| 25 | --------------- |
| 26 | |
| 27 | Ftrace uses the debugfs file system to hold the control files as well |
| 28 | as the files to display output. |
| 29 | |
| 30 | To mount the debugfs system: |
| 31 | |
| 32 | # mkdir /debug |
| 33 | # mount -t debugfs nodev /debug |
| 34 | |
| 35 | |
| 36 | That's it! (assuming that you have ftrace configured into your kernel) |
| 37 | |
| 38 | After mounting the debugfs, you can see a directory called |
| 39 | "tracing". This directory contains the control and output files |
| 40 | of ftrace. Here is a list of some of the key files: |
| 41 | |
| 42 | |
| 43 | Note: all time values are in microseconds. |
| 44 | |
| 45 | current_tracer : This is used to set or display the current tracer |
| 46 | that is configured. |
| 47 | |
  available_tracers : This holds the different types of tracers that
                have been compiled into the kernel. The tracers
                listed here can be configured by echoing their
                names into current_tracer.
| 52 | |
| 53 | tracing_enabled : This sets or displays whether the current_tracer |
| 54 | is activated and tracing or not. Echo 0 into this |
| 55 | file to disable the tracer or 1 (or non-zero) to |
| 56 | enable it. |
| 57 | |
| 58 | trace : This file holds the output of the trace in a human readable |
| 59 | format. |
| 60 | |
| 61 | latency_trace : This file shows the same trace but the information |
| 62 | is organized more to display possible latencies |
| 63 | in the system. |
| 64 | |
| 65 | trace_pipe : The output is the same as the "trace" file but this |
| 66 | file is meant to be streamed with live tracing. |
| 67 | Reads from this file will block until new data |
| 68 | is retrieved. Unlike the "trace" and "latency_trace" |
| 69 | files, this file is a consumer. This means reading |
| 70 | from this file causes sequential reads to display |
| 71 | more current data. Once data is read from this |
| 72 | file, it is consumed, and will not be read |
| 73 | again with a sequential read. The "trace" and |
| 74 | "latency_trace" files are static, and if the |
| 75 | tracer isn't adding more data, they will display |
| 76 | the same information every time they are read. |
| 77 | |
| 78 | iter_ctrl : This file lets the user control the amount of data |
| 79 | that is displayed in one of the above output |
| 80 | files. |
| 81 | |
  tracing_max_latency : Some of the tracers record the max latency.
                For example, the maximum time that interrupts are
                disabled. This time is saved in this file. The max
                trace will also be stored, and displayed by either
                "trace" or "latency_trace". A new max trace will
                only be recorded if the latency is greater than
                the value in this file. (in microseconds)
| 89 | |
  trace_entries : This sets or displays the number of trace
                entries each CPU buffer can hold. The tracer buffers
                are the same size for each CPU, so care must be
                taken when modifying trace_entries. The number
                of actual entries will be the number given
                times the number of possible CPUs. The buffers
                are saved as individual pages, and the actual number
                of entries will always be rounded up to fill whole
                pages.
| 98 | |
| 99 | This can only be updated when the current_tracer |
| 100 | is set to "none". |
| 101 | |
                NOTE: It is planned to change the allocated buffers
                      from being based on the number of possible CPUs
                      to the number of online CPUs.
| 105 | |
  tracing_cpumask : This is a mask that lets the user trace only
                on specified CPUs. The format is a hex string
                representing the CPUs. (See the example following
                this list of files.)
| 109 | |
  set_ftrace_filter : When dynamic ftrace is configured in, the
                code is dynamically modified to disable the calling
                of the function profiler (mcount). This lets
                tracing be configured in with practically no overhead
                in performance. This also has a side effect of
                enabling or disabling specific functions to be
                traced. Echoing names of functions into this
                file will limit the trace to only those functions.
| 118 | |
  set_ftrace_notrace: This has the opposite effect of
                set_ftrace_filter. Any function that is added
                here will not be traced. If a function exists
                in both set_ftrace_filter and set_ftrace_notrace,
                the function will _not_ be traced.
| 124 | |
| 125 | available_filter_functions : When a function is encountered the first |
| 126 | time by the dynamic tracer, it is recorded and |
| 127 | later the call is converted into a nop. This file |
| 128 | lists the functions that have been recorded |
| 129 | by the dynamic tracer and these functions can |
| 130 | be used to set the ftrace filter by the above |
| 131 | "set_ftrace_filter" file. |
| 132 | |
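As a minimal example of the tracing_cpumask interface described above
(assuming a two-CPU system), echoing the hex mask "1" limits tracing to
CPU 0, and echoing "3" (bits 0 and 1 set) traces both CPUs:

 # echo 1 > /debug/tracing/tracing_cpumask
 # echo 3 > /debug/tracing/tracing_cpumask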
| 133 | |
| 134 | The Tracers |
| 135 | ----------- |
| 136 | |
Here is the list of current tracers that may be configured. An example
of selecting one of them follows the list.
| 138 | |
  ftrace - function tracer that uses mcount to trace all functions.
           It is possible to filter which functions are traced
           when dynamic ftrace is configured in.
| 142 | |
| 143 | sched_switch - traces the context switches between tasks. |
| 144 | |
| 145 | irqsoff - traces the areas that disable interrupts and saves off |
| 146 | the trace with the longest max latency. |
| 147 | See tracing_max_latency. When a new max is recorded, |
| 148 | it replaces the old trace. It is best to view this |
| 149 | trace with the latency_trace file. |
| 150 | |
| 151 | preemptoff - Similar to irqsoff but traces and records the time |
| 152 | preemption is disabled. |
| 153 | |
| 154 | preemptirqsoff - Similar to irqsoff and preemptoff, but traces and |
| 155 | records the largest time irqs and/or preemption is |
| 156 | disabled. |
| 157 | |
| 158 | wakeup - Traces and records the max latency that it takes for |
| 159 | the highest priority task to get scheduled after |
| 160 | it has been woken up. |
| 161 | |
| 162 | none - This is not a tracer. To remove all tracers from tracing |
| 163 | simply echo "none" into current_tracer. |
| 164 | |
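To see which of these tracers have been compiled into the running
kernel and to select one, use the available_tracers and current_tracer
files. The listing below is only an illustration; the exact set and
order depend on the kernel configuration:

 # cat /debug/tracing/available_tracers
 wakeup preemptirqsoff preemptoff irqsoff ftrace sched_switch none
 # echo sched_switch > /debug/tracing/current_tracer
 # cat /debug/tracing/current_tracer
 sched_switch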
| 165 | |
| 166 | Examples of using the tracer |
| 167 | ---------------------------- |
| 168 | |
Here are typical examples of using the tracers, controlling them only
through the debugfs interface (without using any user-land utilities).
| 171 | |
| 172 | Output format: |
| 173 | -------------- |
| 174 | |
| 175 | Here's an example of the output format of the file "trace" |
| 176 | |
| 177 | -------- |
| 178 | # tracer: ftrace |
| 179 | # |
| 180 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 181 | # | | | | | |
| 182 | bash-4251 [01] 10152.583854: path_put <-path_walk |
| 183 | bash-4251 [01] 10152.583855: dput <-path_put |
| 184 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput |
| 185 | -------- |
| 186 | |
A header is printed showing which tracer produced the trace; in this
case it is "ftrace". Then a header describes the format: the task name
"bash", the task PID "4251", the CPU it was running on "01", the
timestamp in <secs>.<usecs> format, the function name that was traced
"path_put", and the parent function that called it "path_walk".
| 193 | |
| 194 | The sched_switch tracer also includes tracing of task wake ups and |
| 195 | context switches. |
| 196 | |
| 197 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S |
| 198 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S |
| 199 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R |
| 200 | events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R |
| 201 | kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R |
| 202 | ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R |
| 203 | |
| 204 | Wake ups are represented by a "+" and the context switches show |
| 205 | "==>". The format is: |
| 206 | |
| 207 | Context switches: |
| 208 | |
| 209 | Previous task Next Task |
| 210 | |
| 211 | <pid>:<prio>:<state> ==> <pid>:<prio>:<state> |
| 212 | |
| 213 | Wake ups: |
| 214 | |
| 215 | Current task Task waking up |
| 216 | |
| 217 | <pid>:<prio>:<state> + <pid>:<prio>:<state> |
| 218 | |
The prio is the internal kernel priority, which is the inverse of the
priority that is usually displayed by user-space tools. Zero represents
the highest priority (user RT priority 99). Prio 100 starts the "nice"
priorities, with 100 being equal to nice -20 and 139 being nice 19; for
example, the prio 115 seen above corresponds to nice -5. The prio "140"
is reserved for the idle task, which is the lowest priority thread
(pid 0).
| 224 | |
| 225 | |
| 226 | Latency trace format |
| 227 | -------------------- |
| 228 | |
| 229 | For traces that display latency times, the latency_trace file gives |
| 230 | a bit more information to see why a latency happened. Here's a typical |
| 231 | trace. |
| 232 | |
| 233 | # tracer: irqsoff |
| 234 | # |
| 235 | irqsoff latency trace v1.1.5 on 2.6.26-rc8 |
| 236 | -------------------------------------------------------------------- |
| 237 | latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 238 | ----------------- |
| 239 | | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) |
| 240 | ----------------- |
| 241 | => started at: apic_timer_interrupt |
| 242 | => ended at: do_softirq |
| 243 | |
| 244 | # _------=> CPU# |
| 245 | # / _-----=> irqs-off |
| 246 | # | / _----=> need-resched |
| 247 | # || / _---=> hardirq/softirq |
| 248 | # ||| / _--=> preempt-depth |
| 249 | # |||| / |
| 250 | # ||||| delay |
| 251 | # cmd pid ||||| time | caller |
| 252 | # \ / ||||| \ | / |
| 253 | <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) |
| 254 | <idle>-0 0d.s. 97us : __do_softirq (do_softirq) |
| 255 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) |
| 256 | |
| 257 | |
| 258 | vim:ft=help |
| 259 | |
| 260 | |
This shows that the current tracer is "irqsoff", tracing the time
interrupts are disabled. It gives the trace version and the kernel
this was executed on (2.6.26-rc8). Then it displays the max latency
in microseconds (97 us), the number of trace entries displayed and
the total number recorded (both are three: #3/3), and the type of
preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
and are reserved for later use. #P is the number of online CPUs (#P:2).
| 268 | |
| 269 | The task is the process that was running when the latency happened. |
| 270 | (swapper pid: 0). |
| 271 | |
The start and stop points that caused the latency:
| 273 | |
| 274 | apic_timer_interrupt is where the interrupts were disabled. |
| 275 | do_softirq is where they were enabled again. |
| 276 | |
| 277 | The next lines after the header are the trace itself. The header |
| 278 | explains which is which. |
| 279 | |
| 280 | cmd: The name of the process in the trace. |
| 281 | |
| 282 | pid: The PID of that process. |
| 283 | |
| 284 | CPU#: The CPU that the process was running on. |
| 285 | |
| 286 | irqs-off: 'd' interrupts are disabled. '.' otherwise. |
| 287 | |
| 288 | need-resched: 'N' task need_resched is set, '.' otherwise. |
| 289 | |
| 290 | hardirq/softirq: |
| 291 | 'H' - hard irq happened inside a softirq. |
| 292 | 'h' - hard irq is running |
| 293 | 's' - soft irq is running |
| 294 | '.' - normal context. |
| 295 | |
  preempt-depth: The nesting level of preempt_disable().
| 297 | |
| 298 | The above is mostly meaningful for kernel developers. |
| 299 | |
  time: This differs from the trace output, which contained an
        absolute timestamp. This timestamp is relative to the
        start of the first entry in the trace.
| 303 | |
  delay: This is just to help catch your eye a bit better (and
         needs to be fixed to be only relative to the same CPU).
         The mark is determined by the difference between this
         current trace entry and the next one:
          '!' - greater than preempt_mark_thresh (default 100)
          '+' - greater than 1 microsecond
          ' ' - less than or equal to 1 microsecond.
| 311 | |
| 312 | The rest is the same as the 'trace' file. |
| 313 | |
| 314 | |
| 315 | iter_ctrl |
| 316 | --------- |
| 317 | |
| 318 | The iter_ctrl file is used to control what gets printed in the trace |
| 319 | output. To see what is available, simply cat the file: |
| 320 | |
| 321 | cat /debug/tracing/iter_ctrl |
| 322 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ |
| 323 | noblock nostacktrace nosched-tree |
| 324 | |
To disable one of the options, echo in the option prepended with "no".

 echo noprint-parent > /debug/tracing/iter_ctrl

To enable an option, leave off the "no".

 echo sym-offset > /debug/tracing/iter_ctrl
| 332 | |
| 333 | Here are the available options: |
| 334 | |
| 335 | print-parent - On function traces, display the calling function |
| 336 | as well as the function being traced. |
| 337 | |
| 338 | print-parent: |
| 339 | bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul |
| 340 | |
| 341 | noprint-parent: |
| 342 | bash-4000 [01] 1477.606694: simple_strtoul |
| 343 | |
| 344 | |
| 345 | sym-offset - Display not only the function name, but also the offset |
| 346 | in the function. For example, instead of seeing just |
| 347 | "ktime_get" you will see "ktime_get+0xb/0x20" |
| 348 | |
| 349 | sym-offset: |
| 350 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 |
| 351 | |
  sym-addr - This will display the function address as well as
             the function name.
| 354 | |
| 355 | sym-addr: |
| 356 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> |
| 357 | |
| 358 | verbose - This deals with the latency_trace file. |
| 359 | |
| 360 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ |
| 361 | (+0.000ms): simple_strtoul (strict_strtoul) |
| 362 | |
| 363 | raw - This will display raw numbers. This option is best for use with |
| 364 | user applications that can translate the raw numbers better than |
| 365 | having it done in the kernel. |
| 366 | |
| 367 | hex - similar to raw, but the numbers will be in a hexadecimal format. |
| 368 | |
| 369 | bin - This will print out the formats in raw binary. |
| 370 | |
| 371 | block - TBD (needs update) |
| 372 | |
| 373 | stacktrace - This is one of the options that changes the trace itself. |
| 374 | When a trace is recorded, so is the stack of functions. |
| 375 | This allows for back traces of trace sites. |
| 376 | |
| 377 | sched-tree - TBD (any users??) |
| 378 | |
| 379 | |
| 380 | sched_switch |
| 381 | ------------ |
| 382 | |
This tracer simply records schedule switches. Here's an example
of how to use it:
| 385 | |
| 386 | # echo sched_switch > /debug/tracing/current_tracer |
| 387 | # echo 1 > /debug/tracing/tracing_enabled |
| 388 | # sleep 1 |
| 389 | # echo 0 > /debug/tracing/tracing_enabled |
| 390 | # cat /debug/tracing/trace |
| 391 | |
| 392 | # tracer: sched_switch |
| 393 | # |
| 394 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 395 | # | | | | | |
| 396 | bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R |
| 397 | bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R |
| 398 | sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R |
| 399 | bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S |
| 400 | bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R |
| 401 | sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R |
| 402 | bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D |
| 403 | bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R |
| 404 | <idle>-0 [00] 240.132589: 0:140:R + 4:115:S |
| 405 | <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R |
| 406 | ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R |
| 407 | <idle>-0 [00] 240.132598: 0:140:R + 4:115:S |
| 408 | <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R |
| 409 | ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R |
| 410 | sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R |
| 411 | [...] |
| 412 | |
| 413 | |
As we have discussed previously, the header shows the name of the
tracer and the meaning of the columns. The "FUNCTION" column
is a misnomer here, since it represents the wake ups and context
switches.
| 418 | |
The sched_switch tracer only lists the wake ups (represented with '+')
and context switches ('==>'), with the previous task or current task
first, followed by the next task or the task waking up. The format for
both of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the
KERNEL-PRIO is the inverse of the actual priority, with zero (0) being
the highest priority and the nice values starting at 100 (nice -20).
Below is a quick chart to map the kernel priority to user land
priorities:
| 426 | |
| 427 | Kernel priority: 0 to 99 ==> user RT priority 99 to 0 |
| 428 | Kernel priority: 100 to 139 ==> user nice -20 to 19 |
| 429 | Kernel priority: 140 ==> idle task priority |
| 430 | |
| 431 | The task states are: |
| 432 | |
| 433 | R - running : wants to run, may not actually be running |
| 434 | S - sleep : process is waiting to be woken up (handles signals) |
| 435 | D - deep sleep : process must be woken up (ignores signals) |
| 436 | T - stopped : process suspended |
| 437 | t - traced : process is being traced (with something like gdb) |
| 438 | Z - zombie : process waiting to be cleaned up |
| 439 | X - unknown |
| 440 | |
| 441 | |
| 442 | ftrace_enabled |
| 443 | -------------- |
| 444 | |
The following tracers give different output depending on whether
or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
one can either use the sysctl command or set it via the proc
file system interface:
| 449 | |
| 450 | sysctl kernel.ftrace_enabled=1 |
| 451 | |
| 452 | or |
| 453 | |
| 454 | echo 1 > /proc/sys/kernel/ftrace_enabled |
| 455 | |
| 456 | To disable ftrace_enabled simply replace the '1' with '0' in |
| 457 | the above commands. |
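
The current setting can be read back through either interface; for
example (the output below assumes ftrace_enabled was just set to 1):

 # cat /proc/sys/kernel/ftrace_enabled
 1
 # sysctl kernel.ftrace_enabled
 kernel.ftrace_enabled = 1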
| 458 | |
When ftrace_enabled is set, the tracers will also record the functions
that are executed within the trace. The descriptions of the tracers
below will also show an example with ftrace_enabled set.
| 462 | |
| 463 | |
| 464 | irqsoff |
| 465 | ------- |
| 466 | |
When interrupts are disabled, the CPU can not react to any other
external event (besides NMIs and SMIs). This prevents the timer
interrupt from triggering or the mouse interrupt from letting the
kernel know of a new mouse event. The result is added latency in
reaction time.
| 472 | |
The irqsoff tracer tracks the time interrupts are disabled and when
they are re-enabled. When a new maximum latency is hit, it saves off
the trace so that it may be retrieved at a later time. Every time a
new maximum is reached, the old saved trace is discarded and the new
trace is saved.
| 478 | |
| 479 | To reset the maximum, echo 0 into tracing_max_latency. Here's an |
| 480 | example: |
| 481 | |
| 482 | # echo irqsoff > /debug/tracing/current_tracer |
| 483 | # echo 0 > /debug/tracing/tracing_max_latency |
| 484 | # echo 1 > /debug/tracing/tracing_enabled |
| 485 | # ls -ltr |
| 486 | [...] |
| 487 | # echo 0 > /debug/tracing/tracing_enabled |
| 488 | # cat /debug/tracing/latency_trace |
| 489 | # tracer: irqsoff |
| 490 | # |
| 491 | irqsoff latency trace v1.1.5 on 2.6.26-rc8 |
| 492 | -------------------------------------------------------------------- |
| 493 | latency: 6 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 494 | ----------------- |
| 495 | | task: bash-4269 (uid:0 nice:0 policy:0 rt_prio:0) |
| 496 | ----------------- |
| 497 | => started at: copy_page_range |
| 498 | => ended at: copy_page_range |
| 499 | |
| 500 | # _------=> CPU# |
| 501 | # / _-----=> irqs-off |
| 502 | # | / _----=> need-resched |
| 503 | # || / _---=> hardirq/softirq |
| 504 | # ||| / _--=> preempt-depth |
| 505 | # |||| / |
| 506 | # ||||| delay |
| 507 | # cmd pid ||||| time | caller |
| 508 | # \ / ||||| \ | / |
| 509 | bash-4269 1...1 0us+: _spin_lock (copy_page_range) |
| 510 | bash-4269 1...1 7us : _spin_unlock (copy_page_range) |
| 511 | bash-4269 1...2 7us : trace_preempt_on (copy_page_range) |
| 512 | |
| 513 | |
| 514 | vim:ft=help |
| 515 | |
Here we see that we had a latency of 6 microseconds (which is
very good). The spin_lock in copy_page_range disabled interrupts.
The difference between the 6 and the displayed timestamp 7us is
because the clock must have incremented between the time of recording
the max latency and recording the function that had that latency.
| 521 | |
Note that the above example had ftrace_enabled not set. If we set
ftrace_enabled, we get a much larger output:
| 524 | |
| 525 | # tracer: irqsoff |
| 526 | # |
| 527 | irqsoff latency trace v1.1.5 on 2.6.26-rc8 |
| 528 | -------------------------------------------------------------------- |
| 529 | latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 530 | ----------------- |
| 531 | | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) |
| 532 | ----------------- |
| 533 | => started at: __alloc_pages_internal |
| 534 | => ended at: __alloc_pages_internal |
| 535 | |
| 536 | # _------=> CPU# |
| 537 | # / _-----=> irqs-off |
| 538 | # | / _----=> need-resched |
| 539 | # || / _---=> hardirq/softirq |
| 540 | # ||| / _--=> preempt-depth |
| 541 | # |||| / |
| 542 | # ||||| delay |
| 543 | # cmd pid ||||| time | caller |
| 544 | # \ / ||||| \ | / |
| 545 | ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) |
| 546 | ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) |
| 547 | ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) |
| 548 | ls-4339 0d..1 4us : add_preempt_count (_spin_lock) |
| 549 | ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) |
| 550 | ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) |
| 551 | ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) |
| 552 | ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) |
| 553 | ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) |
| 554 | ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) |
| 555 | ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) |
| 556 | ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) |
| 557 | [...] |
| 558 | ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) |
| 559 | ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) |
| 560 | ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) |
| 561 | ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) |
| 562 | ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) |
| 563 | ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) |
| 564 | ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) |
| 565 | ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) |
| 566 | ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) |
| 567 | |
| 568 | |
| 569 | vim:ft=help |
| 570 | |
| 571 | |
Here we traced a 50 microsecond latency. But we also see all the
functions that were called during that time. Note that by enabling
function tracing, we incur an added overhead. This overhead may
extend the latency times. But nevertheless, this trace has provided
some very helpful debugging information.
| 577 | |
| 578 | |
| 579 | preemptoff |
| 580 | ---------- |
| 581 | |
| 582 | When preemption is disabled we may be able to receive interrupts but |
| 583 | the task can not be preempted and a higher priority task must wait |
| 584 | for preemption to be enabled again before it can preempt a lower |
| 585 | priority task. |
| 586 | |
The preemptoff tracer traces the places that disable preemption.
Like the irqsoff tracer, it records the maximum latency for which
preemption was disabled. The control of the preemptoff tracer is much
like that of the irqsoff tracer.
| 590 | |
| 591 | # echo preemptoff > /debug/tracing/current_tracer |
| 592 | # echo 0 > /debug/tracing/tracing_max_latency |
| 593 | # echo 1 > /debug/tracing/tracing_enabled |
| 594 | # ls -ltr |
| 595 | [...] |
| 596 | # echo 0 > /debug/tracing/tracing_enabled |
| 597 | # cat /debug/tracing/latency_trace |
| 598 | # tracer: preemptoff |
| 599 | # |
| 600 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 |
| 601 | -------------------------------------------------------------------- |
| 602 | latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 603 | ----------------- |
| 604 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) |
| 605 | ----------------- |
| 606 | => started at: do_IRQ |
| 607 | => ended at: __do_softirq |
| 608 | |
| 609 | # _------=> CPU# |
| 610 | # / _-----=> irqs-off |
| 611 | # | / _----=> need-resched |
| 612 | # || / _---=> hardirq/softirq |
| 613 | # ||| / _--=> preempt-depth |
| 614 | # |||| / |
| 615 | # ||||| delay |
| 616 | # cmd pid ||||| time | caller |
| 617 | # \ / ||||| \ | / |
| 618 | sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) |
| 619 | sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) |
| 620 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) |
| 621 | |
| 622 | |
| 623 | vim:ft=help |
| 624 | |
This has some more changes. Preemption was disabled when an interrupt
came in (notice the 'h'), and was enabled again while doing a softirq
(notice the 's'). But we also see that interrupts have been disabled
when entering the preempt off section and when leaving it (the 'd').
We do not know if interrupts were enabled in the meantime.
| 630 | |
| 631 | # tracer: preemptoff |
| 632 | # |
| 633 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 |
| 634 | -------------------------------------------------------------------- |
| 635 | latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 636 | ----------------- |
| 637 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) |
| 638 | ----------------- |
| 639 | => started at: remove_wait_queue |
| 640 | => ended at: __do_softirq |
| 641 | |
| 642 | # _------=> CPU# |
| 643 | # / _-----=> irqs-off |
| 644 | # | / _----=> need-resched |
| 645 | # || / _---=> hardirq/softirq |
| 646 | # ||| / _--=> preempt-depth |
| 647 | # |||| / |
| 648 | # ||||| delay |
| 649 | # cmd pid ||||| time | caller |
| 650 | # \ / ||||| \ | / |
| 651 | sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) |
| 652 | sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) |
| 653 | sshd-4261 0d..1 2us : do_IRQ (common_interrupt) |
| 654 | sshd-4261 0d..1 2us : irq_enter (do_IRQ) |
| 655 | sshd-4261 0d..1 2us : idle_cpu (irq_enter) |
| 656 | sshd-4261 0d..1 3us : add_preempt_count (irq_enter) |
| 657 | sshd-4261 0d.h1 3us : idle_cpu (irq_enter) |
| 658 | sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) |
| 659 | [...] |
| 660 | sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) |
| 661 | sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) |
| 662 | sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) |
| 663 | sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) |
| 664 | sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) |
| 665 | sshd-4261 0d.h1 14us : irq_exit (do_IRQ) |
| 666 | sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) |
| 667 | sshd-4261 0d..2 15us : do_softirq (irq_exit) |
| 668 | sshd-4261 0d... 15us : __do_softirq (do_softirq) |
| 669 | sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) |
| 670 | sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) |
| 671 | sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) |
| 672 | sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) |
| 673 | sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) |
| 674 | [...] |
| 675 | sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) |
| 676 | sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) |
| 677 | sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) |
| 678 | sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) |
| 679 | sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) |
| 680 | sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) |
| 681 | sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) |
| 682 | sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) |
| 683 | [...] |
| 684 | sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) |
| 685 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) |
| 686 | |
| 687 | |
| 688 | The above is an example of the preemptoff trace with ftrace_enabled |
| 689 | set. Here we see that interrupts were disabled the entire time. |
| 690 | The irq_enter code lets us know that we entered an interrupt 'h'. |
| 691 | Before that, the functions being traced still show that it is not |
| 692 | in an interrupt, but we can see by the functions themselves that |
| 693 | this is not the case. |
| 694 | |
Notice that __do_softirq, when called, does not have a preempt_count.
It may seem that we missed a preempt enable. What really happened
is that the preempt count is held on the thread's stack and we
switched to the softirq stack (4K stacks in effect). The code
does not copy the preempt count, but because interrupts are disabled,
we do not need to worry about it. Having a tracer like this is good
for letting people know what really happens inside the kernel.
| 702 | |
| 703 | |
| 704 | preemptirqsoff |
| 705 | -------------- |
| 706 | |
| 707 | Knowing the locations that have interrupts disabled or preemption |
| 708 | disabled for the longest times is helpful. But sometimes we would |
| 709 | like to know when either preemption and/or interrupts are disabled. |
| 710 | |
| 711 | The following code: |
| 712 | |
| 713 | local_irq_disable(); |
| 714 | call_function_with_irqs_off(); |
| 715 | preempt_disable(); |
| 716 | call_function_with_irqs_and_preemption_off(); |
| 717 | local_irq_enable(); |
| 718 | call_function_with_preemption_off(); |
| 719 | preempt_enable(); |
| 720 | |
| 721 | The irqsoff tracer will record the total length of |
| 722 | call_function_with_irqs_off() and |
| 723 | call_function_with_irqs_and_preemption_off(). |
| 724 | |
| 725 | The preemptoff tracer will record the total length of |
| 726 | call_function_with_irqs_and_preemption_off() and |
| 727 | call_function_with_preemption_off(). |
| 728 | |
But neither will trace the total time that interrupts and/or preemption
are disabled. This total time is the time during which we can not
schedule. To record this time, use the preemptirqsoff tracer.
| 732 | |
| 733 | Again, using this trace is much like the irqsoff and preemptoff tracers. |
| 734 | |
 # echo preemptirqsoff > /debug/tracing/current_tracer
| 736 | # echo 0 > /debug/tracing/tracing_max_latency |
| 737 | # echo 1 > /debug/tracing/tracing_enabled |
| 738 | # ls -ltr |
| 739 | [...] |
| 740 | # echo 0 > /debug/tracing/tracing_enabled |
| 741 | # cat /debug/tracing/latency_trace |
| 742 | # tracer: preemptirqsoff |
| 743 | # |
| 744 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 |
| 745 | -------------------------------------------------------------------- |
| 746 | latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 747 | ----------------- |
| 748 | | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) |
| 749 | ----------------- |
| 750 | => started at: apic_timer_interrupt |
| 751 | => ended at: __do_softirq |
| 752 | |
| 753 | # _------=> CPU# |
| 754 | # / _-----=> irqs-off |
| 755 | # | / _----=> need-resched |
| 756 | # || / _---=> hardirq/softirq |
| 757 | # ||| / _--=> preempt-depth |
| 758 | # |||| / |
| 759 | # ||||| delay |
| 760 | # cmd pid ||||| time | caller |
| 761 | # \ / ||||| \ | / |
| 762 | ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) |
| 763 | ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) |
| 764 | ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) |
| 765 | |
| 766 | |
| 767 | vim:ft=help |
| 768 | |
| 769 | |
| 770 | The trace_hardirqs_off_thunk is called from assembly on x86 when |
| 771 | interrupts are disabled in the assembly code. Without the function |
| 772 | tracing, we don't know if interrupts were enabled within the preemption |
| 773 | points. We do see that it started with preemption enabled. |
| 774 | |
| 775 | Here is a trace with ftrace_enabled set: |
| 776 | |
| 777 | |
| 778 | # tracer: preemptirqsoff |
| 779 | # |
| 780 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 |
| 781 | -------------------------------------------------------------------- |
| 782 | latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 783 | ----------------- |
| 784 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) |
| 785 | ----------------- |
| 786 | => started at: write_chan |
| 787 | => ended at: __do_softirq |
| 788 | |
| 789 | # _------=> CPU# |
| 790 | # / _-----=> irqs-off |
| 791 | # | / _----=> need-resched |
| 792 | # || / _---=> hardirq/softirq |
| 793 | # ||| / _--=> preempt-depth |
| 794 | # |||| / |
| 795 | # ||||| delay |
| 796 | # cmd pid ||||| time | caller |
| 797 | # \ / ||||| \ | / |
| 798 | ls-4473 0.N.. 0us : preempt_schedule (write_chan) |
| 799 | ls-4473 0dN.1 1us : _spin_lock (schedule) |
| 800 | ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) |
| 801 | ls-4473 0d..2 2us : put_prev_task_fair (schedule) |
| 802 | [...] |
| 803 | ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) |
| 804 | ls-4473 0d..2 13us : __switch_to (schedule) |
| 805 | sshd-4261 0d..2 14us : finish_task_switch (schedule) |
| 806 | sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) |
| 807 | sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) |
| 808 | sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) |
| 809 | sshd-4261 0d..2 16us : do_IRQ (common_interrupt) |
| 810 | sshd-4261 0d..2 17us : irq_enter (do_IRQ) |
| 811 | sshd-4261 0d..2 17us : idle_cpu (irq_enter) |
| 812 | sshd-4261 0d..2 18us : add_preempt_count (irq_enter) |
| 813 | sshd-4261 0d.h2 18us : idle_cpu (irq_enter) |
| 814 | sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) |
| 815 | sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) |
| 816 | sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) |
| 817 | sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) |
| 818 | sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) |
| 819 | [...] |
| 820 | sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) |
| 821 | sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) |
| 822 | sshd-4261 0d.h2 29us : irq_exit (do_IRQ) |
| 823 | sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) |
| 824 | sshd-4261 0d..3 30us : do_softirq (irq_exit) |
| 825 | sshd-4261 0d... 30us : __do_softirq (do_softirq) |
| 826 | sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) |
| 827 | sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) |
| 828 | sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) |
| 829 | [...] |
| 830 | sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) |
| 831 | sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) |
| 832 | sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) |
| 833 | sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) |
| 834 | sshd-4261 0d.s3 45us : idle_cpu (irq_enter) |
| 835 | sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) |
| 836 | sshd-4261 0d.H3 46us : idle_cpu (irq_enter) |
| 837 | sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) |
| 838 | sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) |
| 839 | [...] |
| 840 | sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) |
| 841 | sshd-4261 0d.H3 82us : ktime_get (tick_program_event) |
| 842 | sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) |
| 843 | sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) |
| 844 | sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) |
| 845 | sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) |
| 846 | sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) |
| 847 | sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) |
| 848 | sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) |
| 849 | sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) |
| 850 | sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) |
| 851 | [...] |
| 852 | sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) |
| 853 | sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) |
| 854 | sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) |
| 855 | sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) |
| 856 | sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) |
| 857 | sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) |
| 858 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) |
| 859 | |
| 860 | |
| 861 | This is a very interesting trace. It started with the preemption of |
| 862 | the ls task. We see that the task had the "need_resched" bit set |
| 863 | with the 'N' in the trace. Interrupts are disabled in the spin_lock |
| 864 | and the trace started. We see that a schedule took place to run |
| 865 | sshd. When the interrupts were enabled we took an interrupt. |
| 866 | On return of the interrupt the softirq ran. We took another interrupt |
| 867 | while running the softirq as we see with the capital 'H'. |
| 868 | |
| 869 | |
| 870 | wakeup |
| 871 | ------ |
| 872 | |
In a Real-Time environment it is very important to know the wakeup
time of the highest priority task, that is, the time from when the task
is woken up to the time it starts executing. This is also known as
"schedule latency". I stress the point that this is about RT tasks. It
is also important to know the scheduling latency of non-RT tasks, but
the average schedule latency is better for non-RT tasks. Tools like
LatencyTop are more appropriate for such measurements.
| 880 | |
Real-Time environments are interested in the worst case latency.
That is the longest latency it takes for something to happen, and
not the average. We can have a very fast scheduler that may only
have a large latency once in a while, but that would not work well
with Real-Time tasks. The wakeup tracer was designed to record
the worst case wakeups of RT tasks. Non-RT tasks are not recorded
because the tracer only records one worst case, and tracing non-RT
tasks that are unpredictable will overwrite the worst case latency
of RT tasks.
| 890 | |
Since this tracer only deals with RT tasks, we will run it slightly
differently than we did with the previous tracers. Instead of
performing an 'ls', we will run 'sleep 1' under 'chrt', which changes
the priority of the task.
| 895 | |
| 896 | # echo wakeup > /debug/tracing/current_tracer |
| 897 | # echo 0 > /debug/tracing/tracing_max_latency |
| 898 | # echo 1 > /debug/tracing/tracing_enabled |
| 899 | # chrt -f 5 sleep 1 |
| 900 | # echo 0 > /debug/tracing/tracing_enabled |
| 901 | # cat /debug/tracing/latency_trace |
| 902 | # tracer: wakeup |
| 903 | # |
| 904 | wakeup latency trace v1.1.5 on 2.6.26-rc8 |
| 905 | -------------------------------------------------------------------- |
| 906 | latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 907 | ----------------- |
| 908 | | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) |
| 909 | ----------------- |
| 910 | |
| 911 | # _------=> CPU# |
| 912 | # / _-----=> irqs-off |
| 913 | # | / _----=> need-resched |
| 914 | # || / _---=> hardirq/softirq |
| 915 | # ||| / _--=> preempt-depth |
| 916 | # |||| / |
| 917 | # ||||| delay |
| 918 | # cmd pid ||||| time | caller |
| 919 | # \ / ||||| \ | / |
| 920 | <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) |
| 921 | <idle>-0 1d..4 4us : schedule (cpu_idle) |
| 922 | |
| 923 | |
| 924 | vim:ft=help |
| 925 | |
| 926 | |
Running this on an idle system, we see that it only took 4 microseconds
to perform the task switch. Note, since the trace marker in the
scheduler is before the actual "switch", we stop the tracing when
the recorded task is about to schedule in. This may change if
we add a new marker at the end of the scheduler.
| 932 | |
Notice that the recorded task is 'sleep' with a PID of 4901, and it
has an rt_prio of 5. This priority is the user-space priority and not
the internal kernel priority. The policy is 1 for SCHED_FIFO and 2
for SCHED_RR.

Doing the same with 'chrt -r 5' and ftrace_enabled set produces the
following:
| 939 | |
| 940 | # tracer: wakeup |
| 941 | # |
| 942 | wakeup latency trace v1.1.5 on 2.6.26-rc8 |
| 943 | -------------------------------------------------------------------- |
| 944 | latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) |
| 945 | ----------------- |
| 946 | | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) |
| 947 | ----------------- |
| 948 | |
| 949 | # _------=> CPU# |
| 950 | # / _-----=> irqs-off |
| 951 | # | / _----=> need-resched |
| 952 | # || / _---=> hardirq/softirq |
| 953 | # ||| / _--=> preempt-depth |
| 954 | # |||| / |
| 955 | # ||||| delay |
| 956 | # cmd pid ||||| time | caller |
| 957 | # \ / ||||| \ | / |
| 958 | ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) |
| 959 | ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) |
| 960 | ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) |
| 961 | ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) |
| 962 | ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) |
| 963 | ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) |
| 964 | ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) |
| 965 | ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) |
| 966 | [...] |
| 967 | ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) |
| 968 | ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) |
| 969 | ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) |
| 970 | ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) |
| 971 | [...] |
| 972 | ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) |
| 973 | ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) |
| 974 | ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) |
| 975 | ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) |
| 976 | ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) |
| 977 | ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) |
| 978 | ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) |
| 979 | ksoftirq-7 1.N.2 33us : schedule (__cond_resched) |
| 980 | ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) |
| 981 | ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) |
| 982 | ksoftirq-7 1dN.3 35us : _spin_lock (schedule) |
| 983 | ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) |
| 984 | ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) |
| 985 | ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) |
| 986 | [...] |
| 987 | ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) |
| 988 | ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) |
| 989 | ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) |
| 990 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) |
| 991 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) |
| 992 | |
The interrupt went off while running ksoftirqd. This task runs at
SCHED_OTHER. Why didn't we see the 'N' set early? This may be
a harmless bug with x86_32 and 4K stacks. The need_resched() function
that tests whether we need to reschedule looks at the current stack,
whereas the setting of the NEED_RESCHED bit happens on the
task's stack. But because we are in a hard interrupt, the test
is done against the interrupt stack, where that bit is false. We don't
see the 'N' until we switch back to the task's stack.
| 1001 | |
| 1002 | ftrace |
| 1003 | ------ |
| 1004 | |
ftrace is not only the name of the tracing infrastructure, but it
is also the name of one of the tracers: the function tracer. Enabling
the function tracer can be done from the debugfs file system. Make
sure the ftrace_enabled sysctl is set; otherwise this tracer is a nop.
| 1010 | |
| 1011 | # sysctl kernel.ftrace_enabled=1 |
| 1012 | # echo ftrace > /debug/tracing/current_tracer |
| 1013 | # echo 1 > /debug/tracing/tracing_enabled |
| 1014 | # usleep 1 |
| 1015 | # echo 0 > /debug/tracing/tracing_enabled |
| 1016 | # cat /debug/tracing/trace |
| 1017 | # tracer: ftrace |
| 1018 | # |
| 1019 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 1020 | # | | | | | |
| 1021 | bash-4003 [00] 123.638713: finish_task_switch <-schedule |
| 1022 | bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch |
| 1023 | bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq |
| 1024 | bash-4003 [00] 123.638715: hrtick_set <-schedule |
| 1025 | bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set |
| 1026 | bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave |
| 1027 | bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set |
| 1028 | bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore |
| 1029 | bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set |
| 1030 | bash-4003 [00] 123.638718: sub_preempt_count <-schedule |
| 1031 | bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule |
| 1032 | bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run |
| 1033 | bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion |
| 1034 | bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common |
| 1035 | bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq |
| 1036 | [...] |
| 1037 | |
| 1038 | |
Note: It is sometimes better to enable or disable tracing directly from
a program, because the buffer may be overflowed by the echo commands
before you get to the point you want to trace. It is also easier to
stop the tracing at the point that you hit the part that you are
interested in. Since the ftrace buffer is a ring buffer with the
oldest data being overwritten, it is usually sufficient to start the
tracer with an echo command but have your code stop it. Something
like the following is usually appropriate for this:
| 1047 | |
#include <fcntl.h>	/* for open() and O_WRONLY */
#include <unistd.h>	/* for write() */

int trace_fd;
[...]
int main(int argc, char *argv[]) {
	[...]
	/* keep the control file open for the life of the program */
	trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
	[...]
	if (condition_hit()) {
		/* stop the tracer right at the interesting point */
		write(trace_fd, "0", 1);
	}
	[...]
}
| 1059 | |
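A typical session using such a program (here called "my_test", a
hypothetical name) could look like the following, with the echo
starting the trace and the program itself stopping it when the
interesting condition is hit:

 # echo ftrace > /debug/tracing/current_tracer
 # echo 1 > /debug/tracing/tracing_enabled
 # ./my_test
 # cat /debug/tracing/trace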
| 1060 | |
| 1061 | dynamic ftrace |
| 1062 | -------------- |
| 1063 | |
If CONFIG_DYNAMIC_FTRACE is set, then the system will run with
virtually no overhead when function tracing is disabled. The way
this works is that the mcount function call (placed at the start of
every kernel function, produced by the -pg switch in gcc) starts
off pointing to a simple return.
| 1069 | |
When dynamic ftrace is initialized, it calls kstop_machine to make
the machine act like a uniprocessor so that it can freely modify code
without worrying about other processors executing that same code. At
initialization, the mcount calls are changed to call a "record_ip"
function. After this, the first time a kernel function is called,
it has its calling address saved in a hash table.
| 1076 | |
| 1077 | Later on the ftraced kernel thread is awoken and will again call |
| 1078 | kstop_machine if new functions have been recorded. The ftraced thread |
| 1079 | will change all calls to mcount to "nop". Just calling mcount |
| 1080 | and having mcount return has shown a 10% overhead. By converting |
| 1081 | it to a nop, there is no recordable overhead to the system. |
| 1082 | |
One special side-effect of recording the functions being traced
is that we can now selectively choose which functions we want to
trace and which ones we want the mcount calls to remain as nops.
| 1087 | |
Two files are used for enabling and disabling the tracing of
recorded functions:
| 1090 | |
| 1091 | set_ftrace_filter |
| 1092 | |
| 1093 | and |
| 1094 | |
| 1095 | set_ftrace_notrace |
| 1096 | |
A list of available functions that you can add to these files is
listed in:
| 1099 | |
| 1100 | available_filter_functions |
| 1101 | |
| 1102 | # cat /debug/tracing/available_filter_functions |
| 1103 | put_prev_task_idle |
| 1104 | kmem_cache_create |
| 1105 | pick_next_task_rt |
| 1106 | get_online_cpus |
| 1107 | pick_next_task_fair |
| 1108 | mutex_lock |
| 1109 | [...] |
| 1110 | |
| 1111 | If I'm only interested in sys_nanosleep and hrtimer_interrupt: |
| 1112 | |
| 1113 | # echo sys_nanosleep hrtimer_interrupt \ |
| 1114 | > /debug/tracing/set_ftrace_filter |
| 1115 | # echo ftrace > /debug/tracing/current_tracer |
| 1116 | # echo 1 > /debug/tracing/tracing_enabled |
| 1117 | # usleep 1 |
| 1118 | # echo 0 > /debug/tracing/tracing_enabled |
| 1119 | # cat /debug/tracing/trace |
| 1120 | # tracer: ftrace |
| 1121 | # |
| 1122 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 1123 | # | | | | | |
| 1124 | usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt |
| 1125 | usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call |
| 1126 | <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt |
| 1127 | |
| 1128 | To see what functions are being traced, you can cat the file: |
| 1129 | |
| 1130 | # cat /debug/tracing/set_ftrace_filter |
| 1131 | hrtimer_interrupt |
| 1132 | sys_nanosleep |
| 1133 | |
| 1134 | |
Perhaps this isn't enough. The filters also allow simple wild cards.
Only the following are currently available:
| 1137 | |
| 1138 | <match>* - will match functions that begins with <match> |
| 1139 | *<match> - will match functions that end with <match> |
| 1140 | *<match>* - will match functions that have <match> in it |
| 1141 | |
That's all the wild cards that are allowed.
| 1143 | |
| 1144 | <match>*<match> will not work. |
| 1145 | |
| 1146 | # echo hrtimer_* > /debug/tracing/set_ftrace_filter |
| 1147 | |
| 1148 | Produces: |
| 1149 | |
| 1150 | # tracer: ftrace |
| 1151 | # |
| 1152 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 1153 | # | | | | | |
| 1154 | bash-4003 [00] 1480.611794: hrtimer_init <-copy_process |
| 1155 | bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set |
| 1156 | bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear |
| 1157 | bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel |
| 1158 | <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt |
| 1159 | <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt |
| 1160 | <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt |
| 1161 | <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt |
| 1162 | <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt |
| 1163 | |
| 1164 | |
| 1165 | Notice that we lost the sys_nanosleep. |
| 1166 | |
| 1167 | # cat /debug/tracing/set_ftrace_filter |
| 1168 | hrtimer_run_queues |
| 1169 | hrtimer_run_pending |
| 1170 | hrtimer_init |
| 1171 | hrtimer_cancel |
| 1172 | hrtimer_try_to_cancel |
| 1173 | hrtimer_forward |
| 1174 | hrtimer_start |
| 1175 | hrtimer_reprogram |
| 1176 | hrtimer_force_reprogram |
| 1177 | hrtimer_get_next_event |
| 1178 | hrtimer_interrupt |
| 1179 | hrtimer_nanosleep |
| 1180 | hrtimer_wakeup |
| 1181 | hrtimer_get_remaining |
| 1182 | hrtimer_get_res |
| 1183 | hrtimer_init_sleeper |
| 1184 | |
| 1185 | |
| 1186 | This is because the '>' and '>>' act just like they do in bash. |
| 1187 | To rewrite the filters, use '>' |
| 1188 | To append to the filters, use '>>' |
| 1189 | |
To clear out a filter so that all functions will be recorded again:
| 1191 | |
| 1192 | # echo > /debug/tracing/set_ftrace_filter |
| 1193 | # cat /debug/tracing/set_ftrace_filter |
| 1194 | # |
| 1195 | |
| 1196 | Again, now we want to append. |
| 1197 | |
| 1198 | # echo sys_nanosleep > /debug/tracing/set_ftrace_filter |
| 1199 | # cat /debug/tracing/set_ftrace_filter |
| 1200 | sys_nanosleep |
| 1201 | # echo hrtimer_* >> /debug/tracing/set_ftrace_filter |
| 1202 | # cat /debug/tracing/set_ftrace_filter |
| 1203 | hrtimer_run_queues |
| 1204 | hrtimer_run_pending |
| 1205 | hrtimer_init |
| 1206 | hrtimer_cancel |
| 1207 | hrtimer_try_to_cancel |
| 1208 | hrtimer_forward |
| 1209 | hrtimer_start |
| 1210 | hrtimer_reprogram |
| 1211 | hrtimer_force_reprogram |
| 1212 | hrtimer_get_next_event |
| 1213 | hrtimer_interrupt |
| 1214 | sys_nanosleep |
| 1215 | hrtimer_nanosleep |
| 1216 | hrtimer_wakeup |
| 1217 | hrtimer_get_remaining |
| 1218 | hrtimer_get_res |
| 1219 | hrtimer_init_sleeper |
| 1220 | |
| 1221 | |
| 1222 | The set_ftrace_notrace prevents those functions from being traced. |
| 1223 | |
| 1224 | # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace |
| 1225 | |
| 1226 | Produces: |
| 1227 | |
| 1228 | # tracer: ftrace |
| 1229 | # |
| 1230 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 1231 | # | | | | | |
| 1232 | bash-4043 [01] 115.281644: finish_task_switch <-schedule |
| 1233 | bash-4043 [01] 115.281645: hrtick_set <-schedule |
| 1234 | bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set |
| 1235 | bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run |
| 1236 | bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion |
| 1237 | bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run |
| 1238 | bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop |
| 1239 | bash-4043 [01] 115.281648: wake_up_process <-kthread_stop |
| 1240 | bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process |
| 1241 | |
| 1242 | We can see that there's no more lock or preempt tracing. |
| 1243 | |
| 1244 | ftraced |
| 1245 | ------- |
| 1246 | |
As mentioned above, when dynamic ftrace is configured in, a kernel
thread wakes up once a second and checks to see if there are mcount
calls that need to be converted into nops. If there are none, then
it simply goes back to sleep. But if there are, it will call
kstop_machine to convert the calls to nops.
| 1252 | |
| 1253 | There may be a case that you do not want this added latency. |
| 1254 | Perhaps you are doing some audio recording and this activity might |
| 1255 | cause skips in the playback. There is an interface to disable |
| 1256 | and enable the ftraced kernel thread. |
| 1257 | |
| 1258 | # echo 0 > /debug/tracing/ftraced_enabled |
| 1259 | |
| 1260 | This will disable the calling of the kstop_machine to update the |
| 1261 | mcount calls to nops. Remember that there's a large overhead |
| 1262 | to calling mcount. Without this kernel thread, that overhead will |
| 1263 | exist. |
| 1264 | |
| 1265 | Any write to the ftraced_enabled file will cause the kstop_machine |
| 1266 | to run if there are recorded calls to mcount. This means that a |
| 1267 | user can manually perform the updates when they want to by simply |
| 1268 | echoing a '0' into the ftraced_enabled file. |
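
A minimal sketch of that workflow (assuming that writing a '1'
re-enables the thread, mirroring the '0' that disables it):

 # echo 0 > /debug/tracing/ftraced_enabled
 [... run the latency sensitive work, e.g. the audio recording ...]
 # echo 1 > /debug/tracing/ftraced_enabled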
| 1269 | |
| 1270 | The updates are also done at the beginning of enabling a tracer |
| 1271 | that uses ftrace function recording. |
| 1272 | |
| 1273 | |
| 1274 | trace_pipe |
| 1275 | ---------- |
| 1276 | |
| 1277 | The trace_pipe outputs the same as trace, but the effect on the |
| 1278 | tracing is different. Every read from trace_pipe is consumed. |
| 1279 | This means that subsequent reads will be different. The trace |
| 1280 | is live. |
| 1281 | |
| 1282 | # echo ftrace > /debug/tracing/current_tracer |
| 1283 | # cat /debug/tracing/trace_pipe > /tmp/trace.out & |
| 1284 | [1] 4153 |
| 1285 | # echo 1 > /debug/tracing/tracing_enabled |
| 1286 | # usleep 1 |
| 1287 | # echo 0 > /debug/tracing/tracing_enabled |
| 1288 | # cat /debug/tracing/trace |
| 1289 | # tracer: ftrace |
| 1290 | # |
| 1291 | # TASK-PID CPU# TIMESTAMP FUNCTION |
| 1292 | # | | | | | |
| 1293 | |
| 1294 | # |
| 1295 | # cat /tmp/trace.out |
| 1296 | bash-4043 [00] 41.267106: finish_task_switch <-schedule |
| 1297 | bash-4043 [00] 41.267106: hrtick_set <-schedule |
| 1298 | bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set |
| 1299 | bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run |
| 1300 | bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion |
| 1301 | bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run |
| 1302 | bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop |
| 1303 | bash-4043 [00] 41.267110: wake_up_process <-kthread_stop |
| 1304 | bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process |
| 1305 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up |
| 1306 | |
| 1307 | |
Note, reading the trace_pipe file will block until more input is
added. By changing the tracer, trace_pipe will issue an EOF. That is
why we needed to set the ftrace tracer _before_ we cat the trace_pipe
file.
| 1311 | |
| 1312 | |
| 1313 | trace entries |
| 1314 | ------------- |
| 1315 | |
Having too much or not enough data can be troublesome in diagnosing
an issue in the kernel. The file trace_entries is used to modify
the size of the internal trace buffers. The number listed
is the number of entries that can be recorded per CPU. To know
the full size, multiply the number of possible CPUs by the
number of entries.
| 1322 | |
| 1323 | # cat /debug/tracing/trace_entries |
| 1324 | 65620 |
| 1325 | |
Note, to modify this you must have tracing fully disabled. To do that,
echo "none" into the current_tracer.
| 1328 | |
| 1329 | # echo none > /debug/tracing/current_tracer |
| 1330 | # echo 100000 > /debug/tracing/trace_entries |
| 1331 | # cat /debug/tracing/trace_entries |
| 1332 | 100045 |
| 1333 | |
| 1334 | |
Notice that we echoed in 100,000 but the size is 100,045. The entries
are held in individual pages. The kernel allocates the number of pages
it takes to fulfill the request, and if more entries fit on the last
page, they are added.
| 1339 | |
| 1340 | # echo 1 > /debug/tracing/trace_entries |
| 1341 | # cat /debug/tracing/trace_entries |
| 1342 | 85 |
| 1343 | |
| 1344 | This shows us that 85 entries can fit on a single page. |
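
For reference, the default of 65620 entries shown earlier works out to
65620 / 85 = 772 pages per CPU.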
| 1345 | |
The number of pages that will be allocated is limited to a percentage
of available memory. Allocating too much will produce an error.
| 1348 | |
| 1349 | # echo 1000000000000 > /debug/tracing/trace_entries |
| 1350 | -bash: echo: write error: Cannot allocate memory |
| 1351 | # cat /debug/tracing/trace_entries |
| 1352 | 85 |
| 1353 | |