Table of contents
=================

Last updated: 20 December 2005

Contents
========

- Introduction
- Devices not appearing
- Finding patch that caused a bug
-- Finding using git-bisect
-- Finding it the old way
- Fixing the bug

Introduction
============

Always try the latest kernel from kernel.org and build from source. If you are
not confident in doing that, please report the bug to your distribution vendor
instead of to a kernel developer.

Finding bugs is not always easy. Have a go though. If you can't find it, don't
give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.

Before you submit a bug report, read REPORTING-BUGS.

Devices not appearing
=====================

Often this is caused by udev. Check that first before blaming it on the
kernel.

Finding patch that caused a bug
===============================

Finding using git-bisect
------------------------

Using the provided tools with git makes finding bugs easy, provided the bug is
reproducible.

Steps to do it:
- start using git for the kernel source
- read the man page for git-bisect
- have fun

Finding it the old way
----------------------

[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]

This is how to track down a bug if you know nothing about kernel hacking.
It's a brute force approach but it works pretty well.

You need:

        . A reproducible bug - it has to happen predictably (sorry)
        . All the kernel tar files from a revision that worked to the
          revision that doesn't

You will then do:

        . Rebuild a revision that you believe works, install, and verify that.
        . Do a binary search over the kernels to figure out which one
          introduced the bug. For example, suppose 1.3.28 didn't have the
          bug, but you know that 1.3.69 does. Pick a kernel in the middle
          and build that, like 1.3.50. Build & test; if it works, pick the
          mid point between .50 and .69, else the mid point between .28
          and .50.
        . You'll narrow it down to the kernel that introduced the bug. You
          can probably do better than this but it gets tricky.

        . Narrow it down to a subdirectory

          - Copy kernel that works into "test". Let's say that 3.62 works,
            but 3.63 doesn't. So you diff -r those two kernels and come
            up with a list of directories that changed. For each of those
            directories:

                Copy the non-working directory next to the working directory
                as "dir.63".
                One directory at a time, try moving the working directory to
                "dir.62" and the non-working "dir.63" into its place as "dir":

                        mv dir dir.62
                        mv dir.63 dir
                        find dir -name '*.[oa]' -print | xargs rm -f

                And then rebuild and retest. Assuming that all related
                changes were contained in the sub directory, this should
                isolate the change to a directory.

                Problems: changes in header files may have occurred; I've
                found in my case that they were self explanatory - you may
                or may not want to give up when that happens.

        . Narrow it down to a file

          - You can apply the same technique to each file in the directory,
            hoping that the changes in that file are self contained.

        . Narrow it down to a routine

          - You can take the old file and the new file and manually create
            a merged file that has

                #ifdef VER62
                routine()       /* body copied from the working 3.62 file */
                {
                        ...
                }
                #else
                routine()       /* body copied from the failing 3.63 file */
                {
                        ...
                }
                #endif

            And then walk through that file, one routine at a time and
            prefix it with

                #define VER62
                /* both routines here */
                #undef VER62

            Then recompile, retest, move the ifdefs until you find the one
            that makes the difference.

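The binary search over kernel revisions described in the steps above can be
sketched in Python. This is only an illustration under stated assumptions:
first_bad_revision() and the has_bug() callback are made-up names, and
has_bug() stands in for the manual "build, install, test" round. The list must
be ordered oldest to newest, the first entry must be known to work and the
last entry known to fail.

```python
def first_bad_revision(revisions, has_bug):
    """Return the first revision in `revisions` for which has_bug() is true.

    Assumes `revisions` is ordered oldest to newest, that revisions[0] is
    known good and revisions[-1] is known bad. has_bug() is a hypothetical
    callback standing in for one manual build-and-test round.
    """
    good = 0                       # newest index known to work
    bad = len(revisions) - 1       # oldest index known to fail
    while bad - good > 1:
        mid = (good + bad) // 2    # pick a kernel in the middle
        if has_bug(revisions[mid]):
            bad = mid              # bug is already present: look earlier
        else:
            good = mid             # still works: look later
    return revisions[bad]


if __name__ == "__main__":
    # Patch levels 1.3.28 .. 1.3.69; pretend the bug arrived in 1.3.57.
    kernels = ["1.3.%d" % n for n in range(28, 70)]
    print(first_bad_revision(kernels,
                             lambda v: int(v.split(".")[2]) >= 57))
```

For the 42 kernels from 1.3.28 to 1.3.69 this needs at most six build-and-test
rounds, rather than trying the forty intermediate kernels one by one.
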
Finally, you take all the info that you have, kernel revisions, bug
description, the extent to which you have narrowed it down, and pass
that off to whoever you believe is the maintainer of that section.
A post to linux.dev.kernel isn't such a bad idea if you've done some
work to narrow it down.

If you get it down to a routine, you'll probably get a fix in 24 hours.

My apologies to Linus and the other kernel hackers for describing this
brute force approach, it's hardly what a kernel hacker would do. However,
it does work and it lets non-hackers help fix bugs. And it is cool
because Linux snapshots will let you do this - something that you can't
do with vendor supplied releases.

Fixing the bug
==============

Nobody is going to tell you how to fix bugs. Seriously. You need to work it
out. But below are some hints on how to use the tools.

To debug a kernel, use objdump and look for the hex offset from the crash
output to find the valid line of code/assembler. Without debug symbols, you
will see the assembler code for the routine shown, but if your kernel has
debug symbols the C code will also be available. (Debug symbols can be enabled
in the kernel hacking menu of the menu configuration.) For example:

    objdump -r -S -l --disassemble net/dccp/ipv4.o

NB.: you need to be at the top level of the kernel tree for this to pick up
your C files.

If you don't have access to the code, you can also work directly from some
crash dumps, e.g. crash dump output as shown by Dave Miller:

>  EIP is at ip_queue_xmit+0x14/0x4c0
>   ...
>  Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
>  00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
>  <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85
>
>  Put the bytes into a "foo.s" file like this:
>
>         .text
>         .globl foo
>  foo:
>         .byte  .... /* bytes from Code: part of OOPS dump */
>
>  Compile it with "gcc -c -o foo.o foo.s" then look at the output of
>  "objdump --disassemble foo.o".
>
>  Output:
>
>  ip_queue_xmit:
>      push       %ebp
>      push       %edi
>      push       %esi
>      push       %ebx
>      sub        $0xbc, %esp
>      mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
>      mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
>      mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt

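Typing the .byte line by hand is error prone, so the "foo.s" step quoted above
can be scripted. The following is a minimal sketch, not an existing kernel
tool: oops_code_to_asm() is a made-up helper name, and it assumes the input is
the text of the "Code:" line(s) as pasted from the OOPS, possibly still
carrying ">" mail quoting. The byte wrapped in <..> is the one EIP points at,
so its position is returned as well.

```python
def oops_code_to_asm(code_text, symbol="foo"):
    """Build assembler source from the hex bytes of an OOPS "Code:" dump.

    Returns (asm_source, fault_index), where fault_index is the position of
    the byte that was wrapped in <..> (None if no byte was marked).
    Raises ValueError if a token is not a two-digit hex byte.
    """
    # Keep only what follows the "Code:" label, if the label is present.
    _, _, tail = code_text.rpartition("Code:")

    byte_values = []
    fault_index = None
    for token in tail.split():
        if token == ">":                    # mail quoting, not data
            continue
        if token.startswith("<") and token.endswith(">"):
            fault_index = len(byte_values)  # byte at the faulting EIP
            token = token[1:-1]
        if len(token) != 2:
            raise ValueError("not a hex byte: %r" % token)
        byte_values.append(int(token, 16))

    lines = ["\t.text", "\t.globl %s" % symbol, "%s:" % symbol]
    for start in range(0, len(byte_values), 8):
        chunk = byte_values[start:start + 8]
        lines.append("\t.byte " + ", ".join("0x%02x" % b for b in chunk))
    return "\n".join(lines) + "\n", fault_index


if __name__ == "__main__":
    source, fault = oops_code_to_asm("Code: 8b 5d 08 <8b> 83 3c 01 00 00")
    print(source)
    print("faulting instruction starts at byte", fault)
```

Write the returned text to "foo.s" and continue with the gcc and objdump
commands from the quoted instructions.
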
In addition, you can use GDB to figure out the exact file and line
number of the OOPS from the vmlinux file. If you have
CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the
OOPS:

    EIP:    0060:[<c021e50e>]    Not tainted VLI

And use GDB to translate that to human-readable form:

    gdb vmlinux
    (gdb) l *0xc021e50e

If you don't have CONFIG_DEBUG_INFO enabled, you use the function
offset from the OOPS:

    EIP is at vt_ioctl+0xda8/0x1482

And recompile the kernel with CONFIG_DEBUG_INFO enabled:

    make vmlinux
    gdb vmlinux
    (gdb) p vt_ioctl
    (gdb) l *(0x<address of vt_ioctl> + 0xda8)

or, as one command:

    (gdb) l *(vt_ioctl + 0xda8)

If you have a call trace, such as:

>Call Trace:
> [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
> [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
> [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
> ...

this shows the problem in the :jbd: module. You can load that module in gdb
and list the relevant code:

    gdb fs/jbd/jbd.ko
    (gdb) p log_wait_commit
    (gdb) l *(0x<address> + 0xa3)

or

    (gdb) l *(log_wait_commit + 0xa3)

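The "symbol+offset/size" notation used in the EIP line and in call traces can
be pulled apart mechanically. The sketch below is an illustration only:
parse_oops_location() and gdb_list_command() are made-up names, and the
pattern assumes the text looks like the examples above (an optional :module:
prefix, then the symbol, the offset and the routine size in hex).

```python
import re

# [:module:]symbol+0xOFFSET/0xSIZE, as in the OOPS examples above.
_LOCATION = re.compile(
    r"(?::(\w+):)?([\w.]+)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)")


def parse_oops_location(text):
    """Split e.g. 'vt_ioctl+0xda8/0x1482' into its parts."""
    match = _LOCATION.search(text)
    if match is None:
        raise ValueError("no symbol+offset/size found in %r" % text)
    module, symbol, offset, size = match.groups()
    return {
        "module": module,            # None when the code is in vmlinux
        "symbol": symbol,
        "offset": int(offset, 16),   # bytes from the start of the routine
        "size": int(size, 16),       # total length of the routine
    }


def gdb_list_command(text):
    """Return the gdb 'l' command that lists the source at that location."""
    loc = parse_oops_location(text)
    return "l *(%s + 0x%x)" % (loc["symbol"], loc["offset"])


if __name__ == "__main__":
    print(gdb_list_command("EIP is at vt_ioctl+0xda8/0x1482"))
    print(gdb_list_command(
        " [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5"))
```

When "module" is set, run the resulting command in gdb against that module's
.ko file rather than vmlinux.
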
Another very useful option of the Kernel Hacking section in menuconfig is
Debug memory allocations. This will help you see whether data is being used
before it has been initialised, etc. To see the values that get assigned
with this, look at mm/slab.c and search for POISON_INUSE. When using this, an
Oops will often show the poisoned data instead of zero, which is the default.

Once you have worked out a fix, please submit it upstream. After all, open
source is about sharing what you do, and don't you want to be recognised for
your genius?

Please do read Documentation/SubmittingPatches though to help your code get
accepted.