| Bug hunting |
| +++++++++++ |
| |
| Last updated: 20 December 2005 |
| |
| Introduction |
| ============ |
| |
| Always try the latest kernel from kernel.org and build from source. If you are |
| not confident in doing that please report the bug to your distribution vendor |
| instead of to a kernel developer. |
| |
| Finding bugs is not always easy. Have a go though. If you can't find it don't |
| give up. Report as much as you have found to the relevant maintainer. See |
| MAINTAINERS for who that is for the subsystem you have worked on. |
| |
| Before you submit a bug report read |
| :ref:`Documentation/admin-guide/reporting-bugs.rst <reportingbugs>`. |
| |
| Devices not appearing |
| ===================== |
| |
| Often this is caused by udev. Check that first before blaming it on the |
| kernel. |
| |
| Finding patch that caused a bug |
| =============================== |
| |
| |
| |
| Finding using ``git-bisect`` |
| ---------------------------- |
| |
| Using the provided tools with ``git`` makes finding bugs easy provided the bug |
| is reproducible. |
| |
| Steps to do it: |
| |
| - start using git for the kernel source |
| - read the man page for ``git-bisect`` |
| - have fun |
| |
| Finding it the old way |
| ---------------------- |
| |
| [Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] |
| |
| This is how to track down a bug if you know nothing about kernel hacking. |
| It's a brute force approach but it works pretty well. |
| |
| You need: |
| |
| - A reproducible bug - it has to happen predictably (sorry) |
| - All the kernel tar files from a revision that worked to the |
| revision that doesn't |
| |
| You will then do: |
| |
| - Rebuild a revision that you believe works, install, and verify that. |
| - Do a binary search over the kernels to figure out which one |
| introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but |
| you know that 1.3.69 does. Pick a kernel in the middle and build |
| that, like 1.3.50. Build & test; if it works, pick the mid point |
| between .50 and .69, else the mid point between .28 and .50. |
| - You'll narrow it down to the kernel that introduced the bug. You |
| can probably do better than this but it gets tricky. |
| |
| - Narrow it down to a subdirectory |
| |
| - Copy kernel that works into "test". Let's say that 3.62 works, |
| but 3.63 doesn't. So you diff -r those two kernels and come |
| up with a list of directories that changed. For each of those |
| directories: |
| |
| Copy the non-working directory next to the working directory |
| as "dir.63". |
| One directory at time, try moving the working directory to |
| "dir.62" and mv dir.63 dir"time, try:: |
| |
| mv dir dir.62 |
| mv dir.63 dir |
| find dir -name '*.[oa]' -print | xargs rm -f |
| |
| And then rebuild and retest. Assuming that all related |
| changes were contained in the sub directory, this should |
| isolate the change to a directory. |
| |
| Problems: changes in header files may have occurred; I've |
| found in my case that they were self explanatory - you may |
| or may not want to give up when that happens. |
| |
| - Narrow it down to a file |
| |
| - You can apply the same technique to each file in the directory, |
| hoping that the changes in that file are self contained. |
| |
| - Narrow it down to a routine |
| |
| - You can take the old file and the new file and manually create |
| a merged file that has:: |
| |
| #ifdef VER62 |
| routine() |
| { |
| ... |
| } |
| #else |
| routine() |
| { |
| ... |
| } |
| #endif |
| |
| And then walk through that file, one routine at a time and |
| prefix it with:: |
| |
| #define VER62 |
| /* both routines here */ |
| #undef VER62 |
| |
| Then recompile, retest, move the ifdefs until you find the one |
| that makes the difference. |
| |
| Finally, you take all the info that you have, kernel revisions, bug |
| description, the extent to which you have narrowed it down, and pass |
| that off to whomever you believe is the maintainer of that section. |
| A post to linux.dev.kernel isn't such a bad idea if you've done some |
| work to narrow it down. |
| |
| If you get it down to a routine, you'll probably get a fix in 24 hours. |
| |
| My apologies to Linus and the other kernel hackers for describing this |
| brute force approach, it's hardly what a kernel hacker would do. However, |
| it does work and it lets non-hackers help fix bugs. And it is cool |
| because Linux snapshots will let you do this - something that you can't |
| do with vendor supplied releases. |
| |
| Fixing the bug |
| ============== |
| |
| Nobody is going to tell you how to fix bugs. Seriously. You need to work it |
| out. But below are some hints on how to use the tools. |
| |
| To debug a kernel, use objdump and look for the hex offset from the crash |
| output to find the valid line of code/assembler. Without debug symbols, you |
| will see the assembler code for the routine shown, but if your kernel has |
| debug symbols the C code will also be available. (Debug symbols can be enabled |
| in the kernel hacking menu of the menu configuration.) For example:: |
| |
| objdump -r -S -l --disassemble net/dccp/ipv4.o |
| |
| .. note:: |
| |
| You need to be at the top level of the kernel tree for this to pick up |
| your C files. |
| |
| If you don't have access to the code you can also debug on some crash dumps |
| e.g. crash dump output as shown by Dave Miller:: |
| |
| EIP is at ip_queue_xmit+0x14/0x4c0 |
| ... |
| Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 |
| 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 |
| <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85 |
| |
| Put the bytes into a "foo.s" file like this: |
| |
| .text |
| .globl foo |
| foo: |
| .byte .... /* bytes from Code: part of OOPS dump */ |
| |
| Compile it with "gcc -c -o foo.o foo.s" then look at the output of |
| "objdump --disassemble foo.o". |
| |
| Output: |
| |
| ip_queue_xmit: |
| push %ebp |
| push %edi |
| push %esi |
| push %ebx |
| sub $0xbc, %esp |
| mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb) |
| mov 0x8(%ebp), %ebx ! %ebx = skb->sk |
| mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt |
| |
| In addition, you can use GDB to figure out the exact file and line |
| number of the OOPS from the ``vmlinux`` file. If you have |
| ``CONFIG_DEBUG_INFO`` enabled, you can simply copy the EIP value from the |
| OOPS:: |
| |
| EIP: 0060:[<c021e50e>] Not tainted VLI |
| |
| And use GDB to translate that to human-readable form:: |
| |
| gdb vmlinux |
| (gdb) l *0xc021e50e |
| |
| If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function |
| offset from the OOPS:: |
| |
| EIP is at vt_ioctl+0xda8/0x1482 |
| |
| And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled:: |
| |
| make vmlinux |
| gdb vmlinux |
| (gdb) p vt_ioctl |
| (gdb) l *(0x<address of vt_ioctl> + 0xda8) |
| |
| or, as one command:: |
| |
| (gdb) l *(vt_ioctl + 0xda8) |
| |
| If you have a call trace, such as:: |
| |
| Call Trace: |
| [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5 |
| [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e |
| [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee |
| ... |
| |
| this shows the problem in the :jbd: module. You can load that module in gdb |
| and list the relevant code:: |
| |
| gdb fs/jbd/jbd.ko |
| (gdb) p log_wait_commit |
| (gdb) l *(0x<address> + 0xa3) |
| |
| or:: |
| |
| (gdb) l *(log_wait_commit + 0xa3) |
| |
| |
| Another very useful option of the Kernel Hacking section in menuconfig is |
| Debug memory allocations. This will help you see whether data has been |
| initialised and not set before use etc. To see the values that get assigned |
| with this look at ``mm/slab.c`` and search for ``POISON_INUSE``. When using |
| this an Oops will often show the poisoned data instead of zero which is the |
| default. |
| |
| Once you have worked out a fix please submit it upstream. After all open |
| source is about sharing what you do and don't you want to be recognised for |
| your genius? |
| |
| Please do read |
| ref:`Documentation/process/submitting-patches.rst <submittingpatches>` though |
| to help your code get accepted. |