Vegard Nossum | e594c8d | 2009-06-13 14:15:57 +0200 | 1 | GETTING STARTED WITH KMEMCHECK |
| 2 | ============================== |
| 3 | |
| 4 | Vegard Nossum <vegardno@ifi.uio.no> |
| 5 | |
| 6 | |
| 7 | Contents |
| 8 | ======== |
| 9 | 0. Introduction |
| 10 | 1. Downloading |
| 11 | 2. Configuring and compiling |
| 12 | 3. How to use |
| 13 | 3.1. Booting |
| 14 | 3.2. Run-time enable/disable |
| 15 | 3.3. Debugging |
| 16 | 3.4. Annotating false positives |
| 17 | 4. Reporting errors |
| 18 | 5. Technical description |
| 19 | |
| 20 | |
| 21 | 0. Introduction |
| 22 | =============== |
| 23 | |
| 24 | kmemcheck is a debugging feature for the Linux kernel. More specifically, it |
| 25 | is a dynamic checker that detects and warns about some uses of uninitialized |
| 26 | memory. |
| 27 | |
| 28 | Userspace programmers might be familiar with Valgrind's memcheck. The main |
| 29 | difference between memcheck and kmemcheck is that memcheck works for userspace |
| 30 | programs only, and kmemcheck works for the kernel only. The implementations |
| 31 | are of course vastly different. Because of this, kmemcheck is not as accurate |
| 32 | as memcheck, but it turns out to be good enough in practice to discover real |
| 33 | programmer errors that the compiler is not able to find through static |
| 34 | analysis. |
| 35 | |
| 36 | Enabling kmemcheck on a kernel will probably slow it down to the extent that |
| 37 | the machine will not be usable for normal workloads such as an |
| 38 | interactive desktop. kmemcheck will also cause the kernel to use about twice |
| 39 | as much memory as normal. For this reason, kmemcheck is strictly a debugging |
| 40 | feature. |
| 41 | |
| 42 | |
| 43 | 1. Downloading |
| 44 | ============== |
| 45 | |
Vegard Nossum | e3c6c4a | 2009-07-01 22:36:22 +0200 | 46 | As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. |
| 47 | |
| 48 | |
| 49 | 2. Configuring and compiling |
| 50 | ============================ |
| 51 | |
| 52 | kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of |
| 53 | configuration variables must have specific settings in order for the kmemcheck |
| 54 | menu to even appear in "menuconfig". These are: |
| 55 | |
| 56 | o CONFIG_CC_OPTIMIZE_FOR_SIZE=n |
| 57 | |
| 58 | This option is located under "General setup" / "Optimize for size". |
| 59 | |
| 60 | Without this, gcc will use certain optimizations that usually lead to |
| 61 | false positive warnings from kmemcheck. An example of this is a 16-bit |
| 62 | field in a struct, where gcc may load 32 bits, then discard the upper |
| 63 | 16 bits. kmemcheck sees only the 32-bit load, and may trigger a |
| 64 | warning for the upper 16 bits (if they're uninitialized). |
| 65 | |
| 66 | o CONFIG_SLAB=y or CONFIG_SLUB=y |
| 67 | |
| 68 | This option is located under "General setup" / "Choose SLAB |
| 69 | allocator". |
| 70 | |
| 71 | o CONFIG_FUNCTION_TRACER=n |
| 72 | |
| 73 | This option is located under "Kernel hacking" / "Tracers" / "Kernel |
| 74 | Function Tracer". |
| 75 | |
| 76 | When function tracing is compiled in, gcc emits a call to another |
| 77 | function at the beginning of every function. This means that when the |
| 78 | page fault handler is called, the ftrace framework will be called |
| 79 | before kmemcheck has had a chance to handle the fault. If ftrace then |
| 80 | modifies memory that was tracked by kmemcheck, the result is an |
| 81 | endless recursive page fault. |
| 82 | |
| 83 | o CONFIG_DEBUG_PAGEALLOC=n |
| 84 | |
| 85 | This option is located under "Kernel hacking" / "Debug page memory |
| 86 | allocations". |
| 87 | |
| 88 | In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also |
| 89 | located under "Kernel hacking". With this, you will be able to get line number |
| 90 | information from the kmemcheck warnings, which is extremely valuable in |
| 91 | debugging a problem. This option is not mandatory, however, because it slows |
| 92 | down the compilation process and produces a much bigger kernel image. |
| 93 | |
| 94 | Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck: |
| 95 | trap use of uninitialized memory"). Here follows a description of the |
| 96 | kmemcheck configuration variables: |
| 97 | |
| 98 | o CONFIG_KMEMCHECK |
| 99 | |
| 100 | This must be enabled in order to use kmemcheck at all... |
| 101 | |
| 102 | o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT |
| 103 | |
| 104 | This option controls the status of kmemcheck at boot-time. "Enabled" |
| 105 | will enable kmemcheck right from the start, "disabled" will boot the |
| 106 | kernel as normal (but with the kmemcheck code compiled in, so it can |
| 107 | be enabled at run-time after the kernel has booted), and "one-shot" is |
| 108 | a special mode which will turn kmemcheck off automatically after |
| 109 | detecting the first use of uninitialized memory. |
| 110 | |
| 111 | If you are using kmemcheck to actively debug a problem, then you |
| 112 | probably want to choose "enabled" here. |
| 113 | |
| 114 | The one-shot mode is mostly useful in automated test setups because it |
| 115 | can prevent floods of warnings and increase the chances of the machine |
| 116 | surviving in case something is really wrong. In other cases, the one- |
| 117 | shot mode could actually be counter-productive because it would turn |
| 118 | itself off at the very first error -- in the case of a false positive |
| 119 | too -- and this would get in the way of debugging the specific |
| 120 | problem you were interested in. |
| 121 | |
| 122 | If you would like to use your kernel as normal, but with a chance to |
| 123 | enable kmemcheck in case of some problem, it might be a good idea to |
| 124 | choose "disabled" here. When kmemcheck is disabled, most of the run- |
| 125 | time overhead is not incurred, and the kernel will be almost as fast |
| 126 | as normal. |
| 127 | |
| 128 | o CONFIG_KMEMCHECK_QUEUE_SIZE |
| 129 | |
| 130 | Select the maximum number of error reports to store in an internal |
| 131 | (fixed-size) buffer. Since errors can occur virtually anywhere and in |
| 132 | any context, we need a temporary storage area which is guaranteed not |
| 133 | to generate any other page faults when accessed. The queue will be |
| 134 | emptied as soon as a tasklet may be scheduled. If the queue is full, |
| 135 | new error reports will be lost. |
| 136 | |
| 137 | The default value of 64 is probably fine. If some code produces more |
| 138 | than 64 errors within an irqs-off section, then the code is likely to |
| 139 | produce many, many more, too, and these additional reports seldom give |
| 140 | any more information (the first report is usually the most valuable |
| 141 | anyway). |
| 142 | |
| 143 | This number might have to be adjusted if you are not using a serial |
| 144 | console or similar to capture the kernel log. If you are using the |
| 145 | "dmesg" command to save the log, then getting a lot of kmemcheck |
| 146 | warnings might overflow the kernel log itself, and the earlier reports |
| 147 | will get lost in that way instead. Try setting this to 10 or so on |
| 148 | such a setup. |
| 149 | |
| 150 | o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT |
| 151 | |
| 152 | Select the number of shadow bytes to save along with each entry of the |
| 153 | error-report queue. These bytes indicate what parts of an allocation |
| 154 | are initialized, uninitialized, etc. and will be displayed when an |
| 155 | error is detected to help the debugging of a particular problem. |
| 156 | |
| 157 | The number entered here is actually the base-2 logarithm of the number of |
| 158 | bytes that will be saved. So if you pick for example 5 here, kmemcheck |
| 159 | will save 2^5 = 32 bytes. |
| 160 | |
| 161 | The default value should be fine for debugging most problems. It also |
| 162 | fits nicely within 80 columns. |
| 163 | |
| 164 | o CONFIG_KMEMCHECK_PARTIAL_OK |
| 165 | |
| 166 | This option (when enabled) works around certain GCC optimizations that |
| 167 | produce 32-bit reads from 16-bit variables where the upper 16 bits are |
| 168 | thrown away afterwards. |
| 169 | |
| 170 | The default value (enabled) is recommended. This may of course hide |
| 171 | some real errors, but disabling it would probably produce a lot of |
| 172 | false positives. |
| 173 | |
| 174 | o CONFIG_KMEMCHECK_BITOPS_OK |
| 175 | |
| 176 | This option silences warnings that would be generated for bit-field |
| 177 | accesses where not all the bits are initialized at the same time. This |
| 178 | may also hide some real bugs. |
| 179 | |
| 180 | This option is probably obsolete, or it should be replaced with |
| 181 | the kmemcheck-/bitfield-annotations for the code in question. The |
| 182 | default value is therefore fine. |
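
The fixed-size report queue described under CONFIG_KMEMCHECK_QUEUE_SIZE can be
pictured with a small userspace sketch. This is not the kmemcheck
implementation: the names report_queue, queue_push() and queue_drain() are
hypothetical, and the only behaviour taken from the text is that storage is
preallocated, that new reports are lost when the queue is full, and that the
queue is emptied later from a safe context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 64	/* mirrors the default value mentioned above */

/* Hypothetical report record; a real report carries much more state. */
struct report {
	unsigned long address;
};

struct report_queue {
	struct report entries[QUEUE_SIZE];	/* preallocated, never grows */
	size_t count;				/* reports currently stored */
	size_t dropped;				/* reports lost while full */
};

/* Store a report without allocating memory; drop it if the queue is full. */
static bool queue_push(struct report_queue *q, unsigned long address)
{
	assert(q != NULL);
	if (q->count == QUEUE_SIZE) {
		q->dropped++;
		return false;
	}
	q->entries[q->count++].address = address;
	return true;
}

/*
 * Called later from a context where printing is safe: pass every stored
 * report to emit() (if one is given), then empty the queue.
 */
static size_t queue_drain(struct report_queue *q,
			  void (*emit)(const struct report *))
{
	size_t n, i;

	assert(q != NULL);
	n = q->count;
	for (i = 0; i < n; i++) {
		if (emit)
			emit(&q->entries[i]);
	}
	q->count = 0;
	return n;
}
```

In this model, the 65th report pushed before a drain is simply counted in
"dropped", which is the trade-off the option description refers to.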
| 183 | |
| 184 | Now compile the kernel as usual. |
| 185 | |
| 186 | |
| 187 | 3. How to use |
| 188 | ============= |
| 189 | |
| 190 | 3.1. Booting |
| 191 | ============ |
| 192 | |
| 193 | First some information about the command-line options. There is only one |
| 194 | option specific to kmemcheck, and this is called "kmemcheck". It can be used |
| 195 | to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT |
| 196 | option. Its possible settings are: |
| 197 | |
| 198 | o kmemcheck=0 (disabled) |
| 199 | o kmemcheck=1 (enabled) |
| 200 | o kmemcheck=2 (one-shot mode) |
| 201 | |
| 202 | If SLUB debugging has been enabled in the kernel, it may take precedence over |
| 203 | kmemcheck in such a way that the slab caches which are under SLUB debugging |
| 204 | will not be tracked by kmemcheck. In order to ensure that this doesn't happen |
| 205 | (even though it shouldn't by default), use SLUB's boot option "slub_debug", |
| 206 | like this: slub_debug=- |
| 207 | |
| 208 | In fact, this option may also be used for fine-grained control over SLUB vs. |
| 209 | kmemcheck. For example, if the command line includes "kmemcheck=1 |
| 210 | slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" |
| 211 | slab cache, with kmemcheck tracking all the other caches. This is advanced |
| 212 | usage, however, and is not generally recommended. |
| 213 | |
| 214 | |
| 215 | 3.2. Run-time enable/disable |
| 216 | ============================ |
| 217 | |
| 218 | When the kernel has booted, it is possible to enable or disable kmemcheck at |
| 219 | run-time. WARNING: This feature is still experimental and may cause false |
| 220 | positive warnings to appear. Therefore, try not to use this. If you find that |
| 221 | it doesn't work properly (e.g. you see an unreasonable number of warnings), I |
| 222 | will be happy to take bug reports. |
| 223 | |
| 224 | Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: |
| 225 | |
| 226 | $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck |
| 227 | |
| 228 | The numbers are the same as for the kmemcheck= command-line option. |
| 229 | |
| 230 | |
| 231 | 3.3. Debugging |
| 232 | ============== |
| 233 | |
| 234 | A typical report will look something like this: |
| 235 | |
| 236 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) |
| 237 | 80000000000000000000000000000000000000000088ffff0000000000000000 |
| 238 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u |
| 239 | ^ |
| 240 | |
| 241 | Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A |
| 242 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 |
| 243 | RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 |
| 244 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 |
| 245 | RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 |
| 246 | RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 |
| 247 | R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e |
| 248 | R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 |
| 249 | FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 |
| 250 | CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 |
| 251 | CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 |
| 252 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 |
| 253 | DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 |
| 254 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 |
| 255 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 |
| 256 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 |
| 257 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 |
| 258 | [<ffffffffffffffff>] 0xffffffffffffffff |
| 259 | |
| 260 | The single most valuable piece of information in this report is the RIP (or EIP |
| 261 | on 32-bit) value. This will help us pinpoint exactly which instruction caused |
| 262 | the warning. |
| 263 | |
| 264 | If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do |
| 265 | is give this address to the addr2line program, like this: |
| 266 | |
| 267 | $ addr2line -e vmlinux -i ffffffff8104ede8 |
| 268 | arch/x86/include/asm/string_64.h:12 |
| 269 | include/asm-generic/siginfo.h:287 |
| 270 | kernel/signal.c:380 |
| 271 | kernel/signal.c:410 |
| 272 | |
| 273 | The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must |
| 274 | be the vmlinux of the kernel that produced the warning in the first place! If |
| 275 | not, the line number information will almost certainly be wrong. |
| 276 | |
| 277 | The "-i" tells addr2line to also print the line numbers of inlined functions. |
| 278 | In this case, the flag was very important, because otherwise, it would only |
| 279 | have printed the first line, which is just a call to memcpy(), which could be |
| 280 | called from a thousand places in the kernel, and is therefore not very useful. |
| 281 | These inlined functions would not show up in the stack trace above, simply |
| 282 | because the kernel doesn't load the extra debugging information. This |
| 283 | technique can of course be used with ordinary kernel oopses as well. |
| 284 | |
| 285 | In this case, it's the caller of memcpy() that is interesting, and it can be |
| 286 | found in include/asm-generic/siginfo.h, line 287: |
| 287 | |
| 288 | 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) |
| 289 | 282 { |
| 290 | 283 if (from->si_code < 0) |
| 291 | 284 memcpy(to, from, sizeof(*to)); |
| 292 | 285 else |
| 293 | 286 /* _sigchld is currently the largest know union member */ |
| 294 | 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); |
| 295 | 288 } |
| 296 | |
| 297 | Since this was a read (kmemcheck usually warns about reads only, though it can |
| 298 | warn about writes to unallocated or freed memory as well), it was probably the |
| 299 | "from" argument which contained some uninitialized bytes. Following the chain |
| 300 | of calls, we move upwards to see where "from" was allocated or initialized, |
| 301 | kernel/signal.c, line 380: |
| 302 | |
| 303 | 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) |
| 304 | 360 { |
| 305 | ... |
| 306 | 367 list_for_each_entry(q, &list->list, list) { |
| 307 | 368 if (q->info.si_signo == sig) { |
| 308 | 369 if (first) |
| 309 | 370 goto still_pending; |
| 310 | 371 first = q; |
| 311 | ... |
| 312 | 377 if (first) { |
| 313 | 378 still_pending: |
| 314 | 379 list_del_init(&first->list); |
| 315 | 380 copy_siginfo(info, &first->info); |
| 316 | 381 __sigqueue_free(first); |
| 317 | ... |
| 318 | 392 } |
| 319 | 393 } |
| 320 | |
| 321 | Here, it is &first->info that is being passed on to copy_siginfo(). The |
| 322 | variable "first" was found on a list -- passed in as the second argument to |
| 323 | collect_signal(). We continue our journey through the stack, to figure out |
| 324 | where the item on "list" was allocated or initialized. We move to line 410: |
| 325 | |
| 326 | 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, |
| 327 | 396 siginfo_t *info) |
| 328 | 397 { |
| 329 | ... |
| 330 | 410 collect_signal(sig, pending, info); |
| 331 | ... |
| 332 | 414 } |
| 333 | |
| 334 | Now we need to follow the "pending" pointer, since that is being passed on to |
| 335 | collect_signal() as "list". At this point, we've run out of lines from the |
| 336 | "addr2line" output. Not to worry, we just paste the next addresses from the |
| 337 | kmemcheck stack dump, i.e.: |
| 338 | |
| 339 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 |
| 340 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 |
| 341 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 |
| 342 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 |
| 343 | |
| 344 | $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ |
| 345 | ffffffff8100b87d ffffffff8100c7b5 |
| 346 | kernel/signal.c:446 |
| 347 | kernel/signal.c:1806 |
| 348 | arch/x86/kernel/signal.c:805 |
| 349 | arch/x86/kernel/signal.c:871 |
| 350 | arch/x86/kernel/entry_64.S:694 |
| 351 | |
| 352 | Remember that since these addresses were found on the stack and not as the |
| 353 | RIP value, they actually point to the _next_ instruction (they are return |
| 354 | addresses). This becomes obvious when we look at the code for line 446: |
| 355 | |
| 356 | 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) |
| 357 | 423 { |
| 358 | ... |
| 359 | 431 signr = __dequeue_signal(&tsk->signal->shared_pending, |
| 360 | 432 mask, info); |
| 361 | 433 /* |
| 362 | 434 * itimer signal ? |
| 363 | 435 * |
| 364 | 436 * itimers are process shared and we restart periodic |
| 365 | 437 * itimers in the signal delivery path to prevent DoS |
| 366 | 438 * attacks in the high resolution timer case. This is |
| 367 | 439 * compliant with the old way of self restarting |
| 368 | 440 * itimers, as the SIGALRM is a legacy signal and only |
| 369 | 441 * queued once. Changing the restart behaviour to |
| 370 | 442 * restart the timer in the signal dequeue path is |
| 371 | 443 * reducing the timer noise on heavy loaded !highres |
| 372 | 444 * systems too. |
| 373 | 445 */ |
| 374 | 446 if (unlikely(signr == SIGALRM)) { |
| 375 | ... |
| 376 | 489 } |
| 377 | |
| 378 | So instead of looking at 446, we should be looking at 431, which is the line |
| 379 | that executes just before 446. Here we see that what we are looking for is |
| 380 | &tsk->signal->shared_pending. |
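
A common way to avoid the manual step back from 446 to 431 is to look up an
address just before the return address, so that it falls inside the call
instruction itself. The helper below is only a sketch of that convention; the
function name is hypothetical and the technique is not part of kmemcheck.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A return address points at the instruction after the call. Any address
 * inside the call instruction maps to the source line of the call, and
 * (return address - 1) is always inside it, whatever the call's length.
 */
static uint64_t call_site_lookup_address(uint64_t return_address)
{
	assert(return_address > 0);
	return return_address - 1;
}
```

For the first stack entry above, this yields ffffffff8104f04d, which could be
given to addr2line in place of ffffffff8104f04e; the lookup should then
resolve to the call rather than to the line that follows it.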
| 381 | |
| 382 | Our next task is now to figure out which function puts items on this |
| 383 | "shared_pending" list. A crude but efficient tool is git grep: |
| 384 | |
| 385 | $ git grep -n 'shared_pending' kernel/ |
| 386 | ... |
| 387 | kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; |
| 388 | kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; |
| 389 | ... |
| 390 | |
| 391 | There were more results, but none of them were related to list operations, |
| 392 | and these were the only assignments. We inspect the line numbers more closely |
| 393 | and find that this is indeed where items are being added to the list: |
| 394 | |
| 395 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, |
| 396 | 817 int group) |
| 397 | 818 { |
| 398 | ... |
| 399 | 828 pending = group ? &t->signal->shared_pending : &t->pending; |
| 400 | ... |
| 401 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && |
| 402 | 852 (is_si_special(info) || |
| 403 | 853 info->si_code >= 0))); |
| 404 | 854 if (q) { |
| 405 | 855 list_add_tail(&q->list, &pending->list); |
| 406 | ... |
| 407 | 890 } |
| 408 | |
| 409 | and: |
| 410 | |
| 411 | 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) |
| 412 | 1310 { |
| 413 | .... |
| 414 | 1339 pending = group ? &t->signal->shared_pending : &t->pending; |
| 415 | 1340 list_add_tail(&q->list, &pending->list); |
| 416 | .... |
| 417 | 1347 } |
| 418 | |
| 419 | In the first case, the list element we are looking for, "q", is being returned |
| 420 | from the function __sigqueue_alloc(), which looks like an allocation function. |
| 421 | Let's take a look at it: |
| 422 | |
| 423 | 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, |
| 424 | 188 int override_rlimit) |
| 425 | 189 { |
| 426 | 190 struct sigqueue *q = NULL; |
| 427 | 191 struct user_struct *user; |
| 428 | 192 |
| 429 | 193 /* |
| 430 | 194 * We won't get problems with the target's UID changing under us |
| 431 | 195 * because changing it requires RCU be used, and if t != current, the |
| 432 | 196 * caller must be holding the RCU readlock (by way of a spinlock) and |
| 433 | 197 * we use RCU protection here |
| 434 | 198 */ |
| 435 | 199 user = get_uid(__task_cred(t)->user); |
| 436 | 200 atomic_inc(&user->sigpending); |
| 437 | 201 if (override_rlimit || |
| 438 | 202 atomic_read(&user->sigpending) <= |
| 439 | 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) |
| 440 | 204 q = kmem_cache_alloc(sigqueue_cachep, flags); |
| 441 | 205 if (unlikely(q == NULL)) { |
| 442 | 206 atomic_dec(&user->sigpending); |
| 443 | 207 free_uid(user); |
| 444 | 208 } else { |
| 445 | 209 INIT_LIST_HEAD(&q->list); |
| 446 | 210 q->flags = 0; |
| 447 | 211 q->user = user; |
| 448 | 212 } |
| 449 | 213 |
| 450 | 214 return q; |
| 451 | 215 } |
| 452 | |
| 453 | We see that this function initializes q->list, q->flags, and q->user. It seems |
| 454 | that now is the time to look at the definition of "struct sigqueue": |
| 455 | |
| 456 | 14 struct sigqueue { |
| 457 | 15 struct list_head list; |
| 458 | 16 int flags; |
| 459 | 17 siginfo_t info; |
| 460 | 18 struct user_struct *user; |
| 461 | 19 }; |
| 462 | |
| 463 | And, you might remember, it was a memcpy() on &first->info that caused the |
| 464 | warning, so this makes perfect sense. It also seems reasonable to assume that |
| 465 | it is the caller of __sigqueue_alloc() that has the responsibility of filling |
| 466 | out (initializing) this member. |
| 467 | |
| 468 | But just which fields of the struct were uninitialized? Let's look at |
| 469 | kmemcheck's report again: |
| 470 | |
| 471 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) |
| 472 | 80000000000000000000000000000000000000000088ffff0000000000000000 |
| 473 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u |
| 474 | ^ |
| 475 | |
| 476 | The two lines after the warning are the memory dump of the memory object itself |
| 477 | and the shadow bytemap, respectively. The memory object itself is in this case |
| 478 | &first->info. Just beware that the start of this dump is NOT the start of the |
| 479 | object itself! The position of the caret (^) corresponds with the address of |
| 480 | the read (ffff88003e4a2024). |
| 481 | |
| 482 | The shadow bytemap dump legend is as follows: |
| 483 | |
| 484 | i - initialized |
| 485 | u - uninitialized |
| 486 | a - unallocated (memory has been allocated by the slab layer, but has not |
| 487 | yet been handed off to anybody) |
| 488 | f - freed (memory has been allocated by the slab layer, but has been freed |
| 489 | by the previous owner) |
| 490 | |
| 491 | In order to figure out where (relative to the start of the object) the |
| 492 | uninitialized memory was located, we have to look at the disassembly. For |
| 493 | that, we'll need the RIP address again: |
| 494 | |
| 495 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 |
| 496 | |
| 497 | $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: |
| 498 | ffffffff8104edc8: mov %r8,0x8(%r8) |
| 499 | ffffffff8104edcc: test %r10d,%r10d |
| 500 | ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> |
| 501 | ffffffff8104edd5: mov %rax,%rdx |
| 502 | ffffffff8104edd8: mov $0xc,%ecx |
| 503 | ffffffff8104eddd: mov %r13,%rdi |
| 504 | ffffffff8104ede0: mov $0x30,%eax |
| 505 | ffffffff8104ede5: mov %rdx,%rsi |
| 506 | ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) |
| 507 | ffffffff8104edea: test $0x2,%al |
| 508 | ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> |
| 509 | ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) |
| 510 | ffffffff8104edf0: test $0x1,%al |
| 511 | ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> |
| 512 | ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) |
| 513 | ffffffff8104edf5: mov %r8,%rdi |
| 514 | ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> |
| 515 | |
| 516 | As expected, it's the "rep movsl" instruction from the memcpy() that causes |
| 517 | the warning. We know that REP MOVSL uses the register RCX to count the number |
| 518 | of remaining iterations. By taking a look at the register dump again (from the |
| 519 | kmemcheck report), we can figure out how many iterations were left when the |
| 520 | warning was triggered: |
| 521 | |
| 522 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 |
| 523 | |
| 524 | By looking at the disassembly, we also see that %ecx is being loaded with the |
| 525 | value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind |
| 526 | that this is the number of iterations, not bytes. And since this is a "long" |
| 527 | operation, we need to multiply by 4 to get the number of bytes. So this means |
| 528 | that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes |
| 529 | from the start of the object. |
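
The arithmetic above can be written down as a small helper. This is a
userspace sketch with a hypothetical function name; the inputs are the
initial count loaded into %ecx (0xc), the RCX value from the register dump
(0x9) and the element size of MOVSL (4 bytes).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte offset, relative to the start of the copy, at which a REP MOVS
 * instruction was interrupted: completed iterations times element size.
 */
static uint64_t rep_movs_fault_offset(uint64_t initial_count,
				      uint64_t remaining_count,
				      uint64_t element_size)
{
	assert(remaining_count <= initial_count);
	return (initial_count - remaining_count) * element_size;
}
```

Calling rep_movs_fault_offset(0xc, 0x9, 4) gives 12, the offset used in the
rest of this section.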
| 530 | |
| 531 | We can now try to figure out which field of the "struct siginfo" was not |
| 532 | initialized. This is the beginning of the struct: |
| 533 | |
| 534 | 40 typedef struct siginfo { |
| 535 | 41 int si_signo; |
| 536 | 42 int si_errno; |
| 537 | 43 int si_code; |
| 538 | 44 |
| 539 | 45 union { |
| 540 | .. |
| 541 | 92 } _sifields; |
| 542 | 93 } siginfo_t; |
| 543 | |
| 544 | On 64-bit, an int is 4 bytes long, so the three ints fill bytes 0-11 and it must |
| 545 | be the union member that has not been initialized. We can verify this using gdb: |
| 546 | |
| 547 | $ gdb vmlinux |
| 548 | ... |
| 549 | (gdb) p &((struct siginfo *) 0)->_sifields |
| 550 | $1 = (union {...} *) 0x10 |
| 551 | |
| 552 | Actually, it seems that the union member is located at offset 0x10 -- which |
| 553 | means that gcc has inserted 4 bytes of padding between the members si_code |
| 554 | and _sifields. We can now get a fuller picture of the memory dump: |
| 555 | |
| 556 | _----------------------------=> si_code |
| 557 | / _--------------------=> (padding) |
| 558 | | / _------------=> _sifields(._kill._pid) |
| 559 | | | / _----=> _sifields(._kill._uid) |
| 560 | | | | / |
| 561 | -------|-------|-------|-------| |
| 562 | 80000000000000000000000000000000000000000088ffff0000000000000000 |
| 563 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u |
| 564 | |
| 565 | This allows us to realize another important fact: si_code contains the value |
| 566 | 0x80. Remember that x86 is little-endian, so the first 4 bytes "80000000" are |
| 567 | really the number 0x00000080. With a bit of research, we find that this is |
| 568 | actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: |
| 569 | |
| 570 | 144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ |
| 571 | |
| 572 | This macro is used in exactly one place in the x86 kernel: in send_signal() |
| 573 | in kernel/signal.c: |
| 574 | |
| 575 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, |
| 576 | 817 int group) |
| 577 | 818 { |
| 578 | ... |
| 579 | 828 pending = group ? &t->signal->shared_pending : &t->pending; |
| 580 | ... |
| 581 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && |
| 582 | 852 (is_si_special(info) || |
| 583 | 853 info->si_code >= 0))); |
| 584 | 854 if (q) { |
| 585 | 855 list_add_tail(&q->list, &pending->list); |
| 586 | 856 switch ((unsigned long) info) { |
| 587 | ... |
| 588 | 865 case (unsigned long) SEND_SIG_PRIV: |
| 589 | 866 q->info.si_signo = sig; |
| 590 | 867 q->info.si_errno = 0; |
| 591 | 868 q->info.si_code = SI_KERNEL; |
| 592 | 869 q->info.si_pid = 0; |
| 593 | 870 q->info.si_uid = 0; |
| 594 | 871 break; |
| 595 | ... |
| 596 | 890 } |
| 597 | |
| 598 | Not only does this match with the .si_code member, it also matches the place |
| 599 | we found earlier when looking for where siginfo_t objects are enqueued on the |
| 600 | "shared_pending" list. |
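
The little-endian reading of the first four dump bytes used above can be
checked with a few lines of C. The helper name read_le32() is made up for
this sketch; it simply assembles four bytes, in the order they appear in the
dump, into a 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret four consecutive dump bytes as a little-endian 32-bit value. */
static uint32_t read_le32(const uint8_t *bytes)
{
	assert(bytes != NULL);
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

Feeding it the bytes 80 00 00 00 gives 0x80, the value of SI_KERNEL.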
| 601 | |
| 602 | So to sum up: It seems that it is the padding introduced by the compiler |
| 603 | between two struct fields that is uninitialized, and this gets reported when |
| 604 | we do a memcpy() on the struct. This means that we have identified a false |
| 605 | positive warning. |
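
The padding can also be demonstrated outside the kernel. The struct below is
a simplified stand-in, not the kernel's definition of struct siginfo: it only
assumes three leading ints followed by a union that contains a pointer, which
is what forces the union to pointer alignment. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the start of struct siginfo. */
struct demo_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		void *ptr;	/* a pointer member forces pointer alignment */
		int pad[4];
	} fields;
};

/* Number of padding bytes the compiler placed between si_code and fields. */
static size_t demo_padding_bytes(void)
{
	size_t end_of_si_code = offsetof(struct demo_siginfo, si_code) +
				sizeof(int);
	size_t start_of_fields = offsetof(struct demo_siginfo, fields);

	assert(start_of_fields >= end_of_si_code);
	return start_of_fields - end_of_si_code;
}
```

On a typical x86-64 build, where ints are 4 bytes and pointers are 8-byte
aligned, this would be expected to report 4 bytes of padding, matching the
0x10 offset seen in gdb. No assignment to a named member ever writes those
bytes, which is why a memcpy() of the whole struct reads uninitialized
memory.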
| 606 | |
| 607 | Normally, kmemcheck will not report uninitialized accesses in memcpy() calls |
| 608 | when both the source and destination addresses are tracked. (Instead, we copy |
| 609 | the shadow bytemap as well). In this case, the destination address clearly |
| 610 | was not tracked. We can dig a little deeper into the stack trace from above: |
| 611 | |
| 612 | arch/x86/kernel/signal.c:805 |
| 613 | arch/x86/kernel/signal.c:871 |
| 614 | arch/x86/kernel/entry_64.S:694 |
| 615 | |
| 616 | And we clearly see that the destination siginfo object is located on the |
| 617 | stack: |
| 618 | |
| 619 | 782 static void do_signal(struct pt_regs *regs) |
| 620 | 783 { |
| 621 | 784 struct k_sigaction ka; |
| 622 | 785 siginfo_t info; |
| 623 | ... |
| 624 | 804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); |
| 625 | ... |
| 626 | 854 } |
| 627 | |
| 628 | And this &info is what eventually gets passed to copy_siginfo() as the |
| 629 | destination argument. |
| 630 | |
| 631 | Now, even though we didn't find an actual error here, the example is still a |
| 632 | good one, because it shows how one would go about finding out what the report |
| 633 | was all about. |
| 634 | |
| 635 | |
| 636 | 3.4. Annotating false positives |
| 637 | =============================== |
| 638 | |
| 639 | There are a few different ways to make annotations in the source code that |
| 640 | will keep kmemcheck from checking and reporting certain allocations. Here |
| 641 | they are: |
| 642 | |
| 643 | o __GFP_NOTRACK_FALSE_POSITIVE |
| 644 | |
| 645 | This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore |
| 646 | also to other functions that end up calling one of these) to indicate |
| 647 | that the allocation should not be tracked because it would lead to |
| 648 | a false positive report. This is a "big hammer" way of silencing |
| 649 | kmemcheck; after all, even if the false positive pertains to a |
| 650 | particular field in a struct, for example, we will now lose the |
| 651 | ability to find (real) errors in other parts of the same struct. |
| 652 | |
| 653 | Example: |
| 654 | |
| 655 | /* No warnings will ever trigger on accessing any part of x */ |
| 656 | x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); |
| 657 | |
| 658 | o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and |
| 659 | kmemcheck_annotate_bitfield(ptr, name) |
| 660 | |
| 661 | The first two of these three macros can be used inside struct |
| 662 | definitions to signal, respectively, the beginning and end of a |
| 663 | bitfield. Additionally, this will assign the bitfield a name, which |
| 664 | is given as an argument to the macros. |
| 665 | |
| 666 | Having used these markers, one can later use |
| 667 | kmemcheck_annotate_bitfield() at the point of allocation, to indicate |
| 668 | which parts of the allocation are part of a bitfield. |
| 669 | |
| 670 | Example: |
| 671 | |
| 672 | struct foo { |
| 673 | int x; |
| 674 | |
| 675 | kmemcheck_bitfield_begin(flags); |
| 676 | int flag_a:1; |
| 677 | int flag_b:1; |
| 678 | kmemcheck_bitfield_end(flags); |
| 679 | |
| 680 | int y; |
| 681 | }; |
| 682 | |
| 683 | struct foo *x = kmalloc(sizeof *x, GFP_KERNEL); |
| 684 | |
| 685 | /* No warnings will trigger on accessing the bitfield of x */ |
| 686 | kmemcheck_annotate_bitfield(x, flags); |
| 687 | |
| 688 | Note that kmemcheck_annotate_bitfield() can be used even before the |
| 689 | return value of kmalloc() is checked -- in other words, passing NULL |
| 690 | as the first argument is legal (and will do nothing). |
| 691 | |
| 692 | |
| 693 | 4. Reporting errors |
| 694 | =================== |
| 695 | |
| 696 | As we have seen, kmemcheck will produce false positive reports. Therefore, it |
| 697 | is not very wise to blindly post kmemcheck warnings to mailing lists and |
| 698 | maintainers. Instead, I encourage maintainers and developers to find errors |
| 699 | in their own code. If you get a warning, you can try to work around it, try |
| 700 | to figure out if it's a real error or not, or simply ignore it. Most |
| 701 | developers know their own code and will quickly and efficiently determine the |
| 702 | root cause of a kmemcheck report. This is therefore also the most efficient |
| 703 | way to work with kmemcheck. |
| 704 | |
| 705 | That said, we (the kmemcheck maintainers) will always be on the lookout for |
| 706 | false positives that we can annotate and silence. So whatever you find, |
| 707 | please drop us a note privately! Kernel configs and steps to reproduce (if |
| 708 | available) are of course a great help too. |
| 709 | |
| 710 | Happy hacking! |
| 711 | |
| 712 | |
| 713 | 5. Technical description |
| 714 | ======================== |
| 715 | |
| 716 | kmemcheck works by marking memory pages non-present. This means that whenever |
| 717 | somebody attempts to access the page, a page fault is generated. The page |
| 718 | fault handler notices that the page was in fact only hidden, and so it calls |
| 719 | on the kmemcheck code to make further investigations. |
| 720 | |
| 721 | When the investigations are completed, kmemcheck "shows" the page by marking |
| 722 | it present (as it would be under normal circumstances). This way, the |
| 723 | interrupted code can continue as usual. |
| 724 | |
| 725 | But after the instruction has been executed, we should hide the page again, so |
| 726 | that we can catch the next access too! To do so, kmemcheck makes use of a debugging |
| 727 | feature of the processor, namely single-stepping. When the processor has |
| 728 | finished the one instruction that generated the memory access, a debug |
| 729 | exception is raised. From here, we simply hide the page again and continue |
| 730 | execution, this time with the single-stepping feature turned off. |
| 731 | |
| 732 | kmemcheck requires some assistance from the memory allocator in order to work. |
| 733 | The memory allocator needs to: |
| 734 | |
| 735 | 1. Tell kmemcheck about newly allocated pages and pages that are about to |
| 736 | be freed. This allows kmemcheck to set up and tear down the shadow memory |
| 737 | for the pages in question. The shadow memory stores the status of each |
| 738 | byte in the allocation proper, e.g. whether it is initialized or |
| 739 | uninitialized. |
| 740 | |
| 741 | 2. Tell kmemcheck which parts of memory should be marked uninitialized. |
| 742 | There are actually a few more states, such as "not yet allocated" and |
| 743 | "recently freed". |
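
The role of the shadow memory can be illustrated with a toy userspace model.
This is a sketch under simplifying assumptions, not kmemcheck's code: it
tracks a single fixed-size object, all names are hypothetical, and the page
fault and single-step machinery is replaced by explicit calls to
model_write() and model_read_ok().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Per-byte states, following the legend used in the reports. */
enum shadow_state {
	SHADOW_UNALLOCATED,
	SHADOW_UNINITIALIZED,
	SHADOW_INITIALIZED,
	SHADOW_FREED,
};

#define OBJECT_SIZE 32

struct tracked_object {
	unsigned char data[OBJECT_SIZE];
	unsigned char shadow[OBJECT_SIZE];	/* one state per data byte */
};

/* Allocator hook: a fresh allocation starts out uninitialized. */
static void model_alloc(struct tracked_object *obj)
{
	assert(obj != NULL);
	memset(obj->shadow, SHADOW_UNINITIALIZED, sizeof(obj->shadow));
}

/* Allocator hook: freed memory must not be read again. */
static void model_free(struct tracked_object *obj)
{
	assert(obj != NULL);
	memset(obj->shadow, SHADOW_FREED, sizeof(obj->shadow));
}

/* A write stores the data and marks the touched bytes as initialized. */
static void model_write(struct tracked_object *obj, size_t offset,
			const void *src, size_t size)
{
	assert(obj != NULL && offset + size <= OBJECT_SIZE);
	memcpy(obj->data + offset, src, size);
	memset(obj->shadow + offset, SHADOW_INITIALIZED, size);
}

/* A read is clean only if every byte it touches is initialized. */
static bool model_read_ok(const struct tracked_object *obj, size_t offset,
			  size_t size)
{
	size_t i;

	assert(obj != NULL && offset + size <= OBJECT_SIZE);
	for (i = 0; i < size; i++) {
		if (obj->shadow[offset + i] != SHADOW_INITIALIZED)
			return false;
	}
	return true;
}
```

In this model, writing an int at offset 0 makes a 4-byte read at offset 0
clean, while a read that extends one byte further is flagged. That is the
same situation as the struct padding example in section 3.3.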
| 744 | |
| 745 | If a slab cache is set up using the SLAB_NOTRACK flag, it will never return |
| 746 | memory that can take page faults because of kmemcheck. |
| 747 | |
| 748 | If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still |
| 749 | request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. |
| 750 | This does not prevent the page faults from occurring, however, but marks the |
| 751 | object in question as being initialized so that no warnings will ever be |
| 752 | produced for this object. |
| 753 | |
| 754 | Currently, the SLAB and SLUB allocators are supported by kmemcheck. |