Andi Kleen | 352f7ba | 2006-09-26 10:52:31 +0200 | 1 | Most of the text from Keith Owens, hacked by AK |
| 2 | |
| 3 | x86_64 page size (PAGE_SIZE) is 4K. |
| 4 | |
| 5 | Like all other architectures, x86_64 has a kernel stack for every |
| 6 | active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. |
| 7 | These stacks contain useful data as long as a thread is alive or a |
| 8 | zombie. While the thread is in user space the kernel stack is empty |
| 9 | except for the thread_info structure at the bottom. |
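The sizes above can be written out as a small sketch. It assumes that
each thread stack is aligned to THREAD_SIZE, which the text does not
state; stack_bottom() is a hypothetical helper name, not a kernel
function. Under that assumption, masking any address inside the stack
gives the lowest address of the stack, where thread_info sits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096UL            /* x86_64 page size (4K) */
#define THREAD_SIZE (2 * PAGE_SIZE)   /* per thread kernel stack */

/*
 * Hypothetical helper. Assumption: the stack is aligned to
 * THREAD_SIZE, so clearing the low bits of any address inside the
 * stack yields the bottom of the stack (where thread_info lives).
 */
static uintptr_t stack_bottom(uintptr_t sp)
{
	/* The mask trick only works for a power-of-two size. */
	assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
	return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}
```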
| 10 | |
| 11 | In addition to the per thread stacks, there are specialized stacks |
| 12 | associated with each cpu. These stacks are only used while the kernel |
| 13 | is in control on that cpu; when a cpu returns to user space the |
| 14 | specialized stacks contain no useful data. The main cpu stack is: |
| 15 | |
| 16 | * Interrupt stack. IRQSTACKSIZE |
| 17 | |
| 18 | Used for external hardware interrupts. If this is the first external |
| 19 | hardware interrupt (i.e. not a nested hardware interrupt) then the |
| 20 | kernel switches from the current task to the interrupt stack. Like |
| 21 | the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS), |
| 22 | this gives more room for kernel interrupt processing without having |
| 23 | to increase the size of every per thread stack. |
| 24 | |
| 25 | The interrupt stack is also used when processing a softirq. |
| 26 | |
| 27 | Switching to the kernel interrupt stack is done by software based on a |
| 28 | per CPU interrupt nest counter. This is needed because x86-64 "IST" |
| 29 | hardware stacks cannot nest without races. |
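The software switch driven by the nest counter can be sketched as
follows. This is a simplified model in plain C, not the kernel's
assembly entry code; struct cpu_irq_state, irq_enter_stack() and
irq_exit_stack() are hypothetical names, and the counter is assumed to
start at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per CPU state; the field names are illustrative only. */
struct cpu_irq_state {
	int       nest;           /* hardware interrupts in progress */
	uintptr_t irq_stack_top;  /* initial sp of the interrupt stack */
};

/* Return the stack pointer the interrupt handler should run on. */
static uintptr_t irq_enter_stack(struct cpu_irq_state *cpu, uintptr_t sp)
{
	assert(cpu->nest >= 0);
	if (cpu->nest++ == 0)
		return cpu->irq_stack_top;  /* first level: switch stacks */
	return sp;                          /* nested: stay where we are */
}

/* Undo one level of nesting when the handler returns. */
static void irq_exit_stack(struct cpu_irq_state *cpu)
{
	assert(cpu->nest > 0);
	cpu->nest--;
}
```

Only the outermost interrupt switches stacks; a nested interrupt keeps
running on the interrupt stack it already interrupted.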
| 30 | |
| 31 | x86_64 also has a feature which is not available on i386, the ability |
| 32 | to automatically switch to a new stack for designated events such as |
| 33 | double fault or NMI, which makes it easier to handle these unusual |
| 34 | events on x86_64. This feature is called the Interrupt Stack Table |
| 35 | (IST). There can be up to 7 IST entries per cpu. The IST code is an |
| 36 | index into the Task State Segment (TSS). The IST entries in the TSS |
| 37 | point to dedicated stacks; each stack can be a different size. |
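As an illustration of where the IST entries live, the sketch below
lays out a 64-bit TSS following the architecture manuals. The struct
and field names (tss64, ist_stack) are made up for this example and
are not the kernel's own definitions; the packed attribute is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_IST 7   /* up to 7 IST entries per cpu */

/* Illustrative layout of the 64-bit TSS; names are not the kernel's. */
struct tss64 {
	uint32_t reserved0;
	uint64_t rsp[3];       /* stack pointers for privilege levels 0-2 */
	uint64_t reserved1;
	uint64_t ist[N_IST];   /* IST1..IST7, each points to a stack */
	uint64_t reserved2;
	uint16_t reserved3;
	uint16_t iomap_base;
} __attribute__((packed));

/* Look up the stack pointer for a non-zero IST code (1..7). */
static uint64_t ist_stack(const struct tss64 *tss, unsigned int ist)
{
	assert(ist >= 1 && ist <= N_IST);
	return tss->ist[ist - 1];
}
```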
| 38 | |
| 39 | An IST is selected by a non-zero value in the IST field of an |
| 40 | interrupt-gate descriptor. When an interrupt occurs and the hardware |
| 41 | loads such a descriptor, the hardware automatically sets the new stack |
| 42 | pointer based on the IST value, then invokes the interrupt handler. If |
| 43 | software wants to allow nested IST interrupts then the handler must |
| 44 | adjust the IST values on entry to and exit from the interrupt handler. |
| 45 | (This is occasionally done, e.g. for debug exceptions.) |
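The selection step can be modelled in a few lines. In a 64-bit
interrupt-gate descriptor the IST field occupies bits 32-34 of the low
quadword. The function names below are hypothetical, and the model is
deliberately simplified: for an IST value of zero it just keeps the
current stack pointer and ignores the privilege-level stack switch the
hardware may still perform.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 3-bit IST field from the low 8 bytes of a gate descriptor. */
static unsigned int gate_ist_field(uint64_t gate_low)
{
	return (unsigned int)((gate_low >> 32) & 0x7);
}

/*
 * Hypothetical model of the hardware's choice of stack pointer.
 * ist_table[] stands for the seven IST entries in the TSS.
 */
static uint64_t select_stack(const uint64_t ist_table[7],
			     unsigned int gate_ist, uint64_t current_sp)
{
	assert(gate_ist <= 7);
	if (gate_ist == 0)
		return current_sp;          /* no IST: simplified, see above */
	return ist_table[gate_ist - 1];     /* IST n uses entry n-1 */
}
```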
| 46 | |
| 47 | Events with different IST codes (i.e. with different stacks) can be |
| 48 | nested. For example, a debug interrupt can safely be interrupted by an |
| 49 | NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack |
| 50 | pointers on entry to and exit from all IST events, in theory allowing |
| 51 | IST events with the same code to be nested. However, in most cases |
| 52 | the stack size allocated to an IST assumes no nesting for the same |
| 53 | code. If that assumption is ever broken, the stacks will be corrupted. |
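The adjustment on entry and exit can be pictured as moving the IST
entry down by one frame while a handler runs, so that a nested event
with the same code starts below the frame already in use. This is a
minimal sketch of the idea only; FRAME_SZ, ist_enter() and ist_exit()
are hypothetical, and the real entry code is written in assembly.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_SZ 4096UL  /* hypothetical space reserved per nesting level */

/* On entry: lower the IST entry so a nested event gets fresh space. */
static void ist_enter(uint64_t *ist_slot)
{
	assert(*ist_slot >= FRAME_SZ);
	*ist_slot -= FRAME_SZ;
}

/* On exit: restore the IST entry to its previous value. */
static void ist_exit(uint64_t *ist_slot)
{
	*ist_slot += FRAME_SZ;
}
```

The stack behind the entry must be large enough for every level of
nesting that can occur; otherwise the lowered pointer runs off the end
of the stack.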
| 54 | |
| 55 | The currently assigned IST stacks are: |
| 56 | |
| 57 | * STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). |
| 58 | |
| 59 | Used for interrupt 12 - Stack Fault Exception (#SS). |
| 60 | |
| 61 | This allows the kernel to recover from invalid stack segments. This |
| 62 | rarely happens. |
| 63 | |
| 64 | * DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). |
| 65 | |
| 66 | Used for interrupt 8 - Double Fault Exception (#DF). |
| 67 | |
| 68 | Invoked when handling an exception causes another exception. Happens |
| 69 | when the kernel is very confused (e.g. kernel stack pointer corrupt). |
| 70 | Using a separate stack allows the kernel to recover well enough in |
| 71 | many cases to still output an oops. |
| 72 | |
| 73 | * NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). |
| 74 | |
| 75 | Used for non-maskable interrupts (NMI). |
| 76 | |
| 77 | NMI can be delivered at any time, including when the kernel is in the |
| 78 | middle of switching stacks. Using IST for NMI events avoids making |
| 79 | assumptions about the previous state of the kernel stack. |
| 80 | |
| 81 | * DEBUG_STACK. DEBUG_STKSZ |
| 82 | |
| 83 | Used for hardware debug interrupts (interrupt 1) and for software |
| 84 | debug interrupts (INT3). |
| 85 | |
| 86 | When debugging a kernel, debug interrupts (both hardware and |
| 87 | software) can occur at any time. Using IST for these interrupts |
| 88 | avoids making assumptions about the previous state of the kernel |
| 89 | stack. |
| 90 | |
| 91 | * MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). |
| 92 | |
| 93 | Used for interrupt 18 - Machine Check Exception (#MC). |
| 94 | |
| 95 | MCE can be delivered at any time, including when the kernel is in the |
| 96 | middle of switching stacks. Using IST for MCE events avoids making |
| 97 | assumptions about the previous state of the kernel stack. |
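The assignments in the list above can be summarised as a mapping from
interrupt vector to IST stack. The stack names come from the list; the
numeric enum values, IST_NONE and ist_for_vector() are assumptions
made for this sketch. Vector 2 for NMI and vector 3 for INT3 are the
architectural vector numbers, which the list does not spell out.

```c
#include <assert.h>

/* Stack names from the list above; the numeric values are assumptions. */
enum ist_stack_id {
	IST_NONE = 0,        /* hypothetical: event does not use an IST */
	STACKFAULT_STACK,
	DOUBLEFAULT_STACK,
	NMI_STACK,
	DEBUG_STACK,
	MCE_STACK
};

/* Hypothetical helper: which IST stack handles a given vector? */
static enum ist_stack_id ist_for_vector(unsigned int vector)
{
	assert(vector < 256);
	switch (vector) {
	case 1:              /* hardware debug interrupt */
	case 3:              /* software debug interrupt (INT3) */
		return DEBUG_STACK;
	case 2:              /* non-maskable interrupt */
		return NMI_STACK;
	case 8:              /* double fault (#DF) */
		return DOUBLEFAULT_STACK;
	case 12:             /* stack fault (#SS) */
		return STACKFAULT_STACK;
	case 18:             /* machine check (#MC) */
		return MCE_STACK;
	default:
		return IST_NONE;
	}
}
```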
| 98 | |
| 99 | For more details see the Intel IA32 or AMD AMD64 architecture manuals. |