-*-Mode: outline-*-

Light-weight System Calls for IA-64
-----------------------------------

Started: 13-Jan-2003
Last update: 27-Sep-2003

David Mosberger-Tang
<davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel. We call this mode
"fsys-mode". To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory. The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory. The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state which all
        interruption-handlers start execution in. The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can prevent preemption by disabling interrupts and avoiding all
    other interruption-sources)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode. However, because execution is
at privilege level 0, fsys-mode requires some care (see below).


* How to tell fsys-mode

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet. For convenience, the header file <asm-ia64/ptrace.h> provides
three macros:

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure. The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs. user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s). Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression:

        !user_mode(regs) && user_stack(task,regs)

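The relationship between the three macros can be illustrated with a
small stand-alone C model. This is only a sketch: struct toy_regs and
the toy_*() functions are hypothetical names invented for this
example, the "task" argument is dropped, and the real macros in
<asm-ia64/ptrace.h> operate on the actual pt_regs layout rather than
on two ready-made fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the saved CPU state (NOT the real pt_regs). */
struct toy_regs {
    unsigned cpl;        /* saved privilege level: 0 = most privileged, 3 = user */
    bool on_user_stack;  /* true if the stacks were not yet switched to kernel memory */
};

/* Models user_mode(): the state was executing at privilege level 3. */
static bool toy_user_mode(const struct toy_regs *regs)
{
    assert(regs != NULL);
    return regs->cpl == 3;
}

/* Models user_stack(): the state was executing on the user-level stack(s). */
static bool toy_user_stack(const struct toy_regs *regs)
{
    assert(regs != NULL);
    return regs->on_user_stack;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
static bool toy_fsys_mode(const struct toy_regs *regs)
{
    return !toy_user_mode(regs) && toy_user_stack(regs);
}
```

In this model, a state with cpl 0 and on_user_stack set is the only
combination for which toy_fsys_mode() returns true. Plain user mode
fails the first test and full kernel mode fails the second.
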
* How to write an fsyscall handler

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table). This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall(). This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler. For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler. For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit state of an fsyscall handler are as follows:

** Machine state on entry to fsyscall handler:

  - r10     = 0
  - r11     = saved ar.pfs (a user-level value)
  - r15     = system call number
  - r16     = "current" task pointer (in normal kernel-mode, this is in r13)
  - r32-r39 = system call arguments
  - b6      = return address (a user-level value)
  - ar.pfs  = previous frame-state (a user-level value)
  - PSR.be  = cleared to zero (i.e., little-endian byte order is in effect)
  - all other registers may contain values passed in from user-mode

** Required machine state on exit from fsyscall handler:

  - r11     = saved ar.pfs (as passed into the fsyscall handler)
  - r15     = system call number (as passed into the fsyscall handler)
  - r32-r39 = system call arguments (as passed into the fsyscall handler)
  - b6      = return address (as passed into the fsyscall handler)
  - ar.pfs  = previous frame-state (as passed into the fsyscall handler)

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

  o Fsyscall-handlers MUST check for any pending work in the flags
    member of the thread-info structure and if any of the
    TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
    doing a full system call (by calling fsys_fallback_syscall).

  o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
    r15, b6, and ar.pfs) because they will be needed in case of a
    system call restart. Of course, all "preserved" registers also
    must be preserved, in accordance with the normal calling conventions.

  o Fsyscall-handlers MUST check whether argument registers contain a
    NaT value before using them in any way that could trigger a
    NaT-consumption fault. If a system call argument is found to
    contain a NaT value, an fsyscall-handler may return immediately
    with r8=EINVAL, r10=-1.

  o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
    any other operation that would trigger mandatory RSE
    (register-stack engine) traffic.

  o Fsyscall-handlers MUST NOT write to any stacked registers because
    it is not safe to assume that user-level called a handler with the
    proper number of arguments.

  o Fsyscall-handlers need to be careful when accessing per-CPU variables:
    unless proper safe-guards are taken (e.g., interruptions are avoided),
    execution may be pre-empted and resumed on another CPU at any given
    time.

  o Fsyscall-handlers must be careful not to leak sensitive kernel
    information back to user-level. In particular, before returning to
    user-level, care needs to be taken to clear any scratch registers
    that could contain sensitive information (note that the current
    task pointer is not considered sensitive: it's already exposed
    through ar.k6).

  o Fsyscall-handlers MUST NOT access user-memory without first
    validating access-permission (this can typically be done via
    probe.r.fault and/or probe.w.fault) and without guarding against
    memory access exceptions (this can be done with the EX() macros
    defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead. For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed. In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

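The order of the first checks in a handler (pending work, then NaT
arguments, then the actual work) can be sketched as a C model. Real
handlers are written in ia64 assembly in fsys.S. Everything named
toy_* or TOY_* here is hypothetical, the mask value is made up, and
the "handler" merely echoes its first argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value; the real TIF_ALLWORK_MASK is defined by the kernel. */
#define TOY_ALLWORK_MASK 0x0fu

enum toy_outcome { TOY_DONE, TOY_FALLBACK };

/* Models the parts of the entry state that the checks look at. */
struct toy_entry {
    uint32_t thread_flags;  /* models the flags member of thread-info */
    int64_t  arg0;          /* models r32 */
    bool     arg0_is_nat;   /* models the NaT bit of r32 */
};

/* Models the result registers on exit. */
struct toy_exit {
    enum toy_outcome outcome;
    int64_t r8;    /* result, or error code when r10 == -1 */
    int64_t r10;   /* 0 on success, -1 on error */
};

/* Hypothetical handler that returns its first argument unchanged. */
static struct toy_exit toy_fsys_handler(const struct toy_entry *in)
{
    struct toy_exit out = { TOY_DONE, 0, 0 };  /* r10 = 0 on entry */

    assert(in != NULL);

    /* Pending work: give up and take the full system call path. */
    if (in->thread_flags & TOY_ALLWORK_MASK) {
        out.outcome = TOY_FALLBACK;  /* stands in for fsys_fallback_syscall */
        return out;
    }

    /* NaT argument: return immediately with r8=EINVAL, r10=-1. */
    if (in->arg0_is_nat) {
        out.r8 = EINVAL;
        out.r10 = -1;
        return out;
    }

    /* The input is const, mirroring "incoming arguments are preserved". */
    out.r8 = in->arg0;
    return out;
}
```

The model leaves out the rules that have no C equivalent (no "alloc",
no writes to stacked registers, clearing scratch registers, probing
user memory); those only make sense at the assembly level.
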
* Signal handling

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited. This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately. When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur. The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

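The deferral sequence can be modelled as a tiny state machine in C.
This is a sketch of the control flow only: struct toy_task and both
functions are hypothetical, and the real logic lives in
do_notify_resume_user() and the lower-privilege transfer trap handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state, reduced to what the sequence needs. */
struct toy_task {
    bool in_fsys_mode;    /* models fsys_mode(task, regs) being true */
    bool psr_lp;          /* models the PSR.lp bit */
    bool signal_pending;  /* a signal is waiting to be delivered */
    int  delivered;       /* number of signals delivered so far */
};

/* Models the check made in do_notify_resume_user(). */
static void toy_notify_resume(struct toy_task *t)
{
    assert(t != NULL);
    if (t->in_fsys_mode) {
        t->psr_lp = true;  /* arm the lower-privilege transfer trap */
        return;            /* and defer delivery */
    }
    if (t->signal_pending) {
        t->signal_pending = false;
        t->delivered++;
    }
}

/* Models the "br.ret" that lowers the privilege level. */
static void toy_leave_fsys(struct toy_task *t)
{
    assert(t != NULL);
    t->in_fsys_mode = false;
    if (t->psr_lp) {
        t->psr_lp = false;     /* the trap handler clears PSR.lp again */
        toy_notify_resume(t);  /* the exit path delivers pending signals */
    }
}
```

Calling toy_notify_resume() while in_fsys_mode is set only arms
psr_lp; the signal is counted as delivered after toy_leave_fsys()
runs.
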
* PSR Handling

The "epc" instruction doesn't change the contents of PSR at all. This
is in contrast to a regular interruption, which clears almost all
bits. Because of that, some care needs to be taken to ensure things
work as expected. The following discussion describes how each PSR bit
is handled.

PSR.be  Cleared when entering fsys-mode. A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed. PSR.be is normally NOT
        restored upon return from an fsys-mode handler. In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.mfh Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.ic  Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.dfh Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged. The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than 3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect. If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3. Note: the
        taken-branch trap would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe. If the system call number is invalid,
        the fsys-mode handler will return directly to user-level. This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect. If set, "epc" will cause a Single Step Trap to
        be taken. The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged. Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted. If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged. Note: the ia64 linux kernel never sets this bit.

* Using fast system calls

To use fast system calls, userspace applications need only call
__kernel_syscall_via_epc(). For example:

-- example fgettimeofday() call --
-- fgettimeofday.S --

#include <asm/asmmacro.h>

GLOBAL_ENTRY(fgettimeofday)
.prologue
.save ar.pfs, r11
mov r11 = ar.pfs
.body

movl r2 = 0xa000000000020660;;  // gate address (movl takes a 64-bit immediate)
                                // found by inspection of System.map for the
                                // __kernel_syscall_via_epc() function. See
                                // below for how to do this for real.

mov b7 = r2
mov r15 = 1087                  // gettimeofday syscall
;;
br.call.sptk.many b6 = b7
;;

.restore sp

mov ar.pfs = r11
br.ret.sptk.many rp;;           // return to caller
END(fgettimeofday)

-- end fgettimeofday.S --

In reality, the gate address is obtained from two extra values passed
via the ELF auxiliary vector (include/asm-ia64/elf.h):

  o AT_SYSINFO      : is the address of __kernel_syscall_via_epc()
  o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page. It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.