s390: remove all code using the access register mode
The vdso code for the getcpu() and the clock_gettime() call use the access
register mode to access the per-CPU vdso data page with the current code.
An alternative to the complicated AR mode is to use the secondary space
mode. This makes the vdso faster and quite a bit simpler. The downside is
that the uaccess code has to be changed quite a bit.
Which instructions are used depends on the machine and what kind of uaccess
operation is requested. The instruction dictates which ASCE value needs
to be loaded into %cr1 and %cr7.
The different cases:
* User copy with MVCOS for z10 and newer machines
The MVCOS instruction can copy between the primary space (aka user) and
the home space (aka kernel) directly. For set_fs(KERNEL_DS) the kernel
ASCE is loaded into %cr1. For set_fs(USER_DS) the user space is already
loaded in %cr1.
* User copy with MVCP/MVCS for older machines
To be able to execute the MVCP/MVCS instructions the kernel needs to
switch to primary mode. The control register %cr1 has to be set to the
kernel ASCE and %cr7 to either the kernel ASCE or the user ASCE dependent
on set_fs(KERNEL_DS) vs set_fs(USER_DS).
* Data access in the user address space for strnlen / futex
To use "normal" instruction with data from the user address space the
secondary space mode is used. The kernel needs to switch to primary mode,
%cr1 has to contain the kernel ASCE and %cr7 either the user ASCE or the
kernel ASCE, dependent on set_fs.
To load a new value into %cr1 or %cr7 is an expensive operation, the kernel
tries to be lazy about it. E.g. for multiple user copies in a row with
MVCP/MVCS the replacement of the vdso ASCE in %cr7 with the user ASCE is
done only once. On return to user space a CPU bit is checked that loads the
vdso ASCE again.
To enable and disable the data access via the secondary space two new
functions are added, enable_sacf_uaccess and disable_sacf_uaccess. The fact
that a context is in secondary space uaccess mode is stored in the
mm_segment_t value for the task. The code of an interrupt may use set_fs
as long as it returns to the previous state it got with get_fs with another
call to set_fs. The code in finish_arch_post_lock_switch simply has to do a
set_fs with the current mm_segment_t value for the task.
For CPUs with MVCOS:
CPU running in | %cr1 ASCE | %cr7 ASCE |
--------------------------------------|-----------|-----------|
user space | user | vdso |
kernel, USER_DS, normal-mode | user | vdso |
kernel, USER_DS, normal-mode, lazy | user | user |
kernel, USER_DS, sacf-mode | kernel | user |
kernel, KERNEL_DS, normal-mode | kernel | vdso |
kernel, KERNEL_DS, normal-mode, lazy | kernel | kernel |
kernel, KERNEL_DS, sacf-mode | kernel | kernel |
For CPUs without MVCOS:
CPU running in | %cr1 ASCE | %cr7 ASCE |
--------------------------------------|-----------|-----------|
user space | user | vdso |
kernel, USER_DS, normal-mode | user | vdso |
kernel, USER_DS, normal-mode lazy | kernel | user |
kernel, USER_DS, sacf-mode | kernel | user |
kernel, KERNEL_DS, normal-mode | kernel | vdso |
kernel, KERNEL_DS, normal-mode, lazy | kernel | kernel |
kernel, KERNEL_DS, sacf-mode | kernel | kernel |
The lines with "lazy" refer to the state after a copy via the secondary
space with a delayed reload of %cr1 and %cr7.
There are three hardware address spaces that can cause a DAT exception,
primary, secondary and home space. The exception can be related to
four different fault types: user space fault, vdso fault, kernel fault,
and the gmap faults.
Dependent on the set_fs state and normal vs. sacf mode there are a number
of fault combinations:
1) user address space fault via the primary ASCE
2) gmap address space fault via the primary ASCE
3) kernel address space fault via the primary ASCE for machines with
MVCOS and set_fs(KERNEL_DS)
4) vdso address space faults via the secondary ASCE with an invalid
address while running in secondary space in problem state
5) user address space fault via the secondary ASCE for user-copy
based on the secondary space mode, e.g. futex_ops or strnlen_user
6) kernel address space fault via the secondary ASCE for user-copy
with secondary space mode with set_fs(KERNEL_DS)
7) kernel address space fault via the primary ASCE for user-copy
with secondary space mode with set_fs(USER_DS) on machines without
MVCOS.
8) kernel address space fault via the home space ASCE
Replace user_space_fault() with a new function get_fault_type() that
can distinguish all four different fault types.
With these changes the futex atomic ops from the kernel and the
strnlen_user will get a little bit slower, as well as the old style
uaccess with MVCP/MVCS. All user accesses based on MVCOS will be as
fast as before. On the positive side, the user space vdso code is a
lot faster and Linux ceases to use the complicated AR mode.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index be974b3..1465400 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -50,6 +50,13 @@
#define VM_FAULT_SIGNAL 0x080000
#define VM_FAULT_PFAULT 0x100000
+enum fault_type {
+ KERNEL_FAULT,
+ USER_FAULT,
+ VDSO_FAULT,
+ GMAP_FAULT,
+};
+
static unsigned long store_indication __read_mostly;
static int __init fault_init(void)
@@ -99,27 +106,34 @@ void bust_spinlocks(int yes)
}
/*
- * Returns the address space associated with the fault.
- * Returns 0 for kernel space and 1 for user space.
+ * Find out which address space caused the exception.
+ * Access register mode is impossible, ignore space == 3.
*/
-static inline int user_space_fault(struct pt_regs *regs)
+static inline enum fault_type get_fault_type(struct pt_regs *regs)
{
unsigned long trans_exc_code;
- /*
- * The lowest two bits of the translation exception
- * identification indicate which paging table was used.
- */
trans_exc_code = regs->int_parm_long & 3;
- if (trans_exc_code == 3) /* home space -> kernel */
- return 0;
- if (user_mode(regs))
- return 1;
- if (trans_exc_code == 2) /* secondary space -> set_fs */
- return current->thread.mm_segment.ar4;
- if (test_pt_regs_flag(regs, PIF_GUEST_FAULT))
- return 1;
- return 0;
+ if (likely(trans_exc_code == 0)) {
+ /* primary space exception */
+ if (IS_ENABLED(CONFIG_PGSTE) &&
+ test_pt_regs_flag(regs, PIF_GUEST_FAULT))
+ return GMAP_FAULT;
+ if (current->thread.mm_segment == USER_DS)
+ return USER_FAULT;
+ return KERNEL_FAULT;
+ }
+ if (trans_exc_code == 2) {
+ /* secondary space exception */
+ if (current->thread.mm_segment & 1) {
+ if (current->thread.mm_segment == USER_DS_SACF)
+ return USER_FAULT;
+ return KERNEL_FAULT;
+ }
+ return VDSO_FAULT;
+ }
+ /* home space exception -> access via kernel ASCE */
+ return KERNEL_FAULT;
}
static int bad_address(void *p)
@@ -204,20 +218,23 @@ static void dump_fault_info(struct pt_regs *regs)
break;
}
pr_cont("mode while using ");
- if (!user_space_fault(regs)) {
- asce = S390_lowcore.kernel_asce;
- pr_cont("kernel ");
- }
-#ifdef CONFIG_PGSTE
- else if (test_pt_regs_flag(regs, PIF_GUEST_FAULT)) {
- struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
- asce = gmap->asce;
- pr_cont("gmap ");
- }
-#endif
- else {
+ switch (get_fault_type(regs)) {
+ case USER_FAULT:
asce = S390_lowcore.user_asce;
pr_cont("user ");
+ break;
+ case VDSO_FAULT:
+ asce = S390_lowcore.vdso_asce;
+ pr_cont("vdso ");
+ break;
+ case GMAP_FAULT:
+ asce = ((struct gmap *) S390_lowcore.gmap)->asce;
+ pr_cont("gmap ");
+ break;
+ case KERNEL_FAULT:
+ asce = S390_lowcore.kernel_asce;
+ pr_cont("kernel ");
+ break;
}
pr_cont("ASCE.\n");
dump_pagetable(asce, regs->int_parm_long & __FAIL_ADDR_MASK);
@@ -273,7 +290,7 @@ static noinline void do_no_context(struct pt_regs *regs)
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
*/
- if (!user_space_fault(regs))
+ if (get_fault_type(regs) == KERNEL_FAULT)
printk(KERN_ALERT "Unable to handle kernel pointer dereference"
" in virtual kernel address space\n");
else
@@ -395,12 +412,11 @@ static noinline void do_fault_error(struct pt_regs *regs, int access, int fault)
*/
static inline int do_exception(struct pt_regs *regs, int access)
{
-#ifdef CONFIG_PGSTE
struct gmap *gmap;
-#endif
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
+ enum fault_type type;
unsigned long trans_exc_code;
unsigned long address;
unsigned int flags;
@@ -425,8 +441,19 @@ static inline int do_exception(struct pt_regs *regs, int access)
* user context.
*/
fault = VM_FAULT_BADCONTEXT;
- if (unlikely(!user_space_fault(regs) || faulthandler_disabled() || !mm))
+ type = get_fault_type(regs);
+ switch (type) {
+ case KERNEL_FAULT:
goto out;
+ case VDSO_FAULT:
+ fault = VM_FAULT_BADMAP;
+ goto out;
+ case USER_FAULT:
+ case GMAP_FAULT:
+ if (faulthandler_disabled() || !mm)
+ goto out;
+ break;
+ }
address = trans_exc_code & __FAIL_ADDR_MASK;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
@@ -437,10 +464,9 @@ static inline int do_exception(struct pt_regs *regs, int access)
flags |= FAULT_FLAG_WRITE;
down_read(&mm->mmap_sem);
-#ifdef CONFIG_PGSTE
- gmap = test_pt_regs_flag(regs, PIF_GUEST_FAULT) ?
- (struct gmap *) S390_lowcore.gmap : NULL;
- if (gmap) {
+ gmap = NULL;
+ if (IS_ENABLED(CONFIG_PGSTE) && type == GMAP_FAULT) {
+ gmap = (struct gmap *) S390_lowcore.gmap;
current->thread.gmap_addr = address;
current->thread.gmap_write_flag = !!(flags & FAULT_FLAG_WRITE);
current->thread.gmap_int_code = regs->int_code & 0xffff;
@@ -452,7 +478,6 @@ static inline int do_exception(struct pt_regs *regs, int access)
if (gmap->pfault_enabled)
flags |= FAULT_FLAG_RETRY_NOWAIT;
}
-#endif
retry:
fault = VM_FAULT_BADMAP;
@@ -507,15 +532,14 @@ static inline int do_exception(struct pt_regs *regs, int access)
regs, address);
}
if (fault & VM_FAULT_RETRY) {
-#ifdef CONFIG_PGSTE
- if (gmap && (flags & FAULT_FLAG_RETRY_NOWAIT)) {
+ if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
+ (flags & FAULT_FLAG_RETRY_NOWAIT)) {
/* FAULT_FLAG_RETRY_NOWAIT has been set,
* mmap_sem has not been released */
current->thread.gmap_pfault = 1;
fault = VM_FAULT_PFAULT;
goto out_up;
}
-#endif
/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
* of starvation. */
flags &= ~(FAULT_FLAG_ALLOW_RETRY |
@@ -525,8 +549,7 @@ static inline int do_exception(struct pt_regs *regs, int access)
goto retry;
}
}
-#ifdef CONFIG_PGSTE
- if (gmap) {
+ if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
address = __gmap_link(gmap, current->thread.gmap_addr,
address);
if (address == -EFAULT) {
@@ -538,7 +561,6 @@ static inline int do_exception(struct pt_regs *regs, int access)
goto out_up;
}
}
-#endif
fault = 0;
out_up:
up_read(&mm->mmap_sem);