Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- the rest of MM
- KASAN updates
- procfs updates
- exit, fork updates
- printk updates
- lib/ updates
- radix-tree testsuite updates
- checkpatch updates
- kprobes updates
- a few other misc bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
samples/kprobes: print out the symbol name for the hooks
samples/kprobes: add a new module parameter
kprobes: add the "tls" argument for j_do_fork
init/main.c: simplify initcall_blacklisted()
fs/efs/super.c: fix return value
checkpatch: improve --git <commit-count> shortcut
checkpatch: reduce number of `git log` calls with --git
checkpatch: add support to check already applied git commits
checkpatch: add --list-types to show message types to show or ignore
checkpatch: advertise the --fix and --fix-inplace options more
checkpatch: whine about ACCESS_ONCE
checkpatch: add test for keywords not starting on tabstops
checkpatch: improve CONSTANT_COMPARISON test for structure members
checkpatch: add PREFER_IS_ENABLED test
lib/GCD.c: use binary GCD algorithm instead of Euclidean
radix-tree: free up the bottom bit of exceptional entries for reuse
dax: move RADIX_DAX_ definitions to dax.c
radix-tree: make radix_tree_descend() more useful
radix-tree: introduce radix_tree_replace_clear_tags()
radix-tree: tidy up __radix_tree_create()
...
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 2e69e83..4518d30 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -166,3 +166,12 @@
The mm_stat file is read-only and represents device's mm
statistics (orig_data_size, compr_data_size, etc.) in a format
similar to block layer statistics file format.
+
+What: /sys/block/zram<id>/debug_stat
+Date: July 2016
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The debug_stat file is read-only and represents various
+ device debugging info useful for kernel developers. Its
+ format is not documented intentionally and may change
+ anytime without any notice.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 5bda503..13100fb 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -59,27 +59,16 @@
pre-created. Default: 1.
2) Set max number of compression streams
- Compression backend may use up to max_comp_streams compression streams,
- thus allowing up to max_comp_streams concurrent compression operations.
- By default, compression backend uses single compression stream.
+ Regardless of the value passed to this attribute, ZRAM will always
+ allocate multiple compression streams - one per online CPU - thus
+ allowing several concurrent compression operations. The number of
+ allocated compression streams goes down when some of the CPUs
+ become offline. There is no single-compression-stream mode anymore,
+ unless you are running a UP system or have only 1 CPU online.
- Examples:
- #show max compression streams number
+ To find out how many streams are currently available:
cat /sys/block/zram0/max_comp_streams
- #set max compression streams number to 3
- echo 3 > /sys/block/zram0/max_comp_streams
-
-Note:
-In order to enable compression backend's multi stream support max_comp_streams
-must be initially set to desired concurrency level before ZRAM device
-initialisation. Once the device initialised as a single stream compression
-backend (max_comp_streams equals to 1), you will see error if you try to change
-the value of max_comp_streams because single stream compression backend
-implemented as a special case by lock overhead issue and does not support
-dynamic max_comp_streams. Only multi stream backend supports dynamic
-max_comp_streams adjustment.
-
3) Select compression algorithm
Using comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms,
@@ -183,6 +172,7 @@
pages_compacted RO the number of pages freed during compaction
(available only via zram<id>/mm_stat node)
compact WO trigger memory compaction
+debug_stat RO this file is used for zram debugging purposes
WARNING
=======
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@
TracerPid PID of process tracing this process (0 if not)
Uid Real, effective, saved set, and file system UIDs
Gid Real, effective, saved set, and file system GIDs
+ Umask file mode creation mask
FDSize number of file descriptor slots currently allocated
Groups supplementary group list
NStgid descendant namespace thread group ID hierarchy
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index fb0e1f2..7c871d6 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -340,7 +340,7 @@
== Graceful fallback ==
-Code walking pagetables but unware about huge pmds can simply call
+Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
@@ -414,7 +414,7 @@
map/unmap of the whole compound page.
We set PG_double_map when a PMD of the page got split for the first time,
-but still have PMD mapping. The addtional references go away with last
+but still have PMD mapping. The additional references go away with last
compound_mapcount.
split_huge_page internally has to distribute the refcounts in the head
@@ -432,10 +432,10 @@
We safe against physical memory scanners too: the only legitimate way
scanner can get reference to a page is get_page_unless_zero().
-All tail pages has zero ->_refcount until atomic_add(). It prevent scanner
-from geting reference to tail page up to the point. After the atomic_add()
-we don't care about ->_refcount value. We already known how many references
-with should uncharge from head page.
+All tail pages have zero ->_refcount until atomic_add(). This prevents the
+scanner from getting a reference to the tail page up to that point. After the
+atomic_add() we don't care about the ->_refcount value. We already know how
+many references should be uncharged from the head page.
For head page get_page_unless_zero() will succeed and we don't mind. It's
clear where reference should go after split: it will stay on head page.
diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.txt
new file mode 100644
index 0000000..38e4dac
--- /dev/null
+++ b/Documentation/vm/z3fold.txt
@@ -0,0 +1,26 @@
+z3fold
+------
+
+z3fold is a special purpose allocator for storing compressed pages.
+It is designed to store up to three compressed pages per physical page.
+It is a zbud derivative which allows for a higher compression
+ratio while keeping the simplicity and determinism of its predecessor.
+
+The main differences between z3fold and zbud are:
+* unlike zbud, z3fold allows for up to PAGE_SIZE allocations
+* z3fold can hold up to 3 compressed pages in its page
+* z3fold doesn't export any API itself and is thus intended to be used
+ via the zpool API.
+
+To keep the determinism and simplicity, z3fold, just like zbud, always
+stores an integral number of compressed pages per page, but it can store
+up to 3 pages, unlike zbud, which can store at most 2. Therefore the
+compression ratio goes to around 2.7x while zbud's is around 1.7x.
+
+Unlike zbud (but like zsmalloc for that matter) z3fold_alloc() does not
+return a dereferenceable pointer. Instead, it returns an unsigned long
+handle which encodes actual location of the allocated object.
+
+While keeping the effective compression ratio close to zsmalloc's, z3fold
+doesn't depend on the MMU being enabled and provides more predictable reclaim
+behavior, which makes it a better fit for small and response-critical systems.
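
Since z3fold is only reachable through zpool, a rough consumer sketch may help;
the zpool calls below exist in this era but their exact signatures vary between
kernel versions, so treat this as an approximation rather than code from this
series:

  #include <linux/zpool.h>
  #include <linux/gfp.h>
  #include <linux/string.h>
  #include <linux/errno.h>

  /* Store one buffer through zpool with the z3fold backend selected by name. */
  static int z3fold_store_example(const void *src, size_t len)
  {
          struct zpool *pool;
          unsigned long handle;   /* opaque handle, not a dereferenceable pointer */
          void *dst;
          int err;

          pool = zpool_create_pool("z3fold", "example", GFP_KERNEL, NULL);
          if (!pool)
                  return -ENOMEM;

          err = zpool_malloc(pool, len, GFP_KERNEL, &handle);
          if (err)
                  goto out;

          /* The handle must be mapped before the object can be written. */
          dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);
          memcpy(dst, src, len);
          zpool_unmap_handle(pool, handle);

          zpool_free(pool, handle);
  out:
          zpool_destroy_pool(pool);
          return err;
  }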
diff --git a/MAINTAINERS b/MAINTAINERS
index 49d1e83..ed1229e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6264,7 +6264,7 @@
F: arch/*/include/asm/kasan.h
F: arch/*/mm/kasan_init*
F: Documentation/kasan.txt
-F: include/linux/kasan.h
+F: include/linux/kasan*.h
F: lib/test_kasan.c
F: mm/kasan/
F: scripts/Makefile.kasan
@@ -8280,7 +8280,6 @@
OPENRISC ARCHITECTURE
M: Jonas Bonn <jonas@southpole.se>
W: http://openrisc.net
-L: linux@lists.openrisc.net (moderated for non-subscribers)
S: Maintained
T: git git://openrisc.net/~jonas/linux
F: arch/openrisc/
@@ -8401,7 +8400,6 @@
PANASONIC MN10300/AM33/AM34 PORT
M: David Howells <dhowells@redhat.com>
-M: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
L: linux-am33-list@redhat.com (moderated for non-subscribers)
W: ftp://ftp.redhat.com/pub/redhat/gnupro/AM33/
S: Maintained
@@ -8835,7 +8833,6 @@
PIN CONTROLLER - ST SPEAR
M: Viresh Kumar <vireshk@kernel.org>
-L: spear-devel@list.st.com
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
W: http://www.st.com/spear
S: Maintained
@@ -10040,7 +10037,6 @@
SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) ST SPEAR DRIVER
M: Viresh Kumar <vireshk@kernel.org>
-L: spear-devel@list.st.com
L: linux-mmc@vger.kernel.org
S: Maintained
F: drivers/mmc/host/sdhci-spear.c
@@ -10603,7 +10599,6 @@
SPEAR PLATFORM SUPPORT
M: Viresh Kumar <vireshk@kernel.org>
M: Shiraz Hashim <shiraz.linux.kernel@gmail.com>
-L: spear-devel@list.st.com
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
W: http://www.st.com/spear
S: Maintained
@@ -10612,7 +10607,6 @@
SPEAR CLOCK FRAMEWORK SUPPORT
M: Viresh Kumar <vireshk@kernel.org>
-L: spear-devel@list.st.com
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
W: http://www.st.com/spear
S: Maintained
diff --git a/arch/Kconfig b/arch/Kconfig
index 81869a5..b16e74e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -187,7 +187,11 @@
config HAVE_KPROBES_ON_FTRACE
bool
+config HAVE_NMI
+ bool
+
config HAVE_NMI_WATCHDOG
+ depends on HAVE_NMI
bool
#
# An arch should select this if it provides all these things:
@@ -517,6 +521,11 @@
- ARCH_MMAP_RND_BITS_MIN
- ARCH_MMAP_RND_BITS_MAX
+config HAVE_EXIT_THREAD
+ bool
+ help
+ An architecture implements exit_thread.
+
config ARCH_MMAP_RND_BITS_MIN
int
@@ -638,4 +647,7 @@
config ARCH_NO_COHERENT_DMA_MMAP
bool
+config CPU_NO_EFFICIENT_FFS
+ def_bool n
+
source "kernel/gcov/Kconfig"
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index fe99f89..7f312d8 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -26,6 +26,7 @@
select MODULES_USE_ELF_RELA
select ODD_RT_SIGACTION
select OLD_SIGSUSPEND
+ select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
help
The Alpha is a 64-bit general-purpose processor designed and
marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 84d1326..b483156 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -210,14 +210,6 @@
}
EXPORT_SYMBOL(start_thread);
-/*
- * Free current thread data structures etc..
- */
-void
-exit_thread(void)
-{
-}
-
void
flush_thread(void)
{
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 8894f7e..0dcbacf 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -107,6 +107,7 @@
config ISA_ARCOMPACT
bool "ARCompact ISA"
+ select CPU_NO_EFFICIENT_FFS
help
The original ARC ISA of ARC600/700 cores
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index a3f750e..b5db9e7 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -183,13 +183,6 @@
{
}
-/*
- * Free any architecture-specific thread data structures, etc.
- */
-void exit_thread(void)
-{
-}
-
int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpu)
{
return 0;
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index b99d25b..90542db 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -50,6 +50,7 @@
select HAVE_DMA_CONTIGUOUS if MMU
select HAVE_DYNAMIC_FTRACE if (!XIP_KERNEL) && !CPU_ENDIAN_BE32 && MMU
select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) && MMU
+ select HAVE_EXIT_THREAD
select HAVE_FTRACE_MCOUNT_RECORD if (!XIP_KERNEL)
select HAVE_FUNCTION_GRAPH_TRACER if (!THUMB2_KERNEL)
select HAVE_FUNCTION_TRACER if (!XIP_KERNEL)
@@ -66,6 +67,7 @@
select HAVE_KRETPROBES if (HAVE_KPROBES)
select HAVE_MEMBLOCK
select HAVE_MOD_ARCH_SPECIFIC
+ select HAVE_NMI
select HAVE_OPROFILE if (HAVE_PERF_EVENTS)
select HAVE_OPTPROBES if !THUMB2_KERNEL
select HAVE_PERF_EVENTS
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 4adfb46..a647d66 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -193,9 +193,9 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- thread_notify(THREAD_NOTIFY_EXIT, current_thread_info());
+ thread_notify(THREAD_NOTIFY_EXIT, task_thread_info(tsk));
}
void flush_thread(void)
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index baee702..df90bc5 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -644,9 +644,11 @@
break;
case IPI_CPU_BACKTRACE:
+ printk_nmi_enter();
irq_enter();
nmi_cpu_backtrace(regs);
irq_exit();
+ printk_nmi_exit();
break;
default:
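
The printk_nmi_enter()/printk_nmi_exit() bracketing added here comes from the
NMI-safe printk work in this batch: between the two calls, printk() writes into
a per-CPU buffer rather than taking console locks, and the buffered text is
flushed later from a safe context. A sketch of the calling convention only -
the handler itself is illustrative, not code from this series:

  #include <linux/printk.h>
  #include <linux/nmi.h>
  #include <linux/ptrace.h>

  /* Hypothetical NMI-context handler showing the enter/exit bracketing. */
  static void example_nmi_backtrace(struct pt_regs *regs)
  {
          printk_nmi_enter();     /* printk() now fills a per-CPU NMI buffer */
          nmi_cpu_backtrace(regs);
          printk_nmi_exit();      /* buffered lines are flushed from a safe context */
  }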
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index 5534766..cb569b6 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -421,18 +421,21 @@
select CPU_USE_DOMAINS if MMU
select NEED_KUSER_HELPERS
select TLS_REG_EMUL if SMP || !MMU
+ select CPU_NO_EFFICIENT_FFS
config CPU_32v4
bool
select CPU_USE_DOMAINS if MMU
select NEED_KUSER_HELPERS
select TLS_REG_EMUL if SMP || !MMU
+ select CPU_NO_EFFICIENT_FFS
config CPU_32v4T
bool
select CPU_USE_DOMAINS if MMU
select NEED_KUSER_HELPERS
select TLS_REG_EMUL if SMP || !MMU
+ select CPU_NO_EFFICIENT_FFS
config CPU_32v5
bool
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 2a61e4b..73085d3 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -156,10 +156,6 @@
* - we could be preempted if tree preempt rcu is enabled, so
* it is unsafe to use thread->cpu.
* THREAD_NOTIFY_EXIT
- * - the thread (v) will be running on the local CPU, so
- * v === current_thread_info()
- * - thread->cpu is the local CPU number at the time it is accessed,
- * but may change at any time.
* - we could be preempted if tree preempt rcu is enabled, so
* it is unsafe to use thread->cpu.
*/
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 48eea68..6cd2612 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -200,13 +200,6 @@
__show_regs(regs);
}
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
-{
-}
-
static void tls_thread_flush(void)
{
asm ("msr tpidr_el0, xzr");
diff --git a/arch/avr32/Kconfig b/arch/avr32/Kconfig
index 18b8877..7e75d45 100644
--- a/arch/avr32/Kconfig
+++ b/arch/avr32/Kconfig
@@ -4,6 +4,7 @@
# that we usually don't need on AVR32.
select EXPERT
select HAVE_CLK
+ select HAVE_EXIT_THREAD
select HAVE_OPROFILE
select HAVE_KPROBES
select VIRT_TO_BUS
@@ -17,6 +18,7 @@
select GENERIC_CLOCKEVENTS
select HAVE_MOD_ARCH_SPECIFIC
select MODULES_USE_ELF_RELA
+ select HAVE_NMI
help
AVR32 is a high-performance 32-bit RISC microprocessor core,
designed for cost-sensitive embedded applications, with particular
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 42a53e74..68e5b9d 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -62,9 +62,9 @@
/*
* Free current thread data structures etc
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- ocd_disable(current);
+ ocd_disable(tsk);
}
void flush_thread(void)
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index a63c122..28c63fe 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -40,6 +40,7 @@
select HAVE_MOD_ARCH_SPECIFIC
select MODULES_USE_ELF_RELA
select HAVE_DEBUG_STACKOVERFLOW
+ select HAVE_NMI
config GENERIC_CSUM
def_bool y
diff --git a/arch/blackfin/include/asm/processor.h b/arch/blackfin/include/asm/processor.h
index 7acd466..0c265ab 100644
--- a/arch/blackfin/include/asm/processor.h
+++ b/arch/blackfin/include/asm/processor.h
@@ -76,13 +76,6 @@
}
/*
- * Free current thread data structures etc..
- */
-static inline void exit_thread(void)
-{
-}
-
-/*
* Return saved PC of a blocked thread.
*/
#define thread_saved_pc(tsk) (tsk->thread.pc)
diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
index 3ae9f5a..0ee7686 100644
--- a/arch/c6x/kernel/process.c
+++ b/arch/c6x/kernel/process.c
@@ -82,10 +82,6 @@
{
}
-void exit_thread(void)
-{
-}
-
/*
* Do necessary setup to start up a newly executed thread.
*/
diff --git a/arch/cris/Kconfig b/arch/cris/Kconfig
index 99bda1b..deba266 100644
--- a/arch/cris/Kconfig
+++ b/arch/cris/Kconfig
@@ -59,6 +59,7 @@
select GENERIC_IOMAP
select MODULES_USE_ELF_RELA
select CLONE_BACKWARDS2
+ select HAVE_EXIT_THREAD if ETRAX_ARCH_V32
select OLD_SIGSUSPEND
select OLD_SIGACTION
select GPIOLIB
@@ -69,6 +70,7 @@
select GENERIC_CLOCKEVENTS if ETRAX_ARCH_V32
select GENERIC_SCHED_CLOCK if ETRAX_ARCH_V32
select HAVE_DEBUG_BUGVERBOSE if ETRAX_ARCH_V32
+ select HAVE_NMI
config HZ
int
diff --git a/arch/cris/arch-v10/kernel/process.c b/arch/cris/arch-v10/kernel/process.c
index 02b7834..96e5afe 100644
--- a/arch/cris/arch-v10/kernel/process.c
+++ b/arch/cris/arch-v10/kernel/process.c
@@ -35,15 +35,6 @@
local_irq_enable();
}
-/*
- * Free current thread data structures etc..
- */
-
-void exit_thread(void)
-{
- /* Nothing needs to be done. */
-}
-
/* if the watchdog is enabled, we can simply disable interrupts and go
* into an eternal loop, and the watchdog will reset the CPU after 0.1s
* if on the other hand the watchdog wasn't enabled, we just enable it and wait
diff --git a/arch/cris/arch-v32/kernel/process.c b/arch/cris/arch-v32/kernel/process.c
index c7ce784..4d1afa9 100644
--- a/arch/cris/arch-v32/kernel/process.c
+++ b/arch/cris/arch-v32/kernel/process.c
@@ -33,9 +33,9 @@
*/
extern void deconfigure_bp(long pid);
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- deconfigure_bp(current->pid);
+ deconfigure_bp(tsk->pid);
}
/*
diff --git a/arch/frv/include/asm/processor.h b/arch/frv/include/asm/processor.h
index ae8d423..73f0a79 100644
--- a/arch/frv/include/asm/processor.h
+++ b/arch/frv/include/asm/processor.h
@@ -97,13 +97,6 @@
#define forget_segments() do { } while (0)
/*
- * Free current thread data structures etc..
- */
-static inline void exit_thread(void)
-{
-}
-
-/*
* Return saved PC of a blocked thread.
*/
extern unsigned long thread_saved_pc(struct task_struct *tsk);
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index 986ea84..aa232de 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -20,6 +20,7 @@
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZO
select HAVE_ARCH_KGDB
+ select CPU_NO_EFFICIENT_FFS
config RWSEM_GENERIC_SPINLOCK
def_bool y
diff --git a/arch/h8300/include/asm/processor.h b/arch/h8300/include/asm/processor.h
index 54e3fd8..111df73 100644
--- a/arch/h8300/include/asm/processor.h
+++ b/arch/h8300/include/asm/processor.h
@@ -111,13 +111,6 @@
}
/*
- * Free current thread data structures etc..
- */
-static inline void exit_thread(void)
-{
-}
-
-/*
* Return saved PC of a blocked thread.
*/
unsigned long thread_saved_pc(struct task_struct *tsk);
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index a9ebd47..d9edfd3 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -137,13 +137,6 @@
}
/*
- * Free any architecture-specific thread data structures, etc.
- */
-void exit_thread(void)
-{
-}
-
-/*
* Some archs flush debug and FPU info here
*/
void flush_thread(void)
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index b534eba..f80758c 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -18,6 +18,7 @@
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_EXIT_THREAD
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_KPROBES
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 9cd607b..2436ad5 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -4542,8 +4542,8 @@
/*
- * called only from exit_thread(): task == current
- * we come here only if current has a context attached (loaded or masked)
+ * called only from exit_thread()
+ * we come here only if the task has a context attached (loaded or masked)
*/
void
pfm_exit_thread(struct task_struct *task)
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index b515149..aae6c4d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -570,22 +570,22 @@
}
/*
- * Clean up state associated with current thread. This is called when
+ * Clean up state associated with a thread. This is called when
* the thread calls exit().
*/
void
-exit_thread (void)
+exit_thread (struct task_struct *tsk)
{
- ia64_drop_fpu(current);
+ ia64_drop_fpu(tsk);
#ifdef CONFIG_PERFMON
/* if needed, stop monitoring and flush state to perfmon context */
- if (current->thread.pfm_context)
- pfm_exit_thread(current);
+ if (tsk->thread.pfm_context)
+ pfm_exit_thread(tsk);
/* free debug register resources */
- if (current->thread.flags & IA64_THREAD_DBG_VALID)
- pfm_release_debug_registers(current);
+ if (tsk->thread.flags & IA64_THREAD_DBG_VALID)
+ pfm_release_debug_registers(tsk);
#endif
}
diff --git a/arch/m32r/Kconfig b/arch/m32r/Kconfig
index c82b292..3cc8498 100644
--- a/arch/m32r/Kconfig
+++ b/arch/m32r/Kconfig
@@ -17,6 +17,7 @@
select ARCH_USES_GETTIMEOFFSET
select MODULES_USE_ELF_RELA
select HAVE_DEBUG_STACKOVERFLOW
+ select CPU_NO_EFFICIENT_FFS
config SBUS
bool
diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c
index e69221d..a88b1f0 100644
--- a/arch/m32r/kernel/process.c
+++ b/arch/m32r/kernel/process.c
@@ -101,15 +101,6 @@
#endif
}
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
-{
- /* Nothing to do. */
- DPRINTK("pid = %d\n", current->pid);
-}
-
void flush_thread(void)
{
DPRINTK("pid = %d\n", current->pid);
diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index c1beb5a..8ace920 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -40,6 +40,7 @@
select CPU_HAS_NO_MULDIV64
select CPU_HAS_NO_UNALIGNED
select GENERIC_CSUM
+ select CPU_NO_EFFICIENT_FFS
help
The Freescale (was Motorola) 68000 CPU is the first generation of
the well known M68K family of processors. The CPU core as well as
@@ -51,6 +52,7 @@
bool
select CPU_HAS_NO_BITFIELDS
select CPU_HAS_NO_UNALIGNED
+ select CPU_NO_EFFICIENT_FFS
help
The Freescale (was then Motorola) CPU32 is a CPU core that is
based on the 68020 processor. For the most part it is used in
@@ -130,6 +132,7 @@
depends on !MMU
select COLDFIRE_SW_A7
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5206 processor support.
@@ -138,6 +141,7 @@
depends on !MMU
select COLDFIRE_SW_A7
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5206e processor support.
@@ -163,6 +167,7 @@
depends on !MMU
select COLDFIRE_SW_A7
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5249 processor support.
@@ -171,6 +176,7 @@
depends on !MMU
select COLDFIRE_SW_A7
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Freescale (Motorola) Coldfire 5251/5253 processor support.
@@ -189,6 +195,7 @@
depends on !MMU
select COLDFIRE_SW_A7
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5272 processor support.
@@ -217,6 +224,7 @@
select COLDFIRE_SW_A7
select HAVE_CACHE_CB
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5307 processor support.
@@ -242,6 +250,7 @@
select COLDFIRE_SW_A7
select HAVE_CACHE_CB
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Motorola ColdFire 5407 processor support.
@@ -251,6 +260,7 @@
select MMU_COLDFIRE if MMU
select HAVE_CACHE_CB
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Freescale ColdFire 5470/5471/5472/5473/5474/5475 processor support.
@@ -260,6 +270,7 @@
select M54xx
select HAVE_CACHE_CB
select HAVE_MBAR
+ select CPU_NO_EFFICIENT_FFS
help
Freescale ColdFire 5480/5481/5482/5483/5484/5485 processor support.
diff --git a/arch/m68k/include/asm/processor.h b/arch/m68k/include/asm/processor.h
index 20dda1d..a6ce2ec 100644
--- a/arch/m68k/include/asm/processor.h
+++ b/arch/m68k/include/asm/processor.h
@@ -153,13 +153,6 @@
{
}
-/*
- * Free current thread data structures etc..
- */
-static inline void exit_thread(void)
-{
-}
-
extern unsigned long thread_saved_pc(struct task_struct *tsk);
unsigned long get_wchan(struct task_struct *p);
diff --git a/arch/metag/Kconfig b/arch/metag/Kconfig
index a0fa88d..5b7a45d 100644
--- a/arch/metag/Kconfig
+++ b/arch/metag/Kconfig
@@ -11,6 +11,7 @@
select HAVE_DEBUG_KMEMLEAK
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_DYNAMIC_FTRACE
+ select HAVE_EXIT_THREAD
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
select HAVE_KERNEL_BZIP2
@@ -29,6 +30,7 @@
select OF
select OF_EARLY_FLATTREE
select SPARSE_IRQ
+ select CPU_NO_EFFICIENT_FFS
config STACKTRACE_SUPPORT
def_bool y
diff --git a/arch/metag/include/asm/processor.h b/arch/metag/include/asm/processor.h
index 0838ca6..a0333eb 100644
--- a/arch/metag/include/asm/processor.h
+++ b/arch/metag/include/asm/processor.h
@@ -134,8 +134,6 @@
#define copy_segments(tsk, mm) do { } while (0)
#define release_segments(mm) do { } while (0)
-extern void exit_thread(void);
-
/*
* Return saved PC of a blocked thread.
*/
diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 7f54618..3506279 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -345,10 +345,10 @@
/*
* Free current thread data structures etc.
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- clear_fpu(&current->thread);
- clear_dsp(&current->thread);
+ clear_fpu(&tsk->thread);
+ clear_dsp(&tsk->thread);
}
/* TODO: figure out how to unwind the kernel stack here to figure out
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index 3d793b5..f17c3a4 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -32,6 +32,7 @@
select OF_EARLY_FLATTREE
select TRACING_SUPPORT
select VIRT_TO_BUS
+ select CPU_NO_EFFICIENT_FFS
config SWAP
def_bool n
diff --git a/arch/microblaze/include/asm/processor.h b/arch/microblaze/include/asm/processor.h
index 497a988..c38d0dd 100644
--- a/arch/microblaze/include/asm/processor.h
+++ b/arch/microblaze/include/asm/processor.h
@@ -70,11 +70,6 @@
{
}
-/* Free all resources held by a thread. */
-static inline void exit_thread(void)
-{
-}
-
extern unsigned long thread_saved_pc(struct task_struct *t);
extern unsigned long get_wchan(struct task_struct *p);
@@ -127,11 +122,6 @@
{
}
-/* Free current thread data structures etc. */
-static inline void exit_thread(void)
-{
-}
-
/* Return saved (kernel) PC of a blocked thread. */
# define thread_saved_pc(tsk) \
((tsk)->thread.regs ? (tsk)->thread.regs->r15 : 0)
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 5663f41..8040fb1 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -48,6 +48,7 @@
select GENERIC_SCHED_CLOCK if !CAVIUM_OCTEON_SOC
select GENERIC_CMOS_UPDATE
select HAVE_MOD_ARCH_SPECIFIC
+ select HAVE_NMI
select VIRT_TO_BUS
select MODULES_USE_ELF_REL if MODULES
select MODULES_USE_ELF_RELA if MODULES && 64BIT
diff --git a/arch/mips/include/asm/cpu-features.h b/arch/mips/include/asm/cpu-features.h
index e6f19fc..e961c8a 100644
--- a/arch/mips/include/asm/cpu-features.h
+++ b/arch/mips/include/asm/cpu-features.h
@@ -204,6 +204,16 @@
#endif
#endif
+/* __builtin_constant_p(cpu_has_mips_r) && cpu_has_mips_r */
+#if !((defined(cpu_has_mips32r1) && cpu_has_mips32r1) || \
+ (defined(cpu_has_mips32r2) && cpu_has_mips32r2) || \
+ (defined(cpu_has_mips32r6) && cpu_has_mips32r6) || \
+ (defined(cpu_has_mips64r1) && cpu_has_mips64r1) || \
+ (defined(cpu_has_mips64r2) && cpu_has_mips64r2) || \
+ (defined(cpu_has_mips64r6) && cpu_has_mips64r6))
+#define CPU_NO_EFFICIENT_FFS 1
+#endif
+
#ifndef cpu_has_mips_1
# define cpu_has_mips_1 (!cpu_has_mips_r6)
#endif
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a6b3dc5..411c971 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -73,10 +73,6 @@
regs->regs[29] = sp;
}
-void exit_thread(void)
-{
-}
-
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
/*
diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 06ddb55..9627e81 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -1,5 +1,6 @@
config MN10300
def_bool y
+ select HAVE_EXIT_THREAD
select HAVE_OPROFILE
select HAVE_UID16
select GENERIC_IRQ_SHOW
diff --git a/arch/mn10300/include/asm/fpu.h b/arch/mn10300/include/asm/fpu.h
index 738ff72..a47e995 100644
--- a/arch/mn10300/include/asm/fpu.h
+++ b/arch/mn10300/include/asm/fpu.h
@@ -76,11 +76,9 @@
preempt_enable();
}
-static inline void exit_fpu(void)
+static inline void exit_fpu(struct task_struct *tsk)
{
#ifdef CONFIG_LAZY_SAVE_FPU
- struct task_struct *tsk = current;
-
preempt_disable();
if (fpu_state_owner == tsk)
fpu_state_owner = NULL;
@@ -123,7 +121,7 @@
static inline void fpu_save(struct fpu_state_struct *s) {}
static inline void fpu_kill_state(struct task_struct *tsk) {}
static inline void unlazy_fpu(struct task_struct *tsk) {}
-static inline void exit_fpu(void) {}
+static inline void exit_fpu(struct task_struct *tsk) {}
static inline void flush_fpu(void) {}
static inline int fpu_setup_sigcontext(struct fpucontext *buf) { return 0; }
static inline int fpu_restore_sigcontext(struct fpucontext *buf) { return 0; }
diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c
index 3707da5..cbede4e 100644
--- a/arch/mn10300/kernel/process.c
+++ b/arch/mn10300/kernel/process.c
@@ -103,9 +103,9 @@
/*
* free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- exit_fpu();
+ exit_fpu(tsk);
}
void flush_thread(void)
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index 87ca653..51a56c8 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -15,6 +15,7 @@
select SOC_BUS
select SPARSE_IRQ
select USB_ARCH_HAS_HCD if USB_SUPPORT
+ select CPU_NO_EFFICIENT_FFS
config GENERIC_CSUM
def_bool y
diff --git a/arch/nios2/include/asm/processor.h b/arch/nios2/include/asm/processor.h
index c2ba45c..1c953f0 100644
--- a/arch/nios2/include/asm/processor.h
+++ b/arch/nios2/include/asm/processor.h
@@ -75,11 +75,6 @@
{
}
-/* Free current thread data structures etc.. */
-static inline void exit_thread(void)
-{
-}
-
/* Return saved PC of a blocked thread. */
#define thread_saved_pc(tsk) ((tsk)->thread.kregs->ea)
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index e118c02..142cb05 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -25,6 +25,7 @@
select MODULES_USE_ELF_RELA
select HAVE_DEBUG_STACKOVERFLOW
select OR1K_PIC
+ select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
config MMU
def_bool y
diff --git a/arch/openrisc/include/asm/processor.h b/arch/openrisc/include/asm/processor.h
index 4d235e3..70334c9 100644
--- a/arch/openrisc/include/asm/processor.h
+++ b/arch/openrisc/include/asm/processor.h
@@ -85,15 +85,6 @@
unsigned long get_wchan(struct task_struct *p);
/*
- * Free current thread data structures etc..
- */
-
-extern inline void exit_thread(void)
-{
- /* Nothing needs to be done. */
-}
-
-/*
* Return saved PC of a blocked thread. For now, this is the "user" PC
*/
extern unsigned long thread_saved_pc(struct task_struct *t);
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 88cfaa8..3d498a6 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -32,6 +32,7 @@
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_SECCOMP_FILTER
select ARCH_NO_COHERENT_DMA_MMAP
+ select CPU_NO_EFFICIENT_FFS
help
The PA-RISC microprocessor is designed by Hewlett-Packard and used
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index 809905a..4063943 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -144,13 +144,6 @@
void (*pm_power_off)(void) = machine_power_off;
EXPORT_SYMBOL(pm_power_off);
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
-{
-}
-
void flush_thread(void)
{
/* Only needs to handle fpu stuff or perf monitors.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f0403b5..01f7464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -155,6 +155,7 @@
select NO_BOOTMEM
select HAVE_GENERIC_RCU_GUP
select HAVE_PERF_EVENTS_NMI if PPC64
+ select HAVE_NMI if PERF_EVENTS
select EDAC_SUPPORT
select EDAC_ATOMIC_SCRUB
select ARCH_HAS_DMA_SET_COHERENT_MASK
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ea8a28f..e2f12cb 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1329,10 +1329,6 @@
show_instructions(regs);
}
-void exit_thread(void)
-{
-}
-
void flush_thread(void)
{
#ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index de0fcc0..a8c2590 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -123,6 +123,7 @@
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_EARLY_PFN_TO_NID
select HAVE_ARCH_JUMP_LABEL
+ select CPU_NO_EFFICIENT_FFS if !HAVE_MARCH_Z9_109_FEATURES
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_SOFT_DIRTY
select HAVE_ARCH_TRACEHOOK
@@ -134,6 +135,7 @@
select HAVE_DMA_API_DEBUG
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
+ select HAVE_EXIT_THREAD
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
@@ -165,6 +167,7 @@
select TTY
select VIRT_CPU_ACCOUNTING
select VIRT_TO_BUS
+ select HAVE_NMI
config SCHED_OMIT_FRAME_POINTER
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 481d7a83..bba4fa7 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -68,9 +68,10 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- exit_thread_runtime_instr();
+ if (tsk == current)
+ exit_thread_runtime_instr();
}
void flush_thread(void)
diff --git a/arch/score/Kconfig b/arch/score/Kconfig
index 366e1b5..507d631 100644
--- a/arch/score/Kconfig
+++ b/arch/score/Kconfig
@@ -14,6 +14,7 @@
select VIRT_TO_BUS
select MODULES_USE_ELF_REL
select CLONE_BACKWARDS
+ select CPU_NO_EFFICIENT_FFS
choice
prompt "System type"
diff --git a/arch/score/kernel/process.c b/arch/score/kernel/process.c
index a1519ad3..aae9480 100644
--- a/arch/score/kernel/process.c
+++ b/arch/score/kernel/process.c
@@ -56,8 +56,6 @@
regs->regs[0] = sp;
}
-void exit_thread(void) {}
-
/*
* When a process does an "exec", machine state like FPU and debug
* registers need to be reset. This is a hook function for that.
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7ed20fc..e803a83 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -20,6 +20,7 @@
select PERF_USE_VMALLOC
select HAVE_DEBUG_KMEMLEAK
select HAVE_KERNEL_GZIP
+ select CPU_NO_EFFICIENT_FFS
select HAVE_KERNEL_BZIP2
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_XZ
@@ -44,6 +45,7 @@
select OLD_SIGSUSPEND
select OLD_SIGACTION
select HAVE_ARCH_AUDITSYSCALL
+ select HAVE_NMI
help
The SuperH is a RISC processor targeted for use in embedded systems
and consumer electronics; it was also used in the Sega Dreamcast
@@ -71,6 +73,7 @@
config SUPERH64
def_bool ARCH = "sh64"
+ select HAVE_EXIT_THREAD
select KALLSYMS
config ARCH_DEFCONFIG
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 2885fc9..ee12e94 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -76,13 +76,6 @@
}
EXPORT_SYMBOL(start_thread);
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
-{
-}
-
void flush_thread(void)
{
struct task_struct *tsk = current;
diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c
index e2062e6..9d3e991 100644
--- a/arch/sh/kernel/process_64.c
+++ b/arch/sh/kernel/process_64.c
@@ -288,7 +288,7 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
/*
* See arch/sparc/kernel/process.c for the precedent for doing
@@ -307,9 +307,8 @@
* which it would get safely nulled.
*/
#ifdef CONFIG_SH_FPU
- if (last_task_used_math == current) {
+ if (last_task_used_math == tsk)
last_task_used_math = NULL;
- }
#endif
}
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index db0a26c..546293d 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -20,6 +20,7 @@
select HAVE_OPROFILE
select HAVE_ARCH_KGDB if !SMP || SPARC64
select HAVE_ARCH_TRACEHOOK
+ select HAVE_EXIT_THREAD
select SYSCTL_EXCEPTION_TRACE
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
select RTC_CLASS
@@ -41,6 +42,7 @@
select ODD_RT_SIGACTION
select OLD_SIGSUSPEND
select ARCH_HAS_SG_CHAIN
+ select CPU_NO_EFFICIENT_FFS
config SPARC32
def_bool !64BIT
@@ -78,6 +80,7 @@
select NO_BOOTMEM
select HAVE_ARCH_AUDITSYSCALL
select ARCH_SUPPORTS_ATOMIC_RMW
+ select HAVE_NMI
config ARCH_DEFCONFIG
string
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index c5113c7..b7780a5 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -184,21 +184,21 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
#ifndef CONFIG_SMP
- if(last_task_used_math == current) {
+ if (last_task_used_math == tsk) {
#else
- if (test_thread_flag(TIF_USEDFPU)) {
+ if (test_ti_thread_flag(task_thread_info(tsk), TIF_USEDFPU)) {
#endif
/* Keep process from leaving FPU in a bogon state. */
put_psr(get_psr() | PSR_EF);
- fpsave(&current->thread.float_regs[0], &current->thread.fsr,
- &current->thread.fpqueue[0], &current->thread.fpqdepth);
+ fpsave(&tsk->thread.float_regs[0], &tsk->thread.fsr,
+ &tsk->thread.fpqueue[0], &tsk->thread.fpqdepth);
#ifndef CONFIG_SMP
last_task_used_math = NULL;
#else
- clear_thread_flag(TIF_USEDFPU);
+ clear_ti_thread_flag(task_thread_info(tsk), TIF_USEDFPU);
#endif
}
}
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c16ef1a..fa14402 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -417,9 +417,9 @@
}
/* Free current thread data structures etc.. */
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- struct thread_info *t = current_thread_info();
+ struct thread_info *t = task_thread_info(tsk);
if (t->utraps) {
if (t->utraps[0] < 2)
diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 8171930..76989b87 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -3,6 +3,7 @@
config TILE
def_bool y
+ select HAVE_EXIT_THREAD
select HAVE_PERF_EVENTS
select USE_PMC if PERF_EVENTS
select HAVE_DMA_API_DEBUG
@@ -29,6 +30,7 @@
select HAVE_DEBUG_STACKOVERFLOW
select ARCH_WANT_FRAME_POINTERS
select HAVE_CONTEXT_TRACKING
+ select HAVE_NMI if USE_PMC
select EDAC_SUPPORT
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d3..6b705cc 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -541,7 +541,7 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
#ifdef CONFIG_HARDWALL
/*
@@ -550,7 +550,7 @@
* the last reference to a hardwall fd, it would already have
* been released and deactivated at this point.)
*/
- hardwall_deactivate_all(current);
+ hardwall_deactivate_all(tsk);
#endif
}
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 48af59a..0b04711 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -103,10 +103,6 @@
tracehook_notify_resume(regs);
}
-void exit_thread(void)
-{
-}
-
int get_current_pid(void)
{
return task_pid_nr(current);
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index b008e99..00299c9 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -201,13 +201,6 @@
__backtrace();
}
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
-{
-}
-
void flush_thread(void)
{
struct thread_info *thread = current_thread_info();
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 15f8276..48ac290 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -105,6 +105,7 @@
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EFFICIENT_UNALIGNED_ACCESS
+ select HAVE_EXIT_THREAD
select HAVE_FENTRY if X86_64
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_FP_TEST
@@ -130,6 +131,7 @@
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
+ select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
select HAVE_PCSPKR_PLATFORM
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 12f9653..2982387 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -5,6 +5,7 @@
*/
#include <linux/errno.h>
#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
#include <linux/thread_info.h>
#include <linux/string.h>
#include <asm/asm.h>
@@ -721,6 +722,8 @@
might_fault();
+ kasan_check_write(to, n);
+
/*
* While we would like to have the compiler do the checking for us
* even in the non-constant size case, any false positives there are
@@ -754,6 +757,8 @@
{
int sz = __compiletime_object_size(from);
+ kasan_check_read(from, n);
+
might_fault();
/* See the comment in copy_from_user() above. */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 3076986..2eac2aa 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -7,6 +7,7 @@
#include <linux/compiler.h>
#include <linux/errno.h>
#include <linux/lockdep.h>
+#include <linux/kasan-checks.h>
#include <asm/alternative.h>
#include <asm/cpufeatures.h>
#include <asm/page.h>
@@ -109,6 +110,7 @@
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
might_fault();
+ kasan_check_write(dst, size);
return __copy_from_user_nocheck(dst, src, size);
}
@@ -175,6 +177,7 @@
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
might_fault();
+ kasan_check_read(src, size);
return __copy_to_user_nocheck(dst, src, size);
}
@@ -242,12 +245,14 @@
static __must_check __always_inline int
__copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
{
+ kasan_check_write(dst, size);
return __copy_from_user_nocheck(dst, src, size);
}
static __must_check __always_inline int
__copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
{
+ kasan_check_read(src, size);
return __copy_to_user_nocheck(dst, src, size);
}
@@ -258,6 +263,7 @@
__copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
{
might_fault();
+ kasan_check_write(dst, size);
return __copy_user_nocache(dst, src, size, 1);
}
@@ -265,6 +271,7 @@
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
unsigned size)
{
+ kasan_check_write(dst, size);
return __copy_user_nocache(dst, src, size, 0);
}
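
These kasan_check_read()/kasan_check_write() calls come from the new
<linux/kasan-checks.h> interface also merged in this batch: they explicitly
tell KASAN that a kernel-side range is about to be read or written, which
matters here because the user-copy routines themselves are assembly and
therefore not instrumented. A sketch of what such a header provides - the
prototypes below are assumed from the call sites above, not copied from the
tree:

  #ifdef CONFIG_KASAN
  void kasan_check_read(const void *p, unsigned int size);
  void kasan_check_write(const void *p, unsigned int size);
  #else
  /* Compile away entirely when KASAN is disabled. */
  static inline void kasan_check_read(const void *p, unsigned int size) { }
  static inline void kasan_check_write(const void *p, unsigned int size) { }
  #endif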
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index 045e424..7788ce6 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -18,7 +18,6 @@
#include <linux/nmi.h>
#include <linux/module.h>
#include <linux/delay.h>
-#include <linux/seq_buf.h>
#ifdef CONFIG_HARDLOCKUP_DETECTOR
u64 hw_nmi_get_sample_period(int watchdog_thresh)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 2915d54..96becbb 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -97,10 +97,9 @@
/*
* Free current thread data structures etc..
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
- struct task_struct *me = current;
- struct thread_struct *t = &me->thread;
+ struct thread_struct *t = &tsk->thread;
unsigned long *bp = t->io_bitmap_ptr;
struct fpu *fpu = &t->fpu;
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 85257af..64336f6 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -14,6 +14,7 @@
select GENERIC_PCI_IOMAP
select GENERIC_SCHED_CLOCK
select HAVE_DMA_API_DEBUG
+ select HAVE_EXIT_THREAD
select HAVE_FUNCTION_TRACER
select HAVE_FUTEX_CMPXCHG if !MMU
select HAVE_HW_BREAKPOINT if PERF_EVENTS
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 5bbfed8..e0ded48 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -115,10 +115,10 @@
/*
* This is called when the thread calls exit().
*/
-void exit_thread(void)
+void exit_thread(struct task_struct *tsk)
{
#if XTENSA_HAVE_COPROCESSORS
- coprocessor_release_all(current_thread_info());
+ coprocessor_release_all(task_thread_info(tsk));
#endif
}
diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
index e507cfb..edcea70 100644
--- a/block/partitions/ldm.c
+++ b/block/partitions/ldm.c
@@ -27,6 +27,8 @@
#include <linux/pagemap.h>
#include <linux/stringify.h>
#include <linux/kernel.h>
+#include <linux/uuid.h>
+
#include "ldm.h"
#include "check.h"
#include "msdos.h"
@@ -66,60 +68,6 @@
}
/**
- * ldm_parse_hexbyte - Convert a ASCII hex number to a byte
- * @src: Pointer to at least 2 characters to convert.
- *
- * Convert a two character ASCII hex string to a number.
- *
- * Return: 0-255 Success, the byte was parsed correctly
- * -1 Error, an invalid character was supplied
- */
-static int ldm_parse_hexbyte (const u8 *src)
-{
- unsigned int x; /* For correct wrapping */
- int h;
-
- /* high part */
- x = h = hex_to_bin(src[0]);
- if (h < 0)
- return -1;
-
- /* low part */
- h = hex_to_bin(src[1]);
- if (h < 0)
- return -1;
-
- return (x << 4) + h;
-}
-
-/**
- * ldm_parse_guid - Convert GUID from ASCII to binary
- * @src: 36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba
- * @dest: Memory block to hold binary GUID (16 bytes)
- *
- * N.B. The GUID need not be NULL terminated.
- *
- * Return: 'true' @dest contains binary GUID
- * 'false' @dest contents are undefined
- */
-static bool ldm_parse_guid (const u8 *src, u8 *dest)
-{
- static const int size[] = { 4, 2, 2, 2, 6 };
- int i, j, v;
-
- if (src[8] != '-' || src[13] != '-' ||
- src[18] != '-' || src[23] != '-')
- return false;
-
- for (j = 0; j < 5; j++, src++)
- for (i = 0; i < size[j]; i++, src+=2, *dest++ = v)
- if ((v = ldm_parse_hexbyte (src)) < 0)
- return false;
-
- return true;
-}
-
-/**
* ldm_parse_privhead - Read the LDM Database PRIVHEAD structure
* @data: Raw database PRIVHEAD structure loaded from the device
* @ph: In-memory privhead structure in which to return parsed information
@@ -167,7 +115,7 @@
ldm_error("PRIVHEAD disk size doesn't match real disk size");
return false;
}
- if (!ldm_parse_guid(data + 0x0030, ph->disk_id)) {
+ if (uuid_be_to_bin(data + 0x0030, (uuid_be *)ph->disk_id)) {
ldm_error("PRIVHEAD contains an invalid GUID.");
return false;
}
@@ -944,7 +892,7 @@
disk = &vb->vblk.disk;
ldm_get_vstr (buffer + 0x18 + r_diskid, disk->alt_name,
sizeof (disk->alt_name));
- if (!ldm_parse_guid (buffer + 0x19 + r_name, disk->disk_id))
+ if (uuid_be_to_bin(buffer + 0x19 + r_name, (uuid_be *)disk->disk_id))
return false;
return true;
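
Both ldm.c hunks replace the open-coded GUID parser with uuid_be_to_bin() from
<linux/uuid.h>, which returns 0 on success and a negative errno on a malformed
string - hence the test is inverted relative to the old ldm_parse_guid(). A
small, hypothetical caller sketch under that assumption:

  #include <linux/uuid.h>
  #include <linux/types.h>
  #include <linux/printk.h>

  /* Hypothetical helper: parse a 36-character GUID string into 16 bytes. */
  static bool parse_disk_id(const char *str, u8 disk_id[16])
  {
          if (uuid_be_to_bin(str, (uuid_be *)disk_id)) {
                  pr_err("invalid GUID\n");
                  return false;
          }
          return true;
  }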
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 3ef42e5..b51a816 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/wait.h>
#include <linux/sched.h>
+#include <linux/cpu.h>
#include "zcomp.h"
#include "zcomp_lzo.h"
@@ -20,29 +21,6 @@
#include "zcomp_lz4.h"
#endif
-/*
- * single zcomp_strm backend
- */
-struct zcomp_strm_single {
- struct mutex strm_lock;
- struct zcomp_strm *zstrm;
-};
-
-/*
- * multi zcomp_strm backend
- */
-struct zcomp_strm_multi {
- /* protect strm list */
- spinlock_t strm_lock;
- /* max possible number of zstrm streams */
- int max_strm;
- /* number of available zstrm streams */
- int avail_strm;
- /* list of available strms */
- struct list_head idle_strm;
- wait_queue_head_t strm_wait;
-};
-
static struct zcomp_backend *backends[] = {
&zcomp_lzo,
#ifdef CONFIG_ZRAM_LZ4_COMPRESS
@@ -93,188 +71,6 @@
return zstrm;
}
-/*
- * get idle zcomp_strm or wait until other process release
- * (zcomp_strm_release()) one for us
- */
-static struct zcomp_strm *zcomp_strm_multi_find(struct zcomp *comp)
-{
- struct zcomp_strm_multi *zs = comp->stream;
- struct zcomp_strm *zstrm;
-
- while (1) {
- spin_lock(&zs->strm_lock);
- if (!list_empty(&zs->idle_strm)) {
- zstrm = list_entry(zs->idle_strm.next,
- struct zcomp_strm, list);
- list_del(&zstrm->list);
- spin_unlock(&zs->strm_lock);
- return zstrm;
- }
- /* zstrm streams limit reached, wait for idle stream */
- if (zs->avail_strm >= zs->max_strm) {
- spin_unlock(&zs->strm_lock);
- wait_event(zs->strm_wait, !list_empty(&zs->idle_strm));
- continue;
- }
- /* allocate new zstrm stream */
- zs->avail_strm++;
- spin_unlock(&zs->strm_lock);
- /*
- * This function can be called in swapout/fs write path
- * so we can't use GFP_FS|IO. And it assumes we already
- * have at least one stream in zram initialization so we
- * don't do best effort to allocate more stream in here.
- * A default stream will work well without further multiple
- * streams. That's why we use NORETRY | NOWARN.
- */
- zstrm = zcomp_strm_alloc(comp, GFP_NOIO | __GFP_NORETRY |
- __GFP_NOWARN);
- if (!zstrm) {
- spin_lock(&zs->strm_lock);
- zs->avail_strm--;
- spin_unlock(&zs->strm_lock);
- wait_event(zs->strm_wait, !list_empty(&zs->idle_strm));
- continue;
- }
- break;
- }
- return zstrm;
-}
-
-/* add stream back to idle list and wake up waiter or free the stream */
-static void zcomp_strm_multi_release(struct zcomp *comp, struct zcomp_strm *zstrm)
-{
- struct zcomp_strm_multi *zs = comp->stream;
-
- spin_lock(&zs->strm_lock);
- if (zs->avail_strm <= zs->max_strm) {
- list_add(&zstrm->list, &zs->idle_strm);
- spin_unlock(&zs->strm_lock);
- wake_up(&zs->strm_wait);
- return;
- }
-
- zs->avail_strm--;
- spin_unlock(&zs->strm_lock);
- zcomp_strm_free(comp, zstrm);
-}
-
-/* change max_strm limit */
-static bool zcomp_strm_multi_set_max_streams(struct zcomp *comp, int num_strm)
-{
- struct zcomp_strm_multi *zs = comp->stream;
- struct zcomp_strm *zstrm;
-
- spin_lock(&zs->strm_lock);
- zs->max_strm = num_strm;
- /*
- * if user has lowered the limit and there are idle streams,
- * immediately free as much streams (and memory) as we can.
- */
- while (zs->avail_strm > num_strm && !list_empty(&zs->idle_strm)) {
- zstrm = list_entry(zs->idle_strm.next,
- struct zcomp_strm, list);
- list_del(&zstrm->list);
- zcomp_strm_free(comp, zstrm);
- zs->avail_strm--;
- }
- spin_unlock(&zs->strm_lock);
- return true;
-}
-
-static void zcomp_strm_multi_destroy(struct zcomp *comp)
-{
- struct zcomp_strm_multi *zs = comp->stream;
- struct zcomp_strm *zstrm;
-
- while (!list_empty(&zs->idle_strm)) {
- zstrm = list_entry(zs->idle_strm.next,
- struct zcomp_strm, list);
- list_del(&zstrm->list);
- zcomp_strm_free(comp, zstrm);
- }
- kfree(zs);
-}
-
-static int zcomp_strm_multi_create(struct zcomp *comp, int max_strm)
-{
- struct zcomp_strm *zstrm;
- struct zcomp_strm_multi *zs;
-
- comp->destroy = zcomp_strm_multi_destroy;
- comp->strm_find = zcomp_strm_multi_find;
- comp->strm_release = zcomp_strm_multi_release;
- comp->set_max_streams = zcomp_strm_multi_set_max_streams;
- zs = kmalloc(sizeof(struct zcomp_strm_multi), GFP_KERNEL);
- if (!zs)
- return -ENOMEM;
-
- comp->stream = zs;
- spin_lock_init(&zs->strm_lock);
- INIT_LIST_HEAD(&zs->idle_strm);
- init_waitqueue_head(&zs->strm_wait);
- zs->max_strm = max_strm;
- zs->avail_strm = 1;
-
- zstrm = zcomp_strm_alloc(comp, GFP_KERNEL);
- if (!zstrm) {
- kfree(zs);
- return -ENOMEM;
- }
- list_add(&zstrm->list, &zs->idle_strm);
- return 0;
-}
-
-static struct zcomp_strm *zcomp_strm_single_find(struct zcomp *comp)
-{
- struct zcomp_strm_single *zs = comp->stream;
- mutex_lock(&zs->strm_lock);
- return zs->zstrm;
-}
-
-static void zcomp_strm_single_release(struct zcomp *comp,
- struct zcomp_strm *zstrm)
-{
- struct zcomp_strm_single *zs = comp->stream;
- mutex_unlock(&zs->strm_lock);
-}
-
-static bool zcomp_strm_single_set_max_streams(struct zcomp *comp, int num_strm)
-{
- /* zcomp_strm_single support only max_comp_streams == 1 */
- return false;
-}
-
-static void zcomp_strm_single_destroy(struct zcomp *comp)
-{
- struct zcomp_strm_single *zs = comp->stream;
- zcomp_strm_free(comp, zs->zstrm);
- kfree(zs);
-}
-
-static int zcomp_strm_single_create(struct zcomp *comp)
-{
- struct zcomp_strm_single *zs;
-
- comp->destroy = zcomp_strm_single_destroy;
- comp->strm_find = zcomp_strm_single_find;
- comp->strm_release = zcomp_strm_single_release;
- comp->set_max_streams = zcomp_strm_single_set_max_streams;
- zs = kmalloc(sizeof(struct zcomp_strm_single), GFP_KERNEL);
- if (!zs)
- return -ENOMEM;
-
- comp->stream = zs;
- mutex_init(&zs->strm_lock);
- zs->zstrm = zcomp_strm_alloc(comp, GFP_KERNEL);
- if (!zs->zstrm) {
- kfree(zs);
- return -ENOMEM;
- }
- return 0;
-}
-
/* show available compressors */
ssize_t zcomp_available_show(const char *comp, char *buf)
{
@@ -299,19 +95,14 @@
return find_backend(comp) != NULL;
}
-bool zcomp_set_max_streams(struct zcomp *comp, int num_strm)
-{
- return comp->set_max_streams(comp, num_strm);
-}
-
struct zcomp_strm *zcomp_strm_find(struct zcomp *comp)
{
- return comp->strm_find(comp);
+ return *get_cpu_ptr(comp->stream);
}
void zcomp_strm_release(struct zcomp *comp, struct zcomp_strm *zstrm)
{
- comp->strm_release(comp, zstrm);
+ put_cpu_ptr(comp->stream);
}
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
@@ -327,9 +118,83 @@
return comp->backend->decompress(src, src_len, dst);
}
+static int __zcomp_cpu_notifier(struct zcomp *comp,
+ unsigned long action, unsigned long cpu)
+{
+ struct zcomp_strm *zstrm;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (WARN_ON(*per_cpu_ptr(comp->stream, cpu)))
+ break;
+ zstrm = zcomp_strm_alloc(comp, GFP_KERNEL);
+ if (IS_ERR_OR_NULL(zstrm)) {
+ pr_err("Can't allocate a compression stream\n");
+ return NOTIFY_BAD;
+ }
+ *per_cpu_ptr(comp->stream, cpu) = zstrm;
+ break;
+ case CPU_DEAD:
+ case CPU_UP_CANCELED:
+ zstrm = *per_cpu_ptr(comp->stream, cpu);
+ if (!IS_ERR_OR_NULL(zstrm))
+ zcomp_strm_free(comp, zstrm);
+ *per_cpu_ptr(comp->stream, cpu) = NULL;
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static int zcomp_cpu_notifier(struct notifier_block *nb,
+ unsigned long action, void *pcpu)
+{
+ unsigned long cpu = (unsigned long)pcpu;
+ struct zcomp *comp = container_of(nb, typeof(*comp), notifier);
+
+ return __zcomp_cpu_notifier(comp, action, cpu);
+}
+
+static int zcomp_init(struct zcomp *comp)
+{
+ unsigned long cpu;
+ int ret;
+
+ comp->notifier.notifier_call = zcomp_cpu_notifier;
+
+ comp->stream = alloc_percpu(struct zcomp_strm *);
+ if (!comp->stream)
+ return -ENOMEM;
+
+ cpu_notifier_register_begin();
+ for_each_online_cpu(cpu) {
+ ret = __zcomp_cpu_notifier(comp, CPU_UP_PREPARE, cpu);
+ if (ret == NOTIFY_BAD)
+ goto cleanup;
+ }
+ __register_cpu_notifier(&comp->notifier);
+ cpu_notifier_register_done();
+ return 0;
+
+cleanup:
+ for_each_online_cpu(cpu)
+ __zcomp_cpu_notifier(comp, CPU_UP_CANCELED, cpu);
+ cpu_notifier_register_done();
+ return -ENOMEM;
+}
+
void zcomp_destroy(struct zcomp *comp)
{
- comp->destroy(comp);
+ unsigned long cpu;
+
+ cpu_notifier_register_begin();
+ for_each_online_cpu(cpu)
+ __zcomp_cpu_notifier(comp, CPU_UP_CANCELED, cpu);
+ __unregister_cpu_notifier(&comp->notifier);
+ cpu_notifier_register_done();
+
+ free_percpu(comp->stream);
kfree(comp);
}
@@ -339,9 +204,9 @@
* backend pointer or ERR_PTR if things went bad. ERR_PTR(-EINVAL)
* if requested algorithm is not supported, ERR_PTR(-ENOMEM) in
* case of allocation error, or any other error potentially
- * returned by functions zcomp_strm_{multi,single}_create.
+ * returned by zcomp_init().
*/
-struct zcomp *zcomp_create(const char *compress, int max_strm)
+struct zcomp *zcomp_create(const char *compress)
{
struct zcomp *comp;
struct zcomp_backend *backend;
@@ -356,10 +221,7 @@
return ERR_PTR(-ENOMEM);
comp->backend = backend;
- if (max_strm > 1)
- error = zcomp_strm_multi_create(comp, max_strm);
- else
- error = zcomp_strm_single_create(comp);
+ error = zcomp_init(comp);
if (error) {
kfree(comp);
return ERR_PTR(error);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index b7d2a4b..ffd88cb 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -10,8 +10,6 @@
#ifndef _ZCOMP_H_
#define _ZCOMP_H_
-#include <linux/mutex.h>
-
struct zcomp_strm {
/* compression/decompression buffer */
void *buffer;
@@ -21,8 +19,6 @@
* working memory)
*/
void *private;
- /* used in multi stream backend, protected by backend strm_lock */
- struct list_head list;
};
/* static compression backend */
@@ -41,19 +37,15 @@
/* dynamic per-device compression frontend */
struct zcomp {
- void *stream;
+ struct zcomp_strm * __percpu *stream;
struct zcomp_backend *backend;
-
- struct zcomp_strm *(*strm_find)(struct zcomp *comp);
- void (*strm_release)(struct zcomp *comp, struct zcomp_strm *zstrm);
- bool (*set_max_streams)(struct zcomp *comp, int num_strm);
- void (*destroy)(struct zcomp *comp);
+ struct notifier_block notifier;
};
ssize_t zcomp_available_show(const char *comp, char *buf);
bool zcomp_available_algorithm(const char *comp);
-struct zcomp *zcomp_create(const char *comp, int max_strm);
+struct zcomp *zcomp_create(const char *comp);
void zcomp_destroy(struct zcomp *comp);
struct zcomp_strm *zcomp_strm_find(struct zcomp *comp);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 370c2f7..8fcad8b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -304,46 +304,25 @@
return len;
}
+/*
+ * We switched to per-cpu streams and this attr is not needed anymore.
+ * However, we will keep it around for some time, because:
+ * a) we may revert per-cpu streams in the future
+ * b) it's visible to user space and we need to follow our 2-year
+ * retirement rule; but we already have a number of 'soon to be
+ * altered' attrs, so max_comp_streams needs to wait for the next
+ * layoff cycle.
+ */
static ssize_t max_comp_streams_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- int val;
- struct zram *zram = dev_to_zram(dev);
-
- down_read(&zram->init_lock);
- val = zram->max_comp_streams;
- up_read(&zram->init_lock);
-
- return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+ return scnprintf(buf, PAGE_SIZE, "%d\n", num_online_cpus());
}
static ssize_t max_comp_streams_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- int num;
- struct zram *zram = dev_to_zram(dev);
- int ret;
-
- ret = kstrtoint(buf, 0, &num);
- if (ret < 0)
- return ret;
- if (num < 1)
- return -EINVAL;
-
- down_write(&zram->init_lock);
- if (init_done(zram)) {
- if (!zcomp_set_max_streams(zram->comp, num)) {
- pr_info("Cannot change max compression streams\n");
- ret = -EINVAL;
- goto out;
- }
- }
-
- zram->max_comp_streams = num;
- ret = len;
-out:
- up_write(&zram->init_lock);
- return ret;
+ return len;
}
static ssize_t comp_algorithm_show(struct device *dev,
@@ -456,8 +435,26 @@
return ret;
}
+static ssize_t debug_stat_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int version = 1;
+ struct zram *zram = dev_to_zram(dev);
+ ssize_t ret;
+
+ down_read(&zram->init_lock);
+ ret = scnprintf(buf, PAGE_SIZE,
+ "version: %d\n%8llu\n",
+ version,
+ (u64)atomic64_read(&zram->stats.writestall));
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+
static DEVICE_ATTR_RO(io_stat);
static DEVICE_ATTR_RO(mm_stat);
+static DEVICE_ATTR_RO(debug_stat);
ZRAM_ATTR_RO(num_reads);
ZRAM_ATTR_RO(num_writes);
ZRAM_ATTR_RO(failed_reads);
@@ -514,7 +511,7 @@
goto out_error;
}
- meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO | __GFP_HIGHMEM);
+ meta->mem_pool = zs_create_pool(pool_name);
if (!meta->mem_pool) {
pr_err("Error creating memory pool\n");
goto out_error;
@@ -650,7 +647,7 @@
{
int ret = 0;
size_t clen;
- unsigned long handle;
+ unsigned long handle = 0;
struct page *page;
unsigned char *user_mem, *cmem, *src, *uncmem = NULL;
struct zram_meta *meta = zram->meta;
@@ -673,9 +670,8 @@
goto out;
}
- zstrm = zcomp_strm_find(zram->comp);
+compress_again:
user_mem = kmap_atomic(page);
-
if (is_partial_io(bvec)) {
memcpy(uncmem + offset, user_mem + bvec->bv_offset,
bvec->bv_len);
@@ -699,6 +695,7 @@
goto out;
}
+ zstrm = zcomp_strm_find(zram->comp);
ret = zcomp_compress(zram->comp, zstrm, uncmem, &clen);
if (!is_partial_io(bvec)) {
kunmap_atomic(user_mem);
@@ -710,6 +707,7 @@
pr_err("Compression failed! err=%d\n", ret);
goto out;
}
+
src = zstrm->buffer;
if (unlikely(clen > max_zpage_size)) {
clen = PAGE_SIZE;
@@ -717,8 +715,35 @@
src = uncmem;
}
- handle = zs_malloc(meta->mem_pool, clen);
+ /*
+ * handle allocation has 2 paths:
+ * a) fast path is executed with preemption disabled (for
+ * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
+ * since we can't sleep;
+ * b) slow path enables preemption and attempts to allocate
+ * the page with __GFP_DIRECT_RECLAIM bit set. We have to
+ * put the per-cpu compression stream and, thus, re-do
+ * the compression once the handle is allocated.
+ *
+ * If we have a 'non-null' handle here then we are coming
+ * from the slow path and the handle has already been allocated.
+ */
+ if (!handle)
+ handle = zs_malloc(meta->mem_pool, clen,
+ __GFP_KSWAPD_RECLAIM |
+ __GFP_NOWARN |
+ __GFP_HIGHMEM);
if (!handle) {
+ zcomp_strm_release(zram->comp, zstrm);
+ zstrm = NULL;
+
+ atomic64_inc(&zram->stats.writestall);
+
+ handle = zs_malloc(meta->mem_pool, clen,
+ GFP_NOIO | __GFP_HIGHMEM);
+ if (handle)
+ goto compress_again;
+
pr_err("Error allocating memory for compressed page: %u, size=%zu\n",
index, clen);
ret = -ENOMEM;
@@ -1009,7 +1034,6 @@
/* Reset stats */
memset(&zram->stats, 0, sizeof(zram->stats));
zram->disksize = 0;
- zram->max_comp_streams = 1;
set_capacity(zram->disk, 0);
part_stat_set_all(&zram->disk->part0, 0);
@@ -1038,7 +1062,7 @@
if (!meta)
return -ENOMEM;
- comp = zcomp_create(zram->compressor, zram->max_comp_streams);
+ comp = zcomp_create(zram->compressor);
if (IS_ERR(comp)) {
pr_err("Cannot initialise %s compressing backend\n",
zram->compressor);
@@ -1177,6 +1201,7 @@
&dev_attr_comp_algorithm.attr,
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
+ &dev_attr_debug_stat.attr,
NULL,
};
@@ -1273,7 +1298,6 @@
}
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
zram->meta = NULL;
- zram->max_comp_streams = 1;
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 8e92339..3f5bf66 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -85,6 +85,7 @@
atomic64_t zero_pages; /* no. of zero filled pages */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
+ atomic64_t writestall; /* no. of write slow paths */
};
struct zram_meta {
@@ -102,7 +103,6 @@
* the number of pages zram can consume for storing compressed data
*/
unsigned long limit_pages;
- int max_comp_streams;
struct zram_stats stats;
atomic_t refcount; /* refcount for zram_meta */
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b583e53..0158d3b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
#include <linux/irq.h>
#include <linux/syscalls.h>
#include <linux/completion.h>
+#include <linux/uuid.h>
#include <asm/processor.h>
#include <asm/uaccess.h>
@@ -1621,26 +1622,6 @@
return urandom_read(NULL, buf, count, NULL);
}
-/***************************************************************
- * Random UUID interface
- *
- * Used here for a Boot ID, but can be useful for other kernel
- * drivers.
- ***************************************************************/
-
-/*
- * Generate random UUID
- */
-void generate_random_uuid(unsigned char uuid_out[16])
-{
- get_random_bytes(uuid_out, 16);
- /* Set UUID version to 4 --- truly random generation */
- uuid_out[6] = (uuid_out[6] & 0x0F) | 0x40;
- /* Set the UUID variant to DCE */
- uuid_out[8] = (uuid_out[8] & 0x3F) | 0x80;
-}
-EXPORT_SYMBOL(generate_random_uuid);
-
/********************************************************************
*
* Sysctl interface
diff --git a/drivers/hwspinlock/hwspinlock_core.c b/drivers/hwspinlock/hwspinlock_core.c
index d50c701..4074441 100644
--- a/drivers/hwspinlock/hwspinlock_core.c
+++ b/drivers/hwspinlock/hwspinlock_core.c
@@ -313,7 +313,7 @@
hwlock = radix_tree_deref_slot(slot);
if (unlikely(!hwlock))
continue;
- if (radix_tree_is_indirect_ptr(hwlock)) {
+ if (radix_tree_deref_retry(hwlock)) {
slot = radix_tree_iter_retry(&iter);
continue;
}
diff --git a/drivers/platform/x86/wmi.c b/drivers/platform/x86/wmi.c
index eb391a2..ceeb8c1 100644
--- a/drivers/platform/x86/wmi.c
+++ b/drivers/platform/x86/wmi.c
@@ -37,6 +37,7 @@
#include <linux/acpi.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/uuid.h>
ACPI_MODULE_NAME("wmi");
MODULE_AUTHOR("Carlos Corbacho");
@@ -115,100 +116,21 @@
* GUID parsing functions
*/
-/**
- * wmi_parse_hexbyte - Convert a ASCII hex number to a byte
- * @src: Pointer to at least 2 characters to convert.
- *
- * Convert a two character ASCII hex string to a number.
- *
- * Return: 0-255 Success, the byte was parsed correctly
- * -1 Error, an invalid character was supplied
- */
-static int wmi_parse_hexbyte(const u8 *src)
-{
- int h;
- int value;
-
- /* high part */
- h = value = hex_to_bin(src[0]);
- if (value < 0)
- return -1;
-
- /* low part */
- value = hex_to_bin(src[1]);
- if (value >= 0)
- return (h << 4) | value;
- return -1;
-}
-
-/**
- * wmi_swap_bytes - Rearrange GUID bytes to match GUID binary
- * @src: Memory block holding binary GUID (16 bytes)
- * @dest: Memory block to hold byte swapped binary GUID (16 bytes)
- *
- * Byte swap a binary GUID to match it's real GUID value
- */
-static void wmi_swap_bytes(u8 *src, u8 *dest)
-{
- int i;
-
- for (i = 0; i <= 3; i++)
- memcpy(dest + i, src + (3 - i), 1);
-
- for (i = 0; i <= 1; i++)
- memcpy(dest + 4 + i, src + (5 - i), 1);
-
- for (i = 0; i <= 1; i++)
- memcpy(dest + 6 + i, src + (7 - i), 1);
-
- memcpy(dest + 8, src + 8, 8);
-}
-
-/**
- * wmi_parse_guid - Convert GUID from ASCII to binary
- * @src: 36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba
- * @dest: Memory block to hold binary GUID (16 bytes)
- *
- * N.B. The GUID need not be NULL terminated.
- *
- * Return: 'true' @dest contains binary GUID
- * 'false' @dest contents are undefined
- */
-static bool wmi_parse_guid(const u8 *src, u8 *dest)
-{
- static const int size[] = { 4, 2, 2, 2, 6 };
- int i, j, v;
-
- if (src[8] != '-' || src[13] != '-' ||
- src[18] != '-' || src[23] != '-')
- return false;
-
- for (j = 0; j < 5; j++, src++) {
- for (i = 0; i < size[j]; i++, src += 2, *dest++ = v) {
- v = wmi_parse_hexbyte(src);
- if (v < 0)
- return false;
- }
- }
-
- return true;
-}
-
static bool find_guid(const char *guid_string, struct wmi_block **out)
{
- char tmp[16], guid_input[16];
+ uuid_le guid_input;
struct wmi_block *wblock;
struct guid_block *block;
struct list_head *p;
- wmi_parse_guid(guid_string, tmp);
- wmi_swap_bytes(tmp, guid_input);
+ if (uuid_le_to_bin(guid_string, &guid_input))
+ return false;
list_for_each(p, &wmi_block_list) {
wblock = list_entry(p, struct wmi_block, list);
block = &wblock->gblock;
- if (memcmp(block->guid, guid_input, 16) == 0) {
+ if (memcmp(block->guid, &guid_input, 16) == 0) {
if (out)
*out = wblock;
return true;
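uuid_le_to_bin() (introduced alongside these conversions) replaces the open-coded hex parsing and byte swapping: it validates the textual GUID and emits the little-endian binary layout that ACPI/WMI blocks store. A minimal sketch of the convert-and-compare step on its own, with illustrative names:

    #include <linux/types.h>
    #include <linux/string.h>
    #include <linux/uuid.h>

    /* Illustrative: does a textual GUID match a 16-byte binary GUID? */
    static bool example_guid_matches(const char *guid_string, const u8 bin[16])
    {
            uuid_le u;

            /* returns 0 on success, -EINVAL for a malformed string */
            if (uuid_le_to_bin(guid_string, &u))
                    return false;

            return memcmp(&u, bin, sizeof(u)) == 0;
    }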
@@ -498,20 +420,20 @@
{
struct wmi_block *block;
acpi_status status = AE_NOT_EXIST;
- char tmp[16], guid_input[16];
+ uuid_le guid_input;
struct list_head *p;
if (!guid || !handler)
return AE_BAD_PARAMETER;
- wmi_parse_guid(guid, tmp);
- wmi_swap_bytes(tmp, guid_input);
+ if (uuid_le_to_bin(guid, &guid_input))
+ return AE_BAD_PARAMETER;
list_for_each(p, &wmi_block_list) {
acpi_status wmi_status;
block = list_entry(p, struct wmi_block, list);
- if (memcmp(block->gblock.guid, guid_input, 16) == 0) {
+ if (memcmp(block->gblock.guid, &guid_input, 16) == 0) {
if (block->handler &&
block->handler != wmi_notify_debug)
return AE_ALREADY_ACQUIRED;
@@ -539,20 +461,20 @@
{
struct wmi_block *block;
acpi_status status = AE_NOT_EXIST;
- char tmp[16], guid_input[16];
+ uuid_le guid_input;
struct list_head *p;
if (!guid)
return AE_BAD_PARAMETER;
- wmi_parse_guid(guid, tmp);
- wmi_swap_bytes(tmp, guid_input);
+ if (uuid_le_to_bin(guid, &guid_input))
+ return AE_BAD_PARAMETER;
list_for_each(p, &wmi_block_list) {
acpi_status wmi_status;
block = list_entry(p, struct wmi_block, list);
- if (memcmp(block->gblock.guid, guid_input, 16) == 0) {
+ if (memcmp(block->gblock.guid, &guid_input, 16) == 0) {
if (!block->handler ||
block->handler == wmi_notify_debug)
return AE_NULL_ENTRY;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bd0f45f..bfb80da 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -20,13 +20,13 @@
#include <linux/slab.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>
-#include <linux/random.h>
#include <linux/iocontext.h>
#include <linux/capability.h>
#include <linux/ratelimit.h>
#include <linux/kthread.h>
#include <linux/raid/pq.h>
#include <linux/semaphore.h>
+#include <linux/uuid.h>
#include <asm/div64.h>
#include "ctree.h"
#include "extent_map.h"
diff --git a/fs/dax.c b/fs/dax.c
index 0dbe4e0..a345c16 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
#include <linux/pfn_t.h>
#include <linux/sizes.h>
+#define RADIX_DAX_MASK 0xf
+#define RADIX_DAX_SHIFT 4
+#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+ RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
{
struct request_queue *q = bdev->bd_queue;
diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c
index e2ab6d0..1d73fc6 100644
--- a/fs/efivarfs/inode.c
+++ b/fs/efivarfs/inode.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/ctype.h>
#include <linux/slab.h>
+#include <linux/uuid.h>
#include "internal.h"
@@ -46,11 +47,7 @@
*/
bool efivarfs_valid_name(const char *str, int len)
{
- static const char dashes[EFI_VARIABLE_GUID_LEN] = {
- [8] = 1, [13] = 1, [18] = 1, [23] = 1
- };
const char *s = str + len - EFI_VARIABLE_GUID_LEN;
- int i;
/*
* We need a GUID, plus at least one letter for the variable name,
@@ -68,37 +65,7 @@
*
* 12345678-1234-1234-1234-123456789abc
*/
- for (i = 0; i < EFI_VARIABLE_GUID_LEN; i++) {
- if (dashes[i]) {
- if (*s++ != '-')
- return false;
- } else {
- if (!isxdigit(*s++))
- return false;
- }
- }
-
- return true;
-}
-
-static void efivarfs_hex_to_guid(const char *str, efi_guid_t *guid)
-{
- guid->b[0] = hex_to_bin(str[6]) << 4 | hex_to_bin(str[7]);
- guid->b[1] = hex_to_bin(str[4]) << 4 | hex_to_bin(str[5]);
- guid->b[2] = hex_to_bin(str[2]) << 4 | hex_to_bin(str[3]);
- guid->b[3] = hex_to_bin(str[0]) << 4 | hex_to_bin(str[1]);
- guid->b[4] = hex_to_bin(str[11]) << 4 | hex_to_bin(str[12]);
- guid->b[5] = hex_to_bin(str[9]) << 4 | hex_to_bin(str[10]);
- guid->b[6] = hex_to_bin(str[16]) << 4 | hex_to_bin(str[17]);
- guid->b[7] = hex_to_bin(str[14]) << 4 | hex_to_bin(str[15]);
- guid->b[8] = hex_to_bin(str[19]) << 4 | hex_to_bin(str[20]);
- guid->b[9] = hex_to_bin(str[21]) << 4 | hex_to_bin(str[22]);
- guid->b[10] = hex_to_bin(str[24]) << 4 | hex_to_bin(str[25]);
- guid->b[11] = hex_to_bin(str[26]) << 4 | hex_to_bin(str[27]);
- guid->b[12] = hex_to_bin(str[28]) << 4 | hex_to_bin(str[29]);
- guid->b[13] = hex_to_bin(str[30]) << 4 | hex_to_bin(str[31]);
- guid->b[14] = hex_to_bin(str[32]) << 4 | hex_to_bin(str[33]);
- guid->b[15] = hex_to_bin(str[34]) << 4 | hex_to_bin(str[35]);
+ return uuid_is_valid(s);
}
static int efivarfs_create(struct inode *dir, struct dentry *dentry,
@@ -119,8 +86,7 @@
/* length of the variable name itself: remove GUID and separator */
namelen = dentry->d_name.len - EFI_VARIABLE_GUID_LEN - 1;
- efivarfs_hex_to_guid(dentry->d_name.name + namelen + 1,
- &var->var.VendorGuid);
+ uuid_le_to_bin(dentry->d_name.name + namelen + 1, &var->var.VendorGuid);
if (efivar_variable_is_removable(var->var.VendorGuid,
dentry->d_name.name, namelen))
diff --git a/fs/efs/super.c b/fs/efs/super.c
index cb68dac..368f7dd 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -275,7 +275,7 @@
if (!bh) {
pr_err("cannot read volume header\n");
- return -EINVAL;
+ return -EIO;
}
/*
@@ -293,7 +293,7 @@
bh = sb_bread(s, sb->fs_start + EFS_SUPER);
if (!bh) {
pr_err("cannot read superblock\n");
- return -EINVAL;
+ return -EIO;
}
if (efs_validate_super(sb, (struct efs_super *) bh->b_data)) {
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index eae5917..7497f50 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -13,8 +13,8 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/file.h>
-#include <linux/random.h>
#include <linux/quotaops.h>
+#include <linux/uuid.h>
#include <asm/uaccess.h>
#include "ext4_jbd2.h"
#include "ext4.h"
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index eb9d027..c6b1495 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -20,7 +20,7 @@
#include <linux/uaccess.h>
#include <linux/mount.h>
#include <linux/pagevec.h>
-#include <linux/random.h>
+#include <linux/uuid.h>
#include "f2fs.h"
#include "node.h"
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 592cea5..989a2ce 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -931,7 +931,8 @@
* This is WB_SYNC_NONE writeback, so if allocation fails just
* wakeup the thread for old dirty data writeback
*/
- work = kzalloc(sizeof(*work), GFP_ATOMIC);
+ work = kzalloc(sizeof(*work),
+ GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (!work) {
trace_writeback_nowork(wb);
wb_wakeup(wb);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..88c7de1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
#include <linux/tracehook.h>
#include <linux/string_helpers.h>
#include <linux/user_namespace.h>
+#include <linux/fs_struct.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
@@ -139,12 +140,25 @@
return task_state_array[fls(state)];
}
+static inline int get_task_umask(struct task_struct *tsk)
+{
+ struct fs_struct *fs;
+ int umask = -ENOENT;
+
+ task_lock(tsk);
+ fs = tsk->fs;
+ if (fs)
+ umask = fs->umask;
+ task_unlock(tsk);
+ return umask;
+}
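get_task_umask() feeds the new "Umask:" field that task_state() prints a few lines further down, so a process umask becomes visible in /proc/<pid>/status without having to call umask(2) twice. A hedged user-space sketch of reading it back (the helper name is illustrative):

    #include <stdio.h>

    /* Illustrative: fetch the new Umask: field from /proc/self/status. */
    static int read_proc_umask(void)
    {
            char line[256];
            unsigned int mode;
            FILE *f = fopen("/proc/self/status", "r");

            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f)) {
                    if (sscanf(line, "Umask: %o", &mode) == 1) {
                            fclose(f);
                            return (int)mode;
                    }
            }
            fclose(f);
            return -1;      /* kernel too old to emit the field */
    }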
+
static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *p)
{
struct user_namespace *user_ns = seq_user_ns(m);
struct group_info *group_info;
- int g;
+ int g, umask;
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@
ngid = task_numa_group_id(p);
cred = get_task_cred(p);
+ umask = get_task_umask(p);
+ if (umask >= 0)
+ seq_printf(m, "Umask:\t%#04o\n", umask);
+
task_lock(p);
if (p->files)
max_fds = files_fdtable(p->files)->max_fds;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ff4527d..a11eb71 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3163,6 +3163,44 @@
}
/*
+ * proc_tid_comm_permission is a special permission function exclusively
+ * used for the node /proc/<pid>/task/<tid>/comm.
+ * It bypasses generic permission checks in the case where a task of the same
+ * task group attempts to access the node.
+ * The rationale behind this is that glibc and bionic access this node for
+ * cross thread naming (pthread_set/getname_np(!self)). However, if
+ * PR_SET_DUMPABLE gets set to 0 this node among others becomes uid=0 gid=0,
+ * which locks out the cross thread naming implementation.
+ * This function makes sure that the node is always accessible for members of
+ * the same thread group.
+ */
+static int proc_tid_comm_permission(struct inode *inode, int mask)
+{
+ bool is_same_tgroup;
+ struct task_struct *task;
+
+ task = get_proc_task(inode);
+ if (!task)
+ return -ESRCH;
+ is_same_tgroup = same_thread_group(current, task);
+ put_task_struct(task);
+
+ if (likely(is_same_tgroup && !(mask & MAY_EXEC))) {
+ /* This file (/proc/<pid>/task/<tid>/comm) can always be
+ * read or written by the members of the corresponding
+ * thread group.
+ */
+ return 0;
+ }
+
+ return generic_permission(inode, mask);
+}
+
+static const struct inode_operations proc_tid_comm_inode_operations = {
+ .permission = proc_tid_comm_permission,
+};
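In user space the visible effect is that cross-thread renaming keeps working after dumpability is dropped, because glibc's pthread_setname_np() for a non-self thread writes /proc/self/task/<tid>/comm. A hedged illustration (ordinary pthread code, nothing here is part of the patch):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
            sleep(2);       /* keep the thread alive long enough to rename it */
            return NULL;
    }

    int main(void)
    {
            pthread_t tid;

            pthread_create(&tid, NULL, worker, NULL);
            /* /proc/<pid>/task/<tid>/* now becomes uid=0 gid=0 */
            prctl(PR_SET_DUMPABLE, 0);
            /* still succeeds for a same-thread-group caller thanks to the hook above */
            pthread_setname_np(tid, "renamed");
            pthread_join(tid, NULL);
            return 0;
    }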
+
+/*
* Tasks
*/
static const struct pid_entry tid_base_stuff[] = {
@@ -3180,7 +3218,9 @@
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
- REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
+ NOD("comm", S_IFREG|S_IRUGO|S_IWUSR,
+ &proc_tid_comm_inode_operations,
+ &proc_pid_set_comm_operations, {}),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall", S_IRUSR, proc_pid_syscall),
#endif
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index a586467..be3ddd1 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -211,14 +211,11 @@
struct page **pages = NULL, **ptr, *page;
loff_t isize;
- if (!(flags & MAP_SHARED))
- return addr;
-
/* the mapping mustn't extend beyond the EOF */
lpages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
isize = i_size_read(inode);
- ret = -EINVAL;
+ ret = -ENOSYS;
maxpages = (isize + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (pgoff >= maxpages)
goto out;
@@ -227,7 +224,6 @@
goto out;
/* gang-find the pages */
- ret = -ENOMEM;
pages = kcalloc(lpages, sizeof(struct page *), GFP_KERNEL);
if (!pages)
goto out_free;
@@ -263,7 +259,7 @@
*/
static int ramfs_nommu_mmap(struct file *file, struct vm_area_struct *vma)
{
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)))
return -ENOSYS;
file_accessed(file);
diff --git a/fs/reiserfs/objectid.c b/fs/reiserfs/objectid.c
index 99a5d5d..415d66c 100644
--- a/fs/reiserfs/objectid.c
+++ b/fs/reiserfs/objectid.c
@@ -3,8 +3,8 @@
*/
#include <linux/string.h>
-#include <linux/random.h>
#include <linux/time.h>
+#include <linux/uuid.h>
#include "reiserfs.h"
/* find where objectid map starts */
diff --git a/fs/ubifs/sb.c b/fs/ubifs/sb.c
index f4fbc7b..3cbb904 100644
--- a/fs/ubifs/sb.c
+++ b/fs/ubifs/sb.c
@@ -28,8 +28,8 @@
#include "ubifs.h"
#include <linux/slab.h>
-#include <linux/random.h>
#include <linux/math64.h>
+#include <linux/uuid.h>
/*
* Default journal size in logical eraseblocks as a percent of total
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 66cdb44..2d97952 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -137,7 +137,7 @@
VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
- mmput(ctx->mm);
+ mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
}
}
@@ -434,6 +434,9 @@
ACCESS_ONCE(ctx->released) = true;
+ if (!mmget_not_zero(mm))
+ goto wakeup;
+
/*
* Flush page faults out of all CPUs. NOTE: all page faults
* must be retried without returning VM_FAULT_SIGBUS if
@@ -466,7 +469,8 @@
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
}
up_write(&mm->mmap_sem);
-
+ mmput(mm);
+wakeup:
/*
* After no new page faults can wait on this fault_*wqh, flush
* the last page faults that may have been already waiting on
@@ -760,10 +764,12 @@
start = uffdio_register.range.start;
end = start + uffdio_register.range.len;
+ ret = -ENOMEM;
+ if (!mmget_not_zero(mm))
+ goto out;
+
down_write(&mm->mmap_sem);
vma = find_vma_prev(mm, start, &prev);
-
- ret = -ENOMEM;
if (!vma)
goto out_unlock;
@@ -864,6 +870,7 @@
} while (vma && vma->vm_start < end);
out_unlock:
up_write(&mm->mmap_sem);
+ mmput(mm);
if (!ret) {
/*
* Now that we scanned all vmas we can already tell
@@ -902,10 +909,12 @@
start = uffdio_unregister.start;
end = start + uffdio_unregister.len;
+ ret = -ENOMEM;
+ if (!mmget_not_zero(mm))
+ goto out;
+
down_write(&mm->mmap_sem);
vma = find_vma_prev(mm, start, &prev);
-
- ret = -ENOMEM;
if (!vma)
goto out_unlock;
@@ -998,6 +1007,7 @@
} while (vma && vma->vm_start < end);
out_unlock:
up_write(&mm->mmap_sem);
+ mmput(mm);
out:
return ret;
}
@@ -1067,9 +1077,11 @@
goto out;
if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
goto out;
-
- ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
- uffdio_copy.len);
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+ uffdio_copy.len);
+ mmput(ctx->mm);
+ }
if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
return -EFAULT;
if (ret < 0)
@@ -1110,8 +1122,11 @@
if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
goto out;
- ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
- uffdio_zeropage.range.len);
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+ uffdio_zeropage.range.len);
+ mmput(ctx->mm);
+ }
if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
return -EFAULT;
if (ret < 0)
@@ -1289,12 +1304,12 @@
ctx->released = false;
ctx->mm = current->mm;
/* prevent the mm struct from being freed */
- atomic_inc(&ctx->mm->mm_users);
+ atomic_inc(&ctx->mm->mm_count);
file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
if (IS_ERR(file)) {
- mmput(ctx->mm);
+ mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
}
out:
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 242b660..a58c852 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,46 @@
#define _LINUX_COMPACTION_H
/* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED 0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE 4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE 5
-#define COMPACT_NOT_SUITABLE_ZONE 6
-#define COMPACT_CONTENDED 7
/* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NOT_SUITABLE_ZONE,
+ /*
+ * compaction didn't start as it was not possible or direct reclaim
+ * was more suitable
+ */
+ COMPACT_SKIPPED,
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+
+ /* compaction not active last round */
+ COMPACT_INACTIVE = COMPACT_DEFERRED,
+
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NO_SUITABLE_PAGE,
+ /* compaction should continue to another pageblock */
+ COMPACT_CONTINUE,
+
+ /*
+ * The full zone was scanned but compaction was not successful
+ * in producing suitable pages.
+ */
+ COMPACT_COMPLETE,
+ /*
+ * direct compaction has scanned part of the zone but was not
+ * successful in producing suitable pages.
+ */
+ COMPACT_PARTIAL_SKIPPED,
+
+ /* compaction terminated prematurely due to lock contentions */
+ COMPACT_CONTENDED,
+
+ /*
+ * direct compaction partially compacted a zone and there might be
+ * suitable pages
+ */
+ COMPACT_PARTIAL,
+};
/* Used to signal whether compaction detected need_sched() or lock contention */
/* No contention detected */
@@ -38,12 +63,13 @@
extern int sysctl_compact_unevictable_allowed;
extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+ unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int classzone_idx);
extern void defer_compaction(struct zone *zone, int order);
@@ -52,12 +78,80 @@
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);
+/* Compaction has made some progress and retrying makes sense */
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ /*
+ * Even though this might sound confusing, this in fact tells us
+ * that compaction successfully isolated and migrated some
+ * pageblocks.
+ */
+ if (result == COMPACT_PARTIAL)
+ return true;
+
+ return false;
+}
+
+/* Compaction has failed and it doesn't make much sense to keep retrying. */
+static inline bool compaction_failed(enum compact_result result)
+{
+ /* All zones were scanned completely and still no result. */
+ if (result == COMPACT_COMPLETE)
+ return true;
+
+ return false;
+}
+
+/*
+ * Compaction has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ /*
+ * Compaction backed off due to watermark checks for order-0
+ * so the regular reclaim has to try harder and reclaim something.
+ */
+ if (result == COMPACT_SKIPPED)
+ return true;
+
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (result == COMPACT_DEFERRED)
+ return true;
+
+ /*
+ * If compaction in async mode encounters contention or blocks higher
+ * priority task we back off early rather than cause stalls.
+ */
+ if (result == COMPACT_CONTENDED)
+ return true;
+
+ /*
+ * Page scanners have met but we haven't scanned the full zone,
+ * so this is in fact a back off.
+ */
+ if (result == COMPACT_PARTIAL_SKIPPED)
+ return true;
+
+ return false;
+}
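Together the three predicates partition the enum into "worked", "hopeless" and "backed off, maybe retry". A hedged sketch of how a retry decision might combine them (this is not the allocator's actual retry logic, only an illustration of the intended split):

    /* Illustrative decision helper built only on the predicates above. */
    static bool example_should_retry_compaction(enum compact_result result,
                                                int retries, int max_retries)
    {
            if (compaction_made_progress(result))
                    return retries < max_retries;   /* it is working, keep going */

            if (compaction_failed(result))
                    return false;                   /* full scan, nothing more to gain */

            /* deferred, skipped or contended: backed off, a retry may still help */
            return compaction_withdrawn(result) && retries < max_retries;
    }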
+
+
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags);
+
extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
#else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
@@ -73,7 +167,7 @@
{
}
-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
return COMPACT_SKIPPED;
@@ -88,6 +182,21 @@
return true;
}
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_failed(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ return true;
+}
+
static inline int kcompactd_run(int nid)
{
return 0;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index df7acb5..c2db3ca 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -21,6 +21,7 @@
#include <linux/pfn.h>
#include <linux/pstore.h>
#include <linux/reboot.h>
+#include <linux/uuid.h>
#include <linux/screen_info.h>
#include <asm/page.h>
@@ -44,17 +45,10 @@
typedef u64 efi_physical_addr_t;
typedef void *efi_handle_t;
-
-typedef struct {
- u8 b[16];
-} efi_guid_t;
+typedef uuid_le efi_guid_t;
#define EFI_GUID(a,b,c,d0,d1,d2,d3,d4,d5,d6,d7) \
-((efi_guid_t) \
-{{ (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
- (b) & 0xff, ((b) >> 8) & 0xff, \
- (c) & 0xff, ((c) >> 8) & 0xff, \
- (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})
+ UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)
/*
* Generic EFI table header
@@ -1117,7 +1111,7 @@
* Length of a GUID string (strlen("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"))
* not including trailing NUL
*/
-#define EFI_VARIABLE_GUID_LEN 36
+#define EFI_VARIABLE_GUID_LEN UUID_STRING_LEN
/*
* The type of search to perform when calling boottime->locate_handle
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c70676..359a8e4 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -14,6 +14,7 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/percpu-refcount.h>
+#include <linux/uuid.h>
#ifdef CONFIG_BLOCK
@@ -93,7 +94,7 @@
* Enough for the string representation of any kind of UUID plus NULL.
* EFI UUID is 36 characters. MSDOS UUID is 11 characters.
*/
-#define PARTITION_META_INFO_UUIDLTH 37
+#define PARTITION_META_INFO_UUIDLTH (UUID_STRING_LEN + 1)
struct partition_meta_info {
char uuid[PARTITION_META_INFO_UUIDLTH];
@@ -228,27 +229,9 @@
return NULL;
}
-static inline void part_pack_uuid(const u8 *uuid_str, u8 *to)
-{
- int i;
- for (i = 0; i < 16; ++i) {
- *to++ = (hex_to_bin(*uuid_str) << 4) |
- (hex_to_bin(*(uuid_str + 1)));
- uuid_str += 2;
- switch (i) {
- case 3:
- case 5:
- case 7:
- case 9:
- uuid_str++;
- continue;
- }
- }
-}
-
static inline int blk_part_pack_uuid(const u8 *uuid_str, u8 *to)
{
- part_pack_uuid(uuid_str, to);
+ uuid_be_to_bin(uuid_str, (uuid_be *)to);
return 0;
}
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index dfd59d6..c683996 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -61,6 +61,7 @@
#define nmi_enter() \
do { \
+ printk_nmi_enter(); \
lockdep_off(); \
ftrace_nmi_enter(); \
BUG_ON(in_nmi()); \
@@ -77,6 +78,7 @@
preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \
ftrace_nmi_exit(); \
lockdep_on(); \
+ printk_nmi_exit(); \
} while (0)
#endif /* LINUX_HARDIRQ_H */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e44c578..c26d463 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -353,9 +353,7 @@
static inline struct hstate *hstate_inode(struct inode *i)
{
- struct hugetlbfs_sb_info *hsb;
- hsb = HUGETLBFS_SB(i->i_sb);
- return hsb->hstate;
+ return HUGETLBFS_SB(i->i_sb)->hstate;
}
static inline struct hstate *hstate_file(struct file *f)
@@ -454,12 +452,12 @@
extern void dissolve_free_huge_pages(unsigned long start_pfn,
unsigned long end_pfn);
-static inline int hugepage_migration_supported(struct hstate *h)
+static inline bool hugepage_migration_supported(struct hstate *h)
{
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
return huge_page_shift(h) == PMD_SHIFT;
#else
- return 0;
+ return false;
#endif
}
@@ -521,7 +519,7 @@
return page->index;
}
#define dissolve_free_huge_pages(s, e) do {} while (0)
-#define hugepage_migration_supported(h) 0
+#define hugepage_migration_supported(h) false
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
struct mm_struct *mm, pte_t *pte)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 24154c2..063962f 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -93,20 +93,17 @@
struct hugetlb_cgroup *h_cg,
struct page *page)
{
- return;
}
static inline void
hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
{
- return;
}
static inline void
hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
{
- return;
}
static inline void hugetlb_cgroup_file_init(void)
@@ -116,7 +113,6 @@
static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
struct page *newhpage)
{
- return;
}
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
diff --git a/include/linux/kasan-checks.h b/include/linux/kasan-checks.h
new file mode 100644
index 0000000..b7f8ace
--- /dev/null
+++ b/include/linux/kasan-checks.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_KASAN_CHECKS_H
+#define _LINUX_KASAN_CHECKS_H
+
+#ifdef CONFIG_KASAN
+void kasan_check_read(const void *p, unsigned int size);
+void kasan_check_write(const void *p, unsigned int size);
+#else
+static inline void kasan_check_read(const void *p, unsigned int size) { }
+static inline void kasan_check_write(const void *p, unsigned int size) { }
+#endif
+
+#endif
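These two hooks exist so that code KASAN cannot instrument (assembly routines, files built with sanitization disabled, annotated atomics) can still report its accesses. A minimal, hypothetical use — raw_fill() stands in for some uninstrumented routine and is not a real kernel symbol:

    #include <linux/kasan-checks.h>
    #include <linux/types.h>

    /* Hypothetical routine compiled without KASAN instrumentation. */
    void raw_fill(void *dst, int c, size_t len);

    void example_fill(void *dst, int c, size_t len)
    {
            /* report the write to KASAN, since raw_fill() itself is invisible to it */
            kasan_check_write(dst, len);
            raw_fill(dst, c, len);
    }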
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 737371b..611927f 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -50,6 +50,8 @@
void kasan_cache_create(struct kmem_cache *cache, size_t *size,
unsigned long *flags);
+void kasan_cache_shrink(struct kmem_cache *cache);
+void kasan_cache_destroy(struct kmem_cache *cache);
void kasan_poison_slab(struct page *page);
void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -63,7 +65,8 @@
void kasan_krealloc(const void *object, size_t new_size, gfp_t flags);
void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags);
-void kasan_slab_free(struct kmem_cache *s, void *object);
+bool kasan_slab_free(struct kmem_cache *s, void *object);
+void kasan_poison_slab_free(struct kmem_cache *s, void *object);
struct kasan_cache {
int alloc_meta_offset;
@@ -88,6 +91,8 @@
static inline void kasan_cache_create(struct kmem_cache *cache,
size_t *size,
unsigned long *flags) {}
+static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
+static inline void kasan_cache_destroy(struct kmem_cache *cache) {}
static inline void kasan_poison_slab(struct page *page) {}
static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
@@ -105,7 +110,11 @@
static inline void kasan_slab_alloc(struct kmem_cache *s, void *object,
gfp_t flags) {}
-static inline void kasan_slab_free(struct kmem_cache *s, void *object) {}
+static inline bool kasan_slab_free(struct kmem_cache *s, void *object)
+{
+ return false;
+}
+static inline void kasan_poison_slab_free(struct kmem_cache *s, void *object) {}
static inline int kasan_module_alloc(void *addr, size_t size) { return 0; }
static inline void kasan_free_shadow(const struct vm_struct *vm) {}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 94da967..a805474 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -415,25 +415,6 @@
return mz->lru_size[lru];
}
-static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
- unsigned long inactive_ratio;
- unsigned long inactive;
- unsigned long active;
- unsigned long gb;
-
- inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
-
- gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
- inactive_ratio = int_sqrt(10 * gb);
- else
- inactive_ratio = 1;
-
- return inactive * inactive_ratio < active;
-}
-
void mem_cgroup_handle_over_high(void);
void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -646,12 +627,6 @@
return true;
}
-static inline bool
-mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
- return true;
-}
-
static inline unsigned long
mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2b97be1..b530c99 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -475,8 +475,7 @@
static inline int compound_mapcount(struct page *page)
{
- if (!PageCompound(page))
- return 0;
+ VM_BUG_ON_PAGE(!PageCompound(page), page);
page = compound_head(page);
return atomic_read(compound_mapcount_ptr(page)) + 1;
}
@@ -597,7 +596,7 @@
}
void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon);
+ struct page *page, pte_t *pte, bool write, bool anon, bool old);
#endif
/*
@@ -1764,7 +1763,7 @@
extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);
-extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
@@ -2387,6 +2386,9 @@
return false;
page_ext = lookup_page_ext(page);
+ if (unlikely(!page_ext))
+ return false;
+
return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
}
#else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1fda9c9..d553855 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/workqueue.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -513,6 +514,7 @@
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
#endif
+ struct work_struct async_put_work;
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60db20..02069c2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,6 +739,9 @@
extern struct mutex zonelists_mutex;
void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
+bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
+ int classzone_idx, unsigned int alloc_flags,
+ long free_pages);
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx,
unsigned int alloc_flags);
@@ -1060,7 +1063,7 @@
unsigned long *pageblock_flags;
#ifdef CONFIG_PAGE_EXTENSION
/*
- * If !SPARSEMEM, pgdat doesn't have page_ext pointer. We use
+ * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
* section. (see page_ext.h about this.)
*/
struct page_ext *page_ext;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83b9c39..d3f533f 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -110,13 +110,24 @@
static inline bool task_will_free_mem(struct task_struct *task)
{
+ struct signal_struct *sig = task->signal;
+
/*
* A coredumping process may sleep for an extended period in exit_mm(),
* so the oom killer cannot assume that the process will promptly exit
* and release memory.
*/
- return (task->flags & PF_EXITING) &&
- !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
+ if (sig->flags & SIGNAL_GROUP_COREDUMP)
+ return false;
+
+ if (!(task->flags & PF_EXITING))
+ return false;
+
+ /* Make sure that the whole thread group is going down */
+ if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+ return false;
+
+ return true;
}
/* sysctls */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a61e06e..e5a3244 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -479,7 +479,7 @@
}
#endif
-#define PG_head_mask ((1L << PG_head))
+#define PG_head_mask ((1UL << PG_head))
#ifdef CONFIG_HUGETLB_PAGE
int PageHuge(struct page *page);
@@ -670,7 +670,7 @@
}
#ifdef CONFIG_MMU
-#define __PG_MLOCKED (1 << PG_mlocked)
+#define __PG_MLOCKED (1UL << PG_mlocked)
#else
#define __PG_MLOCKED 0
#endif
@@ -680,11 +680,11 @@
* these flags set. If they are, there is a problem.
*/
#define PAGE_FLAGS_CHECK_AT_FREE \
- (1 << PG_lru | 1 << PG_locked | \
- 1 << PG_private | 1 << PG_private_2 | \
- 1 << PG_writeback | 1 << PG_reserved | \
- 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED)
+ (1UL << PG_lru | 1UL << PG_locked | \
+ 1UL << PG_private | 1UL << PG_private_2 | \
+ 1UL << PG_writeback | 1UL << PG_reserved | \
+ 1UL << PG_slab | 1UL << PG_swapcache | 1UL << PG_active | \
+ 1UL << PG_unevictable | __PG_MLOCKED)
/*
* Flags checked when a page is prepped for return by the page allocator.
@@ -695,10 +695,10 @@
* alloc-free cycle to prevent from reusing the page.
*/
#define PAGE_FLAGS_CHECK_AT_PREP \
- (((1 << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+ (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
#define PAGE_FLAGS_PRIVATE \
- (1 << PG_private | 1 << PG_private_2)
+ (1UL << PG_private | 1UL << PG_private_2)
/**
* page_has_private - Determine if page has private stuff
* @page: The page to be checked
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fe1513f..9735410 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -518,33 +518,27 @@
extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
/*
- * Fault a userspace page into pagetables. Return non-zero on a fault.
- *
- * This assumes that two userspace pages are always sufficient.
+ * Fault one or two userspace pages into pagetables.
+ * Return -EINVAL if more than two pages would be needed.
+ * Return non-zero on a fault.
*/
static inline int fault_in_pages_writeable(char __user *uaddr, int size)
{
- int ret;
+ int span, ret;
if (unlikely(size == 0))
return 0;
+ span = offset_in_page(uaddr) + size;
+ if (span > 2 * PAGE_SIZE)
+ return -EINVAL;
/*
* Writing zeroes into userspace here is OK, because we know that if
* the zero gets there, we'll be overwriting it.
*/
ret = __put_user(0, uaddr);
- if (ret == 0) {
- char __user *end = uaddr + size - 1;
-
- /*
- * If the page was already mapped, this will get a cache miss
- * for sure, so try to avoid doing it.
- */
- if (((unsigned long)uaddr & PAGE_MASK) !=
- ((unsigned long)end & PAGE_MASK))
- ret = __put_user(0, end);
- }
+ if (ret == 0 && span > PAGE_SIZE)
+ ret = __put_user(0, uaddr + size - 1);
return ret;
}
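The rewritten helper now refuses ranges that would need more than two pages instead of silently assuming two are enough. A hedged sketch of the typical call site, where a small user buffer is pre-faulted before entering a region in which taking a page fault would be inconvenient (names are illustrative):

    #include <linux/pagemap.h>

    /* Illustrative: make sure a short user buffer is resident up front. */
    static int example_prefault(char __user *ubuf, int len)
    {
            int err = fault_in_pages_writeable(ubuf, len);

            if (err == -EINVAL)     /* buffer spans more than two pages */
                    return -EINVAL;
            if (err)                /* the page(s) could not be faulted in */
                    return -EFAULT;
            return 0;               /* first and last byte are now writeable */
    }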
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 4bc6daf..56939d3 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -129,7 +129,4 @@
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))
-/* To avoid include hell, as printk can not declare this, we declare it here */
-DECLARE_PER_CPU(printk_func_t, printk_func);
-
#endif /* __LINUX_PERCPU_H */
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9ccbdf2..f4da695 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -122,7 +122,19 @@
void early_printk(const char *s, ...) { }
#endif
-typedef __printf(1, 0) int (*printk_func_t)(const char *fmt, va_list args);
+#ifdef CONFIG_PRINTK_NMI
+extern void printk_nmi_init(void);
+extern void printk_nmi_enter(void);
+extern void printk_nmi_exit(void);
+extern void printk_nmi_flush(void);
+extern void printk_nmi_flush_on_panic(void);
+#else
+static inline void printk_nmi_init(void) { }
+static inline void printk_nmi_enter(void) { }
+static inline void printk_nmi_exit(void) { }
+static inline void printk_nmi_flush(void) { }
+static inline void printk_nmi_flush_on_panic(void) { }
+#endif /* PRINTK_NMI */
#ifdef CONFIG_PRINTK
asmlinkage __printf(5, 0)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 51a97ac..cb4b7e8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -29,51 +29,45 @@
#include <linux/rcupdate.h>
/*
- * An indirect pointer (root->rnode pointing to a radix_tree_node, rather
- * than a data item) is signalled by the low bit set in the root->rnode
- * pointer.
+ * The bottom two bits of the slot determine how the remaining bits in the
+ * slot are interpreted:
*
- * In this case root->height is > 0, but the indirect pointer tests are
- * needed for RCU lookups (because root->height is unreliable). The only
- * time callers need worry about this is when doing a lookup_slot under
- * RCU.
+ * 00 - data pointer
+ * 01 - internal entry
+ * 10 - exceptional entry
+ * 11 - locked exceptional entry
*
- * Indirect pointer in fact is also used to tag the last pointer of a node
- * when it is shrunk, before we rcu free the node. See shrink code for
- * details.
+ * The internal entry may be a pointer to the next level in the tree, a
+ * sibling entry, or an indicator that the entry in this slot has been moved
+ * to another location in the tree and the lookup should be restarted. While
+ * NULL fits the 'data pointer' pattern, it means that there is no entry in
+ * the tree for this index (no matter what level of the tree it is found at).
+ * This means that you cannot store NULL in the tree as a value for the index.
*/
-#define RADIX_TREE_INDIRECT_PTR 1
+#define RADIX_TREE_ENTRY_MASK 3UL
+#define RADIX_TREE_INTERNAL_NODE 1UL
+
/*
- * A common use of the radix tree is to store pointers to struct pages;
- * but shmem/tmpfs needs also to store swap entries in the same tree:
- * those are marked as exceptional entries to distinguish them.
+ * Most users of the radix tree store pointers but shmem/tmpfs stores swap
+ * entries in the same tree. They are marked as exceptional entries to
+ * distinguish them from pointers to struct page.
* EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it.
*/
#define RADIX_TREE_EXCEPTIONAL_ENTRY 2
#define RADIX_TREE_EXCEPTIONAL_SHIFT 2
-#define RADIX_DAX_MASK 0xf
-#define RADIX_DAX_SHIFT 4
-#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
- RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
-static inline int radix_tree_is_indirect_ptr(void *ptr)
+static inline bool radix_tree_is_internal_node(void *ptr)
{
- return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
+ return ((unsigned long)ptr & RADIX_TREE_ENTRY_MASK) ==
+ RADIX_TREE_INTERNAL_NODE;
}
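Since the low two bits now fully encode the entry type, a lookup under the RCU read lock can classify whatever it pulled out of a slot before dereferencing it. A small, hypothetical classifier built only on the predicates in this header:

    #include <linux/radix-tree.h>

    enum example_kind { EX_EMPTY, EX_DATA, EX_INTERNAL, EX_EXCEPTIONAL };

    /* Illustrative: classify a raw slot value by its low bits. */
    static enum example_kind example_classify(void *entry)
    {
            if (!entry)
                    return EX_EMPTY;                /* no entry at this index */
            if (radix_tree_is_internal_node(entry))
                    return EX_INTERNAL;             /* node pointer, sibling or retry */
            if (radix_tree_exceptional_entry(entry))
                    return EX_EXCEPTIONAL;          /* e.g. shmem swap or DAX entry */
            return EX_DATA;                         /* ordinary data pointer */
    }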
/*** radix-tree API starts here ***/
#define RADIX_TREE_MAX_TAGS 3
-#ifdef __KERNEL__
+#ifndef RADIX_TREE_MAP_SHIFT
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif
#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
@@ -86,16 +80,13 @@
#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
RADIX_TREE_MAP_SHIFT))
-/* Height component in node->path */
-#define RADIX_TREE_HEIGHT_SHIFT (RADIX_TREE_MAX_PATH + 1)
-#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
-
/* Internally used bits of node->count */
#define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1)
#define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
struct radix_tree_node {
- unsigned int path; /* Offset in parent & height from the bottom */
+ unsigned char shift; /* Bits remaining in each slot */
+ unsigned char offset; /* Slot offset in parent */
unsigned int count;
union {
struct {
@@ -115,13 +106,11 @@
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
- unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node __rcu *rnode;
};
#define RADIX_TREE_INIT(mask) { \
- .height = 0, \
.gfp_mask = (mask), \
.rnode = NULL, \
}
@@ -131,11 +120,15 @@
#define INIT_RADIX_TREE(root, mask) \
do { \
- (root)->height = 0; \
(root)->gfp_mask = (mask); \
(root)->rnode = NULL; \
} while (0)
+static inline bool radix_tree_empty(struct radix_tree_root *root)
+{
+ return root->rnode == NULL;
+}
+
/**
* Radix-tree synchronization
*
@@ -231,7 +224,7 @@
*/
static inline int radix_tree_deref_retry(void *arg)
{
- return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+ return unlikely(radix_tree_is_internal_node(arg));
}
/**
@@ -252,8 +245,7 @@
*/
static inline int radix_tree_exception(void *arg)
{
- return unlikely((unsigned long)arg &
- (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
+ return unlikely((unsigned long)arg & RADIX_TREE_ENTRY_MASK);
}
/**
@@ -266,7 +258,7 @@
*/
static inline void radix_tree_replace_slot(void **pslot, void *item)
{
- BUG_ON(radix_tree_is_indirect_ptr(item));
+ BUG_ON(radix_tree_is_internal_node(item));
rcu_assign_pointer(*pslot, item);
}
@@ -288,9 +280,12 @@
struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
-unsigned int
-radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
- unsigned long first_index, unsigned int max_items);
+struct radix_tree_node *radix_tree_replace_clear_tags(
+ struct radix_tree_root *root,
+ unsigned long index, void *entry);
+unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
+ void **results, unsigned long first_index,
+ unsigned int max_items);
unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
void ***results, unsigned long *indices,
unsigned long first_index, unsigned int max_items);
@@ -327,8 +322,9 @@
* struct radix_tree_iter - radix tree iterator state
*
* @index: index of current slot
- * @next_index: next-to-last index for this chunk
+ * @next_index: one beyond the last index for this chunk
* @tags: bit-mask for tag-iterating
+ * @shift: shift for the node that holds our slots
*
* This radix tree iterator works in terms of "chunks" of slots. A chunk is a
* subinterval of slots contained within one radix tree leaf node. It is
@@ -341,8 +337,20 @@
unsigned long index;
unsigned long next_index;
unsigned long tags;
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ unsigned int shift;
+#endif
};
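The chunked iterator is normally consumed through radix_tree_for_each_slot(), with the deref/retry helpers guarding against nodes that move while the walk runs under RCU — the same pattern the hwspinlock hunk above switches to. A hedged, generic sketch:

    #include <linux/radix-tree.h>
    #include <linux/rcupdate.h>

    /* Illustrative: RCU-safe walk over every present entry in a tree. */
    static void example_walk(struct radix_tree_root *root)
    {
            struct radix_tree_iter iter;
            void **slot;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, root, &iter, 0) {
                    void *entry = radix_tree_deref_slot(slot);

                    if (!entry)
                            continue;
                    if (radix_tree_deref_retry(entry)) {
                            /* the node moved under us; restart this chunk */
                            slot = radix_tree_iter_retry(&iter);
                            continue;
                    }
                    /* ... use entry; iter.index is its index ... */
            }
            rcu_read_unlock();
    }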
+static inline unsigned int iter_shift(struct radix_tree_iter *iter)
+{
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ return iter->shift;
+#else
+ return 0;
+#endif
+}
+
#define RADIX_TREE_ITER_TAG_MASK 0x00FF /* tag index in lower byte */
#define RADIX_TREE_ITER_TAGGED 0x0100 /* lookup tagged slots */
#define RADIX_TREE_ITER_CONTIG 0x0200 /* stop at first hole */
@@ -402,6 +410,12 @@
return NULL;
}
+static inline unsigned long
+__radix_tree_iter_add(struct radix_tree_iter *iter, unsigned long slots)
+{
+ return iter->index + (slots << iter_shift(iter));
+}
+
/**
* radix_tree_iter_next - resume iterating when the chunk may be invalid
* @iter: iterator state
@@ -413,7 +427,7 @@
static inline __must_check
void **radix_tree_iter_next(struct radix_tree_iter *iter)
{
- iter->next_index = iter->index + 1;
+ iter->next_index = __radix_tree_iter_add(iter, 1);
iter->tags = 0;
return NULL;
}
@@ -427,7 +441,12 @@
static __always_inline long
radix_tree_chunk_size(struct radix_tree_iter *iter)
{
- return iter->next_index - iter->index;
+ return (iter->next_index - iter->index) >> iter_shift(iter);
+}
+
+static inline struct radix_tree_node *entry_to_node(void *ptr)
+{
+ return (void *)((unsigned long)ptr & ~RADIX_TREE_INTERNAL_NODE);
}
/**
@@ -445,24 +464,49 @@
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
if (flags & RADIX_TREE_ITER_TAGGED) {
+ void *canon = slot;
+
iter->tags >>= 1;
+ if (unlikely(!iter->tags))
+ return NULL;
+ while (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
+ radix_tree_is_internal_node(slot[1])) {
+ if (entry_to_node(slot[1]) == canon) {
+ iter->tags >>= 1;
+ iter->index = __radix_tree_iter_add(iter, 1);
+ slot++;
+ continue;
+ }
+ iter->next_index = __radix_tree_iter_add(iter, 1);
+ return NULL;
+ }
if (likely(iter->tags & 1ul)) {
- iter->index++;
+ iter->index = __radix_tree_iter_add(iter, 1);
return slot + 1;
}
- if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) {
+ if (!(flags & RADIX_TREE_ITER_CONTIG)) {
unsigned offset = __ffs(iter->tags);
iter->tags >>= offset;
- iter->index += offset + 1;
+ iter->index = __radix_tree_iter_add(iter, offset + 1);
return slot + offset + 1;
}
} else {
- long size = radix_tree_chunk_size(iter);
+ long count = radix_tree_chunk_size(iter);
+ void *canon = slot;
- while (--size > 0) {
+ while (--count > 0) {
slot++;
- iter->index++;
+ iter->index = __radix_tree_iter_add(iter, 1);
+
+ if (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
+ radix_tree_is_internal_node(*slot)) {
+ if (entry_to_node(*slot) == canon)
+ continue;
+ iter->next_index = iter->index;
+ break;
+ }
+
if (likely(*slot))
return slot;
if (flags & RADIX_TREE_ITER_CONTIG) {
@@ -476,34 +520,6 @@
}
/**
- * radix_tree_for_each_chunk - iterate over chunks
- *
- * @slot: the void** variable for pointer to chunk first slot
- * @root: the struct radix_tree_root pointer
- * @iter: the struct radix_tree_iter pointer
- * @start: iteration starting index
- * @flags: RADIX_TREE_ITER_* and tag index
- *
- * Locks can be released and reacquired between iterations.
- */
-#define radix_tree_for_each_chunk(slot, root, iter, start, flags) \
- for (slot = radix_tree_iter_init(iter, start) ; \
- (slot = radix_tree_next_chunk(root, iter, flags)) ;)
-
-/**
- * radix_tree_for_each_chunk_slot - iterate over slots in one chunk
- *
- * @slot: the void** variable, at the beginning points to chunk first slot
- * @iter: the struct radix_tree_iter pointer
- * @flags: RADIX_TREE_ITER_*, should be constant
- *
- * This macro is designed to be nested inside radix_tree_for_each_chunk().
- * @slot points to the radix tree slot, @iter->index contains its index.
- */
-#define radix_tree_for_each_chunk_slot(slot, iter, flags) \
- for (; slot ; slot = radix_tree_next_slot(slot, iter, flags))
-
-/**
* radix_tree_for_each_slot - iterate over non-empty slots
*
* @slot: the void** variable for pointer to slot
diff --git a/include/linux/random.h b/include/linux/random.h
index 9c29122..e47e533 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -26,7 +26,6 @@
extern int add_random_ready_callback(struct random_ready_callback *rdy);
extern void del_random_ready_callback(struct random_ready_callback *rdy);
extern void get_random_bytes_arch(void *buf, int nbytes);
-void generate_random_uuid(unsigned char uuid_out[16]);
extern int random_int_secret_init(void);
#ifndef MODULE
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0fc28e4..2c036de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,6 +521,7 @@
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
+#define MMF_OOM_REAPED 21 /* mm has been already reaped */
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
@@ -668,6 +669,7 @@
atomic_t sigcnt;
atomic_t live;
int nr_threads;
+ atomic_t oom_victims; /* # of TIF_MEMDIE threads in this thread group */
struct list_head thread_head;
wait_queue_head_t wait_chldexit; /* for wait4() */
@@ -2725,14 +2727,24 @@
/* mmdrop drops the mm and the page tables */
extern void __mmdrop(struct mm_struct *);
-static inline void mmdrop(struct mm_struct * mm)
+static inline void mmdrop(struct mm_struct *mm)
{
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
+static inline bool mmget_not_zero(struct mm_struct *mm)
+{
+ return atomic_inc_not_zero(&mm->mm_users);
+}
+
/* mmput gets rid of the mappings and all user-space */
extern void mmput(struct mm_struct *);
+/* same as above but performs the slow path from async context. Can
+ * be called from atomic context as well
+ */
+extern void mmput_async(struct mm_struct *);
+
/* Grab a reference to a task's mm, if it is not already going away */
extern struct mm_struct *get_task_mm(struct task_struct *task);
/*
@@ -2761,7 +2773,14 @@
}
#endif
extern void flush_thread(void);
-extern void exit_thread(void);
+
+#ifdef CONFIG_HAVE_EXIT_THREAD
+extern void exit_thread(struct task_struct *tsk);
+#else
+static inline void exit_thread(struct task_struct *tsk)
+{
+}
+#endif
extern void exit_files(struct task_struct *);
extern void __cleanup_sighand(struct sighand_struct *);
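The intended calling pattern for the new mm helpers, as a minimal sketch
(the caller is illustrative):

	static void inspect_mm(struct mm_struct *mm)
	{
		/* pin mm_users only if the mm is not already being torn down */
		if (!mmget_not_zero(mm))
			return;

		/* safe to walk the address space under the pinned reference */

		/*
		 * Drop the reference.  A caller that must not sleep can use
		 * mmput_async() instead, which defers the final teardown to
		 * a workqueue.
		 */
		mmput(mm);
	}
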
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ad22035..0af2bb2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d795472..d022390 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -371,10 +371,10 @@
size_t sigsetsize);
asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
siginfo_t __user *uinfo);
-asmlinkage long sys_kill(int pid, int sig);
-asmlinkage long sys_tgkill(int tgid, int pid, int sig);
-asmlinkage long sys_tkill(int pid, int sig);
-asmlinkage long sys_rt_sigqueueinfo(int pid, int sig, siginfo_t __user *uinfo);
+asmlinkage long sys_kill(pid_t pid, int sig);
+asmlinkage long sys_tgkill(pid_t tgid, pid_t pid, int sig);
+asmlinkage long sys_tkill(pid_t pid, int sig);
+asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo);
asmlinkage long sys_sgetmask(void);
asmlinkage long sys_ssetmask(int newmask);
asmlinkage long sys_signal(int sig, __sighandler_t handler);
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 6df2509..2d095fc 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -1,7 +1,7 @@
/*
* UUID/GUID definition
*
- * Copyright (C) 2010, Intel Corp.
+ * Copyright (C) 2010, 2016 Intel Corp.
* Huang Ying <ying.huang@intel.com>
*
* This program is free software; you can redistribute it and/or
@@ -12,16 +12,17 @@
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
#ifndef _LINUX_UUID_H_
#define _LINUX_UUID_H_
#include <uapi/linux/uuid.h>
+/*
+ * The length of a UUID string ("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")
+ * not including trailing NUL.
+ */
+#define UUID_STRING_LEN 36
static inline int uuid_le_cmp(const uuid_le u1, const uuid_le u2)
{
@@ -33,7 +34,17 @@
return memcmp(&u1, &u2, sizeof(uuid_be));
}
+void generate_random_uuid(unsigned char uuid[16]);
+
extern void uuid_le_gen(uuid_le *u);
extern void uuid_be_gen(uuid_be *u);
+bool __must_check uuid_is_valid(const char *uuid);
+
+extern const u8 uuid_le_index[16];
+extern const u8 uuid_be_index[16];
+
+int uuid_le_to_bin(const char *uuid, uuid_le *u);
+int uuid_be_to_bin(const char *uuid, uuid_be *u);
+
#endif
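A minimal sketch of how the new string helpers fit together (the wrapper
function is illustrative):

	static int parse_uuid(const char *str, uuid_be *uuid)
	{
		/*
		 * str must be the canonical 36-character form, e.g.
		 * "0cb4ddff-a545-4401-9d06-688af53e7f84".
		 */
		if (!uuid_is_valid(str))
			return -EINVAL;

		/* fills the 16 raw bytes; returns 0 on success */
		return uuid_be_to_bin(str, uuid);
	}
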
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index d1f1d33..957adb7 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -4,6 +4,7 @@
#include <linux/spinlock.h>
#include <linux/init.h>
#include <linux/list.h>
+#include <linux/llist.h>
#include <asm/page.h> /* pgprot_t */
#include <linux/rbtree.h>
@@ -44,7 +45,7 @@
unsigned long flags;
struct rb_node rb_node; /* address sorted rbtree */
struct list_head list; /* address sorted list */
- struct list_head purge_list; /* "lazy purge" list */
+ struct llist_node purge_list; /* "lazy purge" list */
struct vm_struct *vm;
struct rcu_head rcu_head;
};
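Switching purge_list from list_head to llist_node enables a lock-free
multi-producer/single-consumer purge list. The mm/vmalloc.c side of the
change is not part of this hunk, so the following only sketches the
pattern (names are illustrative):

	static LLIST_HEAD(purge_llist);

	/* producers may add entries concurrently without a lock */
	static void defer_purge(struct vmap_area *va)
	{
		llist_add(&va->purge_list, &purge_llist);
	}

	/* a single consumer detaches the whole list in one atomic step */
	static void purge_deferred(void)
	{
		struct llist_node *head = llist_del_all(&purge_llist);
		struct vmap_area *va, *tmp;

		llist_for_each_entry_safe(va, tmp, head, purge_list) {
			/* free the vmap area */
		}
	}
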
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 34eb160..57a8e98 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -41,10 +41,10 @@
struct zs_pool;
-struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
+struct zs_pool *zs_create_pool(const char *name);
void zs_destroy_pool(struct zs_pool *pool);
-unsigned long zs_malloc(struct zs_pool *pool, size_t size);
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
void zs_free(struct zs_pool *pool, unsigned long obj);
void *zs_map_object(struct zs_pool *pool, unsigned long handle,
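The zsmalloc change moves the gfp mask from pool creation to the
allocation call. A sketch of the new calling convention (pool name and
flags are illustrative):

	static int zs_example(size_t size)
	{
		struct zs_pool *pool;
		unsigned long handle;

		/* the pool no longer carries allocation flags */
		pool = zs_create_pool("example");
		if (!pool)
			return -ENOMEM;

		/* the gfp mask is supplied per allocation instead */
		handle = zs_malloc(pool, size, GFP_NOIO | __GFP_HIGHMEM);
		if (!handle) {
			zs_destroy_pool(pool);
			return -ENOMEM;
		}

		zs_free(pool, handle);
		zs_destroy_pool(pool);
		return 0;
	}
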
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index e215bf6..36e2d6f 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -10,10 +10,11 @@
#include <trace/events/mmflags.h>
#define COMPACTION_STATUS \
- EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_SKIPPED, "skipped") \
+ EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_CONTINUE, "continue") \
EM( COMPACT_PARTIAL, "partial") \
+ EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
EM( COMPACT_COMPLETE, "complete") \
EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
diff --git a/include/uapi/linux/uuid.h b/include/uapi/linux/uuid.h
index 786f077..3738e5f 100644
--- a/include/uapi/linux/uuid.h
+++ b/include/uapi/linux/uuid.h
@@ -12,10 +12,6 @@
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
#ifndef _UAPI_LINUX_UUID_H_
diff --git a/init/Kconfig b/init/Kconfig
index 79a91a2..a9c4aefd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -862,6 +862,28 @@
13 => 8 KB for each CPU
12 => 4 KB for each CPU
+config NMI_LOG_BUF_SHIFT
+ int "Temporary per-CPU NMI log buffer size (12 => 4KB, 13 => 8KB)"
+ range 10 21
+ default 13
+ depends on PRINTK_NMI
+ help
+ Select the size of a per-CPU buffer where NMI messages are temporarily
+ stored. They are copied to the main log buffer in a safe context
+ to avoid a deadlock. The value defines the size as a power of 2.
+
+ NMI messages are rare and limited. The largest one is when
+ a backtrace is printed. It usually fits into 4KB. Select
+ 8KB if you want to be on the safe side.
+
+ Examples:
+ 17 => 128 KB for each CPU
+ 16 => 64 KB for each CPU
+ 15 => 32 KB for each CPU
+ 14 => 16 KB for each CPU
+ 13 => 8 KB for each CPU
+ 12 => 4 KB for each CPU
+
#
# Architectures with an unreliable sched_clock() should select this:
#
@@ -1454,6 +1476,11 @@
very difficult to diagnose system problems, saying N here is
strongly discouraged.
+config PRINTK_NMI
+ def_bool y
+ depends on PRINTK
+ depends on HAVE_NMI
+
config BUG
bool "BUG() support" if EXPERT
default y
diff --git a/init/main.c b/init/main.c
index b3c6e36..bc0f9e0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -569,6 +569,7 @@
timekeeping_init();
time_init();
sched_clock_postinit();
+ printk_nmi_init();
perf_event_init();
profile_init();
call_function_init();
@@ -606,7 +607,6 @@
initrd_start = 0;
}
#endif
- page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
@@ -706,21 +706,20 @@
static bool __init_or_module initcall_blacklisted(initcall_t fn)
{
struct blacklist_entry *entry;
- char *fn_name;
+ char fn_name[KSYM_SYMBOL_LEN];
- fn_name = kasprintf(GFP_KERNEL, "%pf", fn);
- if (!fn_name)
+ if (list_empty(&blacklisted_initcalls))
return false;
+ sprint_symbol_no_offset(fn_name, (unsigned long)fn);
+
list_for_each_entry(entry, &blacklisted_initcalls, next) {
if (!strcmp(fn_name, entry->buf)) {
pr_debug("initcall %s blacklisted\n", fn_name);
- kfree(fn_name);
return true;
}
}
- kfree(fn_name);
return false;
}
#else
@@ -1004,6 +1003,8 @@
sched_init_smp();
page_alloc_init_late();
+ /* Initialize page ext after all struct pages are initialized */
+ page_ext_init();
do_basic_setup();
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195..75b34fe 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -746,7 +746,7 @@
disassociate_ctty(1);
exit_task_namespaces(tsk);
exit_task_work(tsk);
- exit_thread();
+ exit_thread(tsk);
/*
* Flush inherited counters to the parent - before the parent
diff --git a/kernel/fork.c b/kernel/fork.c
index 3e84515..103d78f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -699,6 +699,26 @@
}
EXPORT_SYMBOL_GPL(__mmdrop);
+static inline void __mmput(struct mm_struct *mm)
+{
+ VM_BUG_ON(atomic_read(&mm->mm_users));
+
+ uprobe_clear_state(mm);
+ exit_aio(mm);
+ ksm_exit(mm);
+ khugepaged_exit(mm); /* must run before exit_mmap */
+ exit_mmap(mm);
+ set_mm_exe_file(mm, NULL);
+ if (!list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ list_del(&mm->mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+ if (mm->binfmt)
+ module_put(mm->binfmt->module);
+ mmdrop(mm);
+}
+
/*
* Decrement the use count and release all resources for an mm.
*/
@@ -706,25 +726,25 @@
{
might_sleep();
- if (atomic_dec_and_test(&mm->mm_users)) {
- uprobe_clear_state(mm);
- exit_aio(mm);
- ksm_exit(mm);
- khugepaged_exit(mm); /* must run before exit_mmap */
- exit_mmap(mm);
- set_mm_exe_file(mm, NULL);
- if (!list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- list_del(&mm->mmlist);
- spin_unlock(&mmlist_lock);
- }
- if (mm->binfmt)
- module_put(mm->binfmt->module);
- mmdrop(mm);
- }
+ if (atomic_dec_and_test(&mm->mm_users))
+ __mmput(mm);
}
EXPORT_SYMBOL_GPL(mmput);
+static void mmput_async_fn(struct work_struct *work)
+{
+ struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+ __mmput(mm);
+}
+
+void mmput_async(struct mm_struct *mm)
+{
+ if (atomic_dec_and_test(&mm->mm_users)) {
+ INIT_WORK(&mm->async_put_work, mmput_async_fn);
+ schedule_work(&mm->async_put_work);
+ }
+}
+
/**
* set_mm_exe_file - change a reference to the mm's executable file
*
@@ -1470,7 +1490,7 @@
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
- goto bad_fork_cleanup_io;
+ goto bad_fork_cleanup_thread;
}
}
@@ -1632,6 +1652,8 @@
bad_fork_free_pid:
if (pid != &init_struct_pid)
free_pid(pid);
+bad_fork_cleanup_thread:
+ exit_thread(p);
bad_fork_cleanup_io:
if (p->io_context)
exit_io_context(p);
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index d65f6f3..8798b6c 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -139,12 +139,7 @@
{
mutex_lock(&irq_domain_mutex);
- /*
- * radix_tree_delete() takes care of destroying the root
- * node when all entries are removed. Shout if there are
- * any mappings left.
- */
- WARN_ON(domain->revmap_tree.height);
+ WARN_ON(!radix_tree_empty(&domain->revmap_tree));
list_del(&domain->link);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 1c03dfb..d5d4082 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -893,6 +893,7 @@
old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
if (old_cpu == PANIC_CPU_INVALID) {
/* This is the 1st CPU which comes here, so go ahead. */
+ printk_nmi_flush_on_panic();
__crash_kexec(regs);
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 535c965..8aa7449 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -160,8 +160,10 @@
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
- if (!crash_kexec_post_notifiers)
+ if (!crash_kexec_post_notifiers) {
+ printk_nmi_flush_on_panic();
__crash_kexec(NULL);
+ }
/*
* Note smp_send_stop is the usual smp shutdown function, which
@@ -176,6 +178,8 @@
*/
atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
+ /* The flush may be called twice; it tries harder with a single online CPU */
+ printk_nmi_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
/*
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 85405bd..abb0042 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -1,2 +1,3 @@
obj-y = printk.o
+obj-$(CONFIG_PRINTK_NMI) += nmi.o
obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
new file mode 100644
index 0000000..7fd2838
--- /dev/null
+++ b/kernel/printk/internal.h
@@ -0,0 +1,57 @@
+/*
+ * internal.h - printk internal definitions
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <linux/percpu.h>
+
+typedef __printf(1, 0) int (*printk_func_t)(const char *fmt, va_list args);
+
+int __printf(1, 0) vprintk_default(const char *fmt, va_list args);
+
+#ifdef CONFIG_PRINTK_NMI
+
+extern raw_spinlock_t logbuf_lock;
+
+/*
+ * printk() cannot take logbuf_lock in NMI context. Instead,
+ * it temporarily stores the strings into a per-CPU buffer.
+ * The alternative implementation is chosen transparently
+ * via a per-CPU variable.
+ */
+DECLARE_PER_CPU(printk_func_t, printk_func);
+static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args)
+{
+ return this_cpu_read(printk_func)(fmt, args);
+}
+
+extern atomic_t nmi_message_lost;
+static inline int get_nmi_message_lost(void)
+{
+ return atomic_xchg(&nmi_message_lost, 0);
+}
+
+#else /* CONFIG_PRINTK_NMI */
+
+static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args)
+{
+ return vprintk_default(fmt, args);
+}
+
+static inline int get_nmi_message_lost(void)
+{
+ return 0;
+}
+
+#endif /* CONFIG_PRINTK_NMI */
diff --git a/kernel/printk/nmi.c b/kernel/printk/nmi.c
new file mode 100644
index 0000000..b69eb8a
--- /dev/null
+++ b/kernel/printk/nmi.c
@@ -0,0 +1,260 @@
+/*
+ * nmi.c - Safe printk in NMI context
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/preempt.h>
+#include <linux/spinlock.h>
+#include <linux/debug_locks.h>
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+#include <linux/irq_work.h>
+#include <linux/printk.h>
+
+#include "internal.h"
+
+/*
+ * printk() cannot take logbuf_lock in NMI context. Instead,
+ * it uses an alternative implementation that temporarily stores
+ * the strings into a per-CPU buffer. The content of the buffer
+ * is later flushed into the main ring buffer via IRQ work.
+ *
+ * The alternative implementation is chosen transparently
+ * via @printk_func per-CPU variable.
+ *
+ * The implementation also allows flushing the strings from another CPU.
+ * There are situations when we want to make sure that all buffers
+ * were handled or when IRQs are blocked.
+ */
+DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
+static int printk_nmi_irq_ready;
+atomic_t nmi_message_lost;
+
+#define NMI_LOG_BUF_LEN ((1 << CONFIG_NMI_LOG_BUF_SHIFT) - \
+ sizeof(atomic_t) - sizeof(struct irq_work))
+
+struct nmi_seq_buf {
+ atomic_t len; /* length of written data */
+ struct irq_work work; /* IRQ work that flushes the buffer */
+ unsigned char buffer[NMI_LOG_BUF_LEN];
+};
+static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
+
+/*
+ * Safe printk() for NMI context. It uses a per-CPU buffer to
+ * store the message. NMIs are not nested, so there is always only
+ * one writer running. But the buffer might get flushed from another
+ * CPU, so we need to be careful.
+ */
+static int vprintk_nmi(const char *fmt, va_list args)
+{
+ struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
+ int add = 0;
+ size_t len;
+
+again:
+ len = atomic_read(&s->len);
+
+ if (len >= sizeof(s->buffer)) {
+ atomic_inc(&nmi_message_lost);
+ return 0;
+ }
+
+ /*
+ * Make sure that all old data have been read before the buffer was
+ * reset. This is not needed when we just append data.
+ */
+ if (!len)
+ smp_rmb();
+
+ add = vsnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, args);
+
+ /*
+ * Do it once again if the buffer has been flushed in the meantime.
+ * Note that atomic_cmpxchg() is an implicit memory barrier that
+ * makes sure that the data were written before updating s->len.
+ */
+ if (atomic_cmpxchg(&s->len, len, len + add) != len)
+ goto again;
+
+ /* Get flushed in a safer context. */
+ if (add && printk_nmi_irq_ready) {
+ /* Make sure that IRQ work is really initialized. */
+ smp_rmb();
+ irq_work_queue(&s->work);
+ }
+
+ return add;
+}
+
+/*
+ * Printk one line from the temporary buffer, from the @start index up to
+ * and including the @end index.
+ */
+static void print_nmi_seq_line(struct nmi_seq_buf *s, int start, int end)
+{
+ const char *buf = s->buffer + start;
+
+ /*
+ * The buffers are flushed in NMI only on panic. The messages must
+ * go only into the ring buffer at this stage. Consoles will get
+ * explicitly called later when a crashdump is not generated.
+ */
+ if (in_nmi())
+ printk_deferred("%.*s", (end - start) + 1, buf);
+ else
+ printk("%.*s", (end - start) + 1, buf);
+
+}
+
+/*
+ * Flush data from the associated per_CPU buffer. The function
+ * can be called either via IRQ work or independently.
+ */
+static void __printk_nmi_flush(struct irq_work *work)
+{
+ static raw_spinlock_t read_lock =
+ __RAW_SPIN_LOCK_INITIALIZER(read_lock);
+ struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
+ unsigned long flags;
+ size_t len, size;
+ int i, last_i;
+
+ /*
+ * The lock has two functions. First, one reader has to flush all
+ * available messages to make the lockless synchronization with
+ * writers easier. Second, we do not want to mix messages from
+ * different CPUs. This is especially important when printing
+ * a backtrace.
+ */
+ raw_spin_lock_irqsave(&read_lock, flags);
+
+ i = 0;
+more:
+ len = atomic_read(&s->len);
+
+ /*
+ * This is just a paranoid check that nobody has manipulated
+ * the buffer in an unexpected way. If we printed something then
+ * @len must only increase.
+ */
+ if (i && i >= len)
+ pr_err("printk_nmi_flush: internal error: i=%d >= len=%zu\n",
+ i, len);
+
+ if (!len)
+ goto out; /* Someone else has already flushed the buffer. */
+
+ /* Make sure that data has been written up to the @len */
+ smp_rmb();
+
+ size = min(len, sizeof(s->buffer));
+ last_i = i;
+
+ /* Print line by line. */
+ for (; i < size; i++) {
+ if (s->buffer[i] == '\n') {
+ print_nmi_seq_line(s, last_i, i);
+ last_i = i + 1;
+ }
+ }
+ /* Check if there was a partial line. */
+ if (last_i < size) {
+ print_nmi_seq_line(s, last_i, size - 1);
+ pr_cont("\n");
+ }
+
+ /*
+ * Check that nothing has got added in the meantime and truncate
+ * the buffer. Note that atomic_cmpxchg() is an implicit memory
+ * barrier that makes sure that the data were copied before
+ * updating s->len.
+ */
+ if (atomic_cmpxchg(&s->len, len, 0) != len)
+ goto more;
+
+out:
+ raw_spin_unlock_irqrestore(&read_lock, flags);
+}
+
+/**
+ * printk_nmi_flush - flush all per-cpu nmi buffers.
+ *
+ * The buffers are flushed automatically via IRQ work. This function
+ * is useful only when someone wants to be sure that all buffers have
+ * been flushed at some point.
+ */
+void printk_nmi_flush(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ __printk_nmi_flush(&per_cpu(nmi_print_seq, cpu).work);
+}
+
+/**
+ * printk_nmi_flush_on_panic - flush all per-cpu nmi buffers when the system
+ * goes down.
+ *
+ * Similar to printk_nmi_flush() but it can be called even in NMI context when
+ * the system goes down. It makes a best effort to get NMI messages into
+ * the main ring buffer.
+ *
+ * Note that it could try harder when there is only one CPU online.
+ */
+void printk_nmi_flush_on_panic(void)
+{
+ /*
+ * Make sure that we can access the main ring buffer.
+ * Do not risk a double release when more CPUs are up.
+ */
+ if (in_nmi() && raw_spin_is_locked(&logbuf_lock)) {
+ if (num_online_cpus() > 1)
+ return;
+
+ debug_locks_off();
+ raw_spin_lock_init(&logbuf_lock);
+ }
+
+ printk_nmi_flush();
+}
+
+void __init printk_nmi_init(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu);
+
+ init_irq_work(&s->work, __printk_nmi_flush);
+ }
+
+ /* Make sure that IRQ works are initialized before enabling. */
+ smp_wmb();
+ printk_nmi_irq_ready = 1;
+
+ /* Flush pending messages that did not have scheduled IRQ works. */
+ printk_nmi_flush();
+}
+
+void printk_nmi_enter(void)
+{
+ this_cpu_write(printk_func, vprintk_nmi);
+}
+
+void printk_nmi_exit(void)
+{
+ this_cpu_write(printk_func, vprintk_default);
+}
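The enter/exit hooks above are meant to bracket NMI handling so that any
printk() issued in between is redirected to the per-CPU buffer; the series
wires them into nmi_enter()/nmi_exit(). A simplified sketch of the
resulting flow (the handler itself is illustrative):

	static void example_nmi_handler(struct pt_regs *regs)
	{
		printk_nmi_enter();	/* printk() now goes through vprintk_nmi() */

		pr_warn("NMI backtrace for cpu %d\n", smp_processor_id());
		if (regs)
			show_regs(regs);

		printk_nmi_exit();	/* back to vprintk_default(); the buffer
					 * is flushed to the ring buffer later
					 * via irq_work */
	}
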
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index bfbf284..60cdf63 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -55,6 +55,7 @@
#include "console_cmdline.h"
#include "braille.h"
+#include "internal.h"
int console_printk[4] = {
CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */
@@ -244,7 +245,7 @@
* within the scheduler's rq lock. It must be released before calling
* console_unlock() or anything else that might wake up a process.
*/
-static DEFINE_RAW_SPINLOCK(logbuf_lock);
+DEFINE_RAW_SPINLOCK(logbuf_lock);
#ifdef CONFIG_PRINTK
DECLARE_WAIT_QUEUE_HEAD(log_wait);
@@ -1616,6 +1617,7 @@
unsigned long flags;
int this_cpu;
int printed_len = 0;
+ int nmi_message_lost;
bool in_sched = false;
/* cpu currently holding logbuf_lock in this function */
static unsigned int logbuf_cpu = UINT_MAX;
@@ -1666,6 +1668,15 @@
strlen(recursion_msg));
}
+ nmi_message_lost = get_nmi_message_lost();
+ if (unlikely(nmi_message_lost)) {
+ text_len = scnprintf(textbuf, sizeof(textbuf),
+ "BAD LUCK: lost %d message(s) from NMI context!",
+ nmi_message_lost);
+ printed_len += log_store(0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
+ NULL, 0, textbuf, text_len);
+ }
+
/*
* The printf needs to come first; we need the syslog
* prefix which might be passed-in as a parameter.
@@ -1807,14 +1818,6 @@
}
EXPORT_SYMBOL_GPL(vprintk_default);
-/*
- * This allows printk to be diverted to another function per cpu.
- * This is useful for calling printk functions from within NMI
- * without worrying about race conditions that can lock up the
- * box.
- */
-DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
-
/**
* printk - print a kernel message
* @fmt: format string
@@ -1838,21 +1841,11 @@
*/
asmlinkage __visible int printk(const char *fmt, ...)
{
- printk_func_t vprintk_func;
va_list args;
int r;
va_start(args, fmt);
-
- /*
- * If a caller overrides the per_cpu printk_func, then it needs
- * to disable preemption when calling printk(). Otherwise
- * the printk_func should be set to the default. No need to
- * disable preemption here.
- */
- vprintk_func = this_cpu_read(printk_func);
r = vprintk_func(fmt, args);
-
va_end(args);
return r;
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 10a1d7d..6eb99c1 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -13,6 +13,7 @@
#include <linux/ctype.h>
#include <linux/netdevice.h>
#include <linux/kernel.h>
+#include <linux/uuid.h>
#include <linux/slab.h>
#include <linux/compat.h>
@@ -1117,9 +1118,8 @@
/* Only supports reads */
if (oldval && oldlen) {
- char buf[40], *str = buf;
- unsigned char uuid[16];
- int i;
+ char buf[UUID_STRING_LEN + 1];
+ uuid_be uuid;
result = kernel_read(file, 0, buf, sizeof(buf) - 1);
if (result < 0)
@@ -1127,24 +1127,15 @@
buf[result] = '\0';
- /* Convert the uuid to from a string to binary */
- for (i = 0; i < 16; i++) {
- result = -EIO;
- if (!isxdigit(str[0]) || !isxdigit(str[1]))
- goto out;
-
- uuid[i] = (hex_to_bin(str[0]) << 4) |
- hex_to_bin(str[1]);
- str += 2;
- if (*str == '-')
- str++;
- }
+ result = -EIO;
+ if (uuid_be_to_bin(buf, &uuid))
+ goto out;
if (oldlen > 16)
oldlen = 16;
result = -EFAULT;
- if (copy_to_user(oldval, uuid, oldlen))
+ if (copy_to_user(oldval, &uuid, oldlen))
goto out;
copied = oldlen;
diff --git a/lib/Kconfig b/lib/Kconfig
index 61d55bd..d79909d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -362,6 +362,9 @@
for more information.
+config RADIX_TREE_MULTIORDER
+ bool
+
config ASSOCIATIVE_ARRAY
bool
help
diff --git a/lib/gcd.c b/lib/gcd.c
index 3657f12..135ee64 100644
--- a/lib/gcd.c
+++ b/lib/gcd.c
@@ -2,20 +2,77 @@
#include <linux/gcd.h>
#include <linux/export.h>
-/* Greatest common divisor */
+/*
+ * This implements the binary GCD algorithm. (Often attributed to Stein,
+ * but as Knuth has noted, appears in a first-century Chinese math text.)
+ *
+ * This is faster than the division-based algorithm even on x86, which
+ * has decent hardware division.
+ */
+
+#if !defined(CONFIG_CPU_NO_EFFICIENT_FFS) && !defined(CPU_NO_EFFICIENT_FFS)
+
+/* If __ffs is available, the even/odd algorithm benchmarks slower. */
unsigned long gcd(unsigned long a, unsigned long b)
{
- unsigned long r;
+ unsigned long r = a | b;
- if (a < b)
- swap(a, b);
+ if (!a || !b)
+ return r;
- if (!b)
- return a;
- while ((r = a % b) != 0) {
- a = b;
- b = r;
+ b >>= __ffs(b);
+ if (b == 1)
+ return r & -r;
+
+ for (;;) {
+ a >>= __ffs(a);
+ if (a == 1)
+ return r & -r;
+ if (a == b)
+ return a << __ffs(r);
+
+ if (a < b)
+ swap(a, b);
+ a -= b;
}
- return b;
}
+
+#else
+
+/* If normalization is done by loops, the even/odd algorithm is a win. */
+unsigned long gcd(unsigned long a, unsigned long b)
+{
+ unsigned long r = a | b;
+
+ if (!a || !b)
+ return r;
+
+ /* Isolate lsbit of r */
+ r &= -r;
+
+ while (!(b & r))
+ b >>= 1;
+ if (b == r)
+ return r;
+
+ for (;;) {
+ while (!(a & r))
+ a >>= 1;
+ if (a == r)
+ return r;
+ if (a == b)
+ return a;
+
+ if (a < b)
+ swap(a, b);
+ a -= b;
+ a >>= 1;
+ if (a & r)
+ a += b;
+ a >>= 1;
+ }
+}
+
+#endif
+
EXPORT_SYMBOL_GPL(gcd);
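A short worked trace of the __ffs()-based variant (the BUG_ON() is only
an illustration):

	/*
	 * gcd(48, 36):
	 *   r = 48 | 36 = 52;  r & -r = 4   (the common power of two)
	 *   b = 36 >> __ffs(36) = 9
	 *   a = 48 >> __ffs(48) = 3;  3 < 9 -> swap -> a = 9 - 3 = 6
	 *   a =  6 >> __ffs(6)  = 3;  a == b -> return 3 << __ffs(52) = 12
	 */
	BUG_ON(gcd(48, 36) != 12);
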
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index 6019c53..26caf51 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -16,33 +16,14 @@
#include <linux/delay.h>
#include <linux/kprobes.h>
#include <linux/nmi.h>
-#include <linux/seq_buf.h>
#ifdef arch_trigger_all_cpu_backtrace
/* For reliability, we're prepared to waste bits here. */
static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
-static cpumask_t printtrace_mask;
-
-#define NMI_BUF_SIZE 4096
-
-struct nmi_seq_buf {
- unsigned char buffer[NMI_BUF_SIZE];
- struct seq_buf seq;
-};
-
-/* Safe printing in NMI context */
-static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
/* "in progress" flag of arch_trigger_all_cpu_backtrace */
static unsigned long backtrace_flag;
-static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
-{
- const char *buf = s->buffer + start;
-
- printk("%.*s", (end - start) + 1, buf);
-}
-
/*
* When raise() is called it will be is passed a pointer to the
* backtrace_mask. Architectures that call nmi_cpu_backtrace()
@@ -52,8 +33,7 @@
void nmi_trigger_all_cpu_backtrace(bool include_self,
void (*raise)(cpumask_t *mask))
{
- struct nmi_seq_buf *s;
- int i, cpu, this_cpu = get_cpu();
+ int i, this_cpu = get_cpu();
if (test_and_set_bit(0, &backtrace_flag)) {
/*
@@ -68,17 +48,6 @@
if (!include_self)
cpumask_clear_cpu(this_cpu, to_cpumask(backtrace_mask));
- cpumask_copy(&printtrace_mask, to_cpumask(backtrace_mask));
-
- /*
- * Set up per_cpu seq_buf buffers that the NMIs running on the other
- * CPUs will write to.
- */
- for_each_cpu(cpu, to_cpumask(backtrace_mask)) {
- s = &per_cpu(nmi_print_seq, cpu);
- seq_buf_init(&s->seq, s->buffer, NMI_BUF_SIZE);
- }
-
if (!cpumask_empty(to_cpumask(backtrace_mask))) {
pr_info("Sending NMI to %s CPUs:\n",
(include_self ? "all" : "other"));
@@ -94,73 +63,25 @@
}
/*
- * Now that all the NMIs have triggered, we can dump out their
- * back traces safely to the console.
+ * Force flush any remote buffers that might be stuck in IRQ context
+ * and therefore could not run their irq_work.
*/
- for_each_cpu(cpu, &printtrace_mask) {
- int len, last_i = 0;
+ printk_nmi_flush();
- s = &per_cpu(nmi_print_seq, cpu);
- len = seq_buf_used(&s->seq);
- if (!len)
- continue;
-
- /* Print line by line. */
- for (i = 0; i < len; i++) {
- if (s->buffer[i] == '\n') {
- print_seq_line(s, last_i, i);
- last_i = i + 1;
- }
- }
- /* Check if there was a partial line. */
- if (last_i < len) {
- print_seq_line(s, last_i, len - 1);
- pr_cont("\n");
- }
- }
-
- clear_bit(0, &backtrace_flag);
- smp_mb__after_atomic();
+ clear_bit_unlock(0, &backtrace_flag);
put_cpu();
}
-/*
- * It is not safe to call printk() directly from NMI handlers.
- * It may be fine if the NMI detected a lock up and we have no choice
- * but to do so, but doing a NMI on all other CPUs to get a back trace
- * can be done with a sysrq-l. We don't want that to lock up, which
- * can happen if the NMI interrupts a printk in progress.
- *
- * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
- * the content into a per cpu seq_buf buffer. Then when the NMIs are
- * all done, we can safely dump the contents of the seq_buf to a printk()
- * from a non NMI context.
- */
-static int nmi_vprintk(const char *fmt, va_list args)
-{
- struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
- unsigned int len = seq_buf_used(&s->seq);
-
- seq_buf_vprintf(&s->seq, fmt, args);
- return seq_buf_used(&s->seq) - len;
-}
-
bool nmi_cpu_backtrace(struct pt_regs *regs)
{
int cpu = smp_processor_id();
if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
- printk_func_t printk_func_save = this_cpu_read(printk_func);
-
- /* Replace printk to write into the NMI seq */
- this_cpu_write(printk_func, nmi_vprintk);
pr_warn("NMI backtrace for cpu %d\n", cpu);
if (regs)
show_regs(regs);
else
dump_stack();
- this_cpu_write(printk_func, printk_func_save);
-
cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
return true;
}
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1624c41..8b7d845 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -4,6 +4,8 @@
* Copyright (C) 2005 SGI, Christoph Lameter
* Copyright (C) 2006 Nick Piggin
* Copyright (C) 2012 Konstantin Khlebnikov
+ * Copyright (C) 2016 Intel, Matthew Wilcox
+ * Copyright (C) 2016 Intel, Ross Zwisler
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
@@ -37,12 +39,6 @@
/*
- * The height_to_maxindex array needs to be one deeper than the maximum
- * path as height 0 holds only 1 entry.
- */
-static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;
-
-/*
* Radix tree node cache.
*/
static struct kmem_cache *radix_tree_node_cachep;
@@ -64,20 +60,58 @@
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
- int nr;
+ unsigned nr;
/* nodes->private_data points to next preallocated node */
struct radix_tree_node *nodes;
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
-static inline void *ptr_to_indirect(void *ptr)
+static inline void *node_to_entry(void *ptr)
{
- return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
+ return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE);
}
-static inline void *indirect_to_ptr(void *ptr)
+#define RADIX_TREE_RETRY node_to_entry(NULL)
+
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+/* Sibling slots point directly to another slot in the same node */
+static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node)
{
- return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
+ void **ptr = node;
+ return (parent->slots <= ptr) &&
+ (ptr < parent->slots + RADIX_TREE_MAP_SIZE);
+}
+#else
+static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node)
+{
+ return false;
+}
+#endif
+
+static inline unsigned long get_slot_offset(struct radix_tree_node *parent,
+ void **slot)
+{
+ return slot - parent->slots;
+}
+
+static unsigned int radix_tree_descend(struct radix_tree_node *parent,
+ struct radix_tree_node **nodep, unsigned long index)
+{
+ unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK;
+ void **entry = rcu_dereference_raw(parent->slots[offset]);
+
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ if (radix_tree_is_internal_node(entry)) {
+ unsigned long siboff = get_slot_offset(parent, entry);
+ if (siboff < RADIX_TREE_MAP_SIZE) {
+ offset = siboff;
+ entry = rcu_dereference_raw(parent->slots[offset]);
+ }
+ }
+#endif
+
+ *nodep = (void *)entry;
+ return offset;
}
static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
@@ -108,7 +142,7 @@
root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
}
-static inline void root_tag_clear(struct radix_tree_root *root, unsigned int tag)
+static inline void root_tag_clear(struct radix_tree_root *root, unsigned tag)
{
root->gfp_mask &= (__force gfp_t)~(1 << (tag + __GFP_BITS_SHIFT));
}
@@ -120,7 +154,12 @@
static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag)
{
- return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
+ return (__force int)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
+}
+
+static inline unsigned root_tags_get(struct radix_tree_root *root)
+{
+ return (__force unsigned)root->gfp_mask >> __GFP_BITS_SHIFT;
}
/*
@@ -129,7 +168,7 @@
*/
static inline int any_tag_set(struct radix_tree_node *node, unsigned int tag)
{
- int idx;
+ unsigned idx;
for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
if (node->tags[tag][idx])
return 1;
@@ -173,38 +212,45 @@
return size;
}
-#if 0
-static void dump_node(void *slot, int height, int offset)
+#ifndef __KERNEL__
+static void dump_node(struct radix_tree_node *node, unsigned long index)
{
- struct radix_tree_node *node;
- int i;
+ unsigned long i;
- if (!slot)
- return;
+ pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
+ node, node->offset,
+ node->tags[0][0], node->tags[1][0], node->tags[2][0],
+ node->shift, node->count, node->parent);
- if (height == 0) {
- pr_debug("radix entry %p offset %d\n", slot, offset);
- return;
+ for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+ unsigned long first = index | (i << node->shift);
+ unsigned long last = first | ((1UL << node->shift) - 1);
+ void *entry = node->slots[i];
+ if (!entry)
+ continue;
+ if (is_sibling_entry(node, entry)) {
+ pr_debug("radix sblng %p offset %ld val %p indices %ld-%ld\n",
+ entry, i,
+ *(void **)entry_to_node(entry),
+ first, last);
+ } else if (!radix_tree_is_internal_node(entry)) {
+ pr_debug("radix entry %p offset %ld indices %ld-%ld\n",
+ entry, i, first, last);
+ } else {
+ dump_node(entry_to_node(entry), first);
+ }
}
-
- node = indirect_to_ptr(slot);
- pr_debug("radix node: %p offset %d tags %lx %lx %lx path %x count %d parent %p\n",
- slot, offset, node->tags[0][0], node->tags[1][0],
- node->tags[2][0], node->path, node->count, node->parent);
-
- for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
- dump_node(node->slots[i], height - 1, i);
}
/* For debug */
static void radix_tree_dump(struct radix_tree_root *root)
{
- pr_debug("radix root: %p height %d rnode %p tags %x\n",
- root, root->height, root->rnode,
+ pr_debug("radix root: %p rnode %p tags %x\n",
+ root, root->rnode,
root->gfp_mask >> __GFP_BITS_SHIFT);
- if (!radix_tree_is_indirect_ptr(root->rnode))
+ if (!radix_tree_is_internal_node(root->rnode))
return;
- dump_node(root->rnode, root->height, 0);
+ dump_node(entry_to_node(root->rnode), 0);
}
#endif
@@ -219,9 +265,9 @@
gfp_t gfp_mask = root_gfp_mask(root);
/*
- * Preload code isn't irq safe and it doesn't make sence to use
- * preloading in the interrupt anyway as all the allocations have to
- * be atomic. So just do normal allocation when in interrupt.
+ * Preload code isn't irq safe and it doesn't make sense to use
+ * preloading during an interrupt anyway as all the allocations have
+ * to be atomic. So just do normal allocation when in interrupt.
*/
if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
struct radix_tree_preload *rtp;
@@ -257,7 +303,7 @@
ret = kmem_cache_alloc(radix_tree_node_cachep,
gfp_mask | __GFP_ACCOUNT);
out:
- BUG_ON(radix_tree_is_indirect_ptr(ret));
+ BUG_ON(radix_tree_is_internal_node(ret));
return ret;
}
@@ -357,38 +403,58 @@
EXPORT_SYMBOL(radix_tree_maybe_preload);
/*
- * Return the maximum key which can be store into a
- * radix tree with height HEIGHT.
+ * The maximum index which can be stored in a radix tree
*/
-static inline unsigned long radix_tree_maxindex(unsigned int height)
+static inline unsigned long shift_maxindex(unsigned int shift)
{
- return height_to_maxindex[height];
+ return (RADIX_TREE_MAP_SIZE << shift) - 1;
+}
+
+static inline unsigned long node_maxindex(struct radix_tree_node *node)
+{
+ return shift_maxindex(node->shift);
+}
+
+static unsigned radix_tree_load_root(struct radix_tree_root *root,
+ struct radix_tree_node **nodep, unsigned long *maxindex)
+{
+ struct radix_tree_node *node = rcu_dereference_raw(root->rnode);
+
+ *nodep = node;
+
+ if (likely(radix_tree_is_internal_node(node))) {
+ node = entry_to_node(node);
+ *maxindex = node_maxindex(node);
+ return node->shift + RADIX_TREE_MAP_SHIFT;
+ }
+
+ *maxindex = 0;
+ return 0;
}
/*
* Extend a radix tree so it can store key @index.
*/
static int radix_tree_extend(struct radix_tree_root *root,
- unsigned long index, unsigned order)
+ unsigned long index, unsigned int shift)
{
- struct radix_tree_node *node;
struct radix_tree_node *slot;
- unsigned int height;
+ unsigned int maxshift;
int tag;
- /* Figure out what the height should be. */
- height = root->height + 1;
- while (index > radix_tree_maxindex(height))
- height++;
+ /* Figure out what the shift should be. */
+ maxshift = shift;
+ while (index > shift_maxindex(maxshift))
+ maxshift += RADIX_TREE_MAP_SHIFT;
- if ((root->rnode == NULL) && (order == 0)) {
- root->height = height;
+ slot = root->rnode;
+ if (!slot)
goto out;
- }
do {
- unsigned int newheight;
- if (!(node = radix_tree_node_alloc(root)))
+ struct radix_tree_node *node = radix_tree_node_alloc(root);
+
+ if (!node)
return -ENOMEM;
/* Propagate the aggregated tag info into the new root */
@@ -397,25 +463,20 @@
tag_set(node, tag, 0);
}
- /* Increase the height. */
- newheight = root->height+1;
- BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
- node->path = newheight;
+ BUG_ON(shift > BITS_PER_LONG);
+ node->shift = shift;
+ node->offset = 0;
node->count = 1;
node->parent = NULL;
- slot = root->rnode;
- if (radix_tree_is_indirect_ptr(slot) && newheight > 1) {
- slot = indirect_to_ptr(slot);
- slot->parent = node;
- slot = ptr_to_indirect(slot);
- }
+ if (radix_tree_is_internal_node(slot))
+ entry_to_node(slot)->parent = node;
node->slots[0] = slot;
- node = ptr_to_indirect(node);
- rcu_assign_pointer(root->rnode, node);
- root->height = newheight;
- } while (height > root->height);
+ slot = node_to_entry(node);
+ rcu_assign_pointer(root->rnode, slot);
+ shift += RADIX_TREE_MAP_SHIFT;
+ } while (shift <= maxshift);
out:
- return 0;
+ return maxshift + RADIX_TREE_MAP_SHIFT;
}
/**
@@ -439,71 +500,70 @@
unsigned order, struct radix_tree_node **nodep,
void ***slotp)
{
- struct radix_tree_node *node = NULL, *slot;
- unsigned int height, shift, offset;
- int error;
+ struct radix_tree_node *node = NULL, *child;
+ void **slot = (void **)&root->rnode;
+ unsigned long maxindex;
+ unsigned int shift, offset = 0;
+ unsigned long max = index | ((1UL << order) - 1);
- BUG_ON((0 < order) && (order < RADIX_TREE_MAP_SHIFT));
+ shift = radix_tree_load_root(root, &child, &maxindex);
/* Make sure the tree is high enough. */
- if (index > radix_tree_maxindex(root->height)) {
- error = radix_tree_extend(root, index, order);
- if (error)
+ if (max > maxindex) {
+ int error = radix_tree_extend(root, max, shift);
+ if (error < 0)
return error;
+ shift = error;
+ child = root->rnode;
+ if (order == shift)
+ shift += RADIX_TREE_MAP_SHIFT;
}
- slot = root->rnode;
-
- height = root->height;
- shift = height * RADIX_TREE_MAP_SHIFT;
-
- offset = 0; /* uninitialised var warning */
while (shift > order) {
- if (slot == NULL) {
+ shift -= RADIX_TREE_MAP_SHIFT;
+ if (child == NULL) {
/* Have to add a child node. */
- if (!(slot = radix_tree_node_alloc(root)))
+ child = radix_tree_node_alloc(root);
+ if (!child)
return -ENOMEM;
- slot->path = height;
- slot->parent = node;
- if (node) {
- rcu_assign_pointer(node->slots[offset],
- ptr_to_indirect(slot));
+ child->shift = shift;
+ child->offset = offset;
+ child->parent = node;
+ rcu_assign_pointer(*slot, node_to_entry(child));
+ if (node)
node->count++;
- slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
- } else
- rcu_assign_pointer(root->rnode,
- ptr_to_indirect(slot));
- } else if (!radix_tree_is_indirect_ptr(slot))
+ } else if (!radix_tree_is_internal_node(child))
break;
/* Go a level down */
- height--;
- shift -= RADIX_TREE_MAP_SHIFT;
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- node = indirect_to_ptr(slot);
- slot = node->slots[offset];
+ node = entry_to_node(child);
+ offset = radix_tree_descend(node, &child, index);
+ slot = &node->slots[offset];
}
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
/* Insert pointers to the canonical entry */
- if ((shift - order) > 0) {
- int i, n = 1 << (shift - order);
+ if (order > shift) {
+ unsigned i, n = 1 << (order - shift);
offset = offset & ~(n - 1);
- slot = ptr_to_indirect(&node->slots[offset]);
+ slot = &node->slots[offset];
+ child = node_to_entry(slot);
for (i = 0; i < n; i++) {
- if (node->slots[offset + i])
+ if (slot[i])
return -EEXIST;
}
for (i = 1; i < n; i++) {
- rcu_assign_pointer(node->slots[offset + i], slot);
+ rcu_assign_pointer(slot[i], child);
node->count++;
}
}
+#endif
if (nodep)
*nodep = node;
if (slotp)
- *slotp = node ? node->slots + offset : (void **)&root->rnode;
+ *slotp = slot;
return 0;
}
@@ -523,7 +583,7 @@
void **slot;
int error;
- BUG_ON(radix_tree_is_indirect_ptr(item));
+ BUG_ON(radix_tree_is_internal_node(item));
error = __radix_tree_create(root, index, order, &node, &slot);
if (error)
@@ -533,12 +593,13 @@
rcu_assign_pointer(*slot, item);
if (node) {
+ unsigned offset = get_slot_offset(node, slot);
node->count++;
- BUG_ON(tag_get(node, 0, index & RADIX_TREE_MAP_MASK));
- BUG_ON(tag_get(node, 1, index & RADIX_TREE_MAP_MASK));
+ BUG_ON(tag_get(node, 0, offset));
+ BUG_ON(tag_get(node, 1, offset));
+ BUG_ON(tag_get(node, 2, offset));
} else {
- BUG_ON(root_tag_get(root, 0));
- BUG_ON(root_tag_get(root, 1));
+ BUG_ON(root_tags_get(root));
}
return 0;
@@ -563,44 +624,25 @@
struct radix_tree_node **nodep, void ***slotp)
{
struct radix_tree_node *node, *parent;
- unsigned int height, shift;
+ unsigned long maxindex;
void **slot;
- node = rcu_dereference_raw(root->rnode);
- if (node == NULL)
+ restart:
+ parent = NULL;
+ slot = (void **)&root->rnode;
+ radix_tree_load_root(root, &node, &maxindex);
+ if (index > maxindex)
return NULL;
- if (!radix_tree_is_indirect_ptr(node)) {
- if (index > 0)
- return NULL;
+ while (radix_tree_is_internal_node(node)) {
+ unsigned offset;
- if (nodep)
- *nodep = NULL;
- if (slotp)
- *slotp = (void **)&root->rnode;
- return node;
+ if (node == RADIX_TREE_RETRY)
+ goto restart;
+ parent = entry_to_node(node);
+ offset = radix_tree_descend(parent, &node, index);
+ slot = parent->slots + offset;
}
- node = indirect_to_ptr(node);
-
- height = node->path & RADIX_TREE_HEIGHT_MASK;
- if (index > radix_tree_maxindex(height))
- return NULL;
-
- shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-
- do {
- parent = node;
- slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK);
- node = rcu_dereference_raw(*slot);
- if (node == NULL)
- return NULL;
- if (!radix_tree_is_indirect_ptr(node))
- break;
- node = indirect_to_ptr(node);
-
- shift -= RADIX_TREE_MAP_SHIFT;
- height--;
- } while (height > 0);
if (nodep)
*nodep = parent;
@@ -654,59 +696,72 @@
* radix_tree_tag_set - set a tag on a radix tree node
* @root: radix tree root
* @index: index key
- * @tag: tag index
+ * @tag: tag index
*
* Set the search tag (which must be < RADIX_TREE_MAX_TAGS)
* corresponding to @index in the radix tree. From
* the root all the way down to the leaf node.
*
- * Returns the address of the tagged item. Setting a tag on a not-present
+ * Returns the address of the tagged item. Setting a tag on a not-present
* item is a bug.
*/
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag)
{
- unsigned int height, shift;
- struct radix_tree_node *slot;
+ struct radix_tree_node *node, *parent;
+ unsigned long maxindex;
- height = root->height;
- BUG_ON(index > radix_tree_maxindex(height));
+ radix_tree_load_root(root, &node, &maxindex);
+ BUG_ON(index > maxindex);
- slot = indirect_to_ptr(root->rnode);
- shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+ while (radix_tree_is_internal_node(node)) {
+ unsigned offset;
- while (height > 0) {
- int offset;
+ parent = entry_to_node(node);
+ offset = radix_tree_descend(parent, &node, index);
+ BUG_ON(!node);
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- if (!tag_get(slot, tag, offset))
- tag_set(slot, tag, offset);
- slot = slot->slots[offset];
- BUG_ON(slot == NULL);
- if (!radix_tree_is_indirect_ptr(slot))
- break;
- slot = indirect_to_ptr(slot);
- shift -= RADIX_TREE_MAP_SHIFT;
- height--;
+ if (!tag_get(parent, tag, offset))
+ tag_set(parent, tag, offset);
}
/* set the root's tag bit */
- if (slot && !root_tag_get(root, tag))
+ if (!root_tag_get(root, tag))
root_tag_set(root, tag);
- return slot;
+ return node;
}
EXPORT_SYMBOL(radix_tree_tag_set);
+static void node_tag_clear(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ unsigned int tag, unsigned int offset)
+{
+ while (node) {
+ if (!tag_get(node, tag, offset))
+ return;
+ tag_clear(node, tag, offset);
+ if (any_tag_set(node, tag))
+ return;
+
+ offset = node->offset;
+ node = node->parent;
+ }
+
+ /* clear the root's tag bit */
+ if (root_tag_get(root, tag))
+ root_tag_clear(root, tag);
+}
+
/**
* radix_tree_tag_clear - clear a tag on a radix tree node
* @root: radix tree root
* @index: index key
- * @tag: tag index
+ * @tag: tag index
*
* Clear the search tag (which must be < RADIX_TREE_MAX_TAGS)
- * corresponding to @index in the radix tree. If
- * this causes the leaf node to have no tags set then clear the tag in the
+ * corresponding to @index in the radix tree. If this causes
+ * the leaf node to have no tags set then clear the tag in the
* next-to-leaf node, etc.
*
* Returns the address of the tagged item on success, else NULL. ie:
@@ -715,52 +770,25 @@
void *radix_tree_tag_clear(struct radix_tree_root *root,
unsigned long index, unsigned int tag)
{
- struct radix_tree_node *node = NULL;
- struct radix_tree_node *slot = NULL;
- unsigned int height, shift;
+ struct radix_tree_node *node, *parent;
+ unsigned long maxindex;
int uninitialized_var(offset);
- height = root->height;
- if (index > radix_tree_maxindex(height))
- goto out;
+ radix_tree_load_root(root, &node, &maxindex);
+ if (index > maxindex)
+ return NULL;
- shift = height * RADIX_TREE_MAP_SHIFT;
- slot = root->rnode;
+ parent = NULL;
- while (shift) {
- if (slot == NULL)
- goto out;
- if (!radix_tree_is_indirect_ptr(slot))
- break;
- slot = indirect_to_ptr(slot);
-
- shift -= RADIX_TREE_MAP_SHIFT;
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- node = slot;
- slot = slot->slots[offset];
+ while (radix_tree_is_internal_node(node)) {
+ parent = entry_to_node(node);
+ offset = radix_tree_descend(parent, &node, index);
}
- if (slot == NULL)
- goto out;
+ if (node)
+ node_tag_clear(root, parent, tag, offset);
- while (node) {
- if (!tag_get(node, tag, offset))
- goto out;
- tag_clear(node, tag, offset);
- if (any_tag_set(node, tag))
- goto out;
-
- index >>= RADIX_TREE_MAP_SHIFT;
- offset = index & RADIX_TREE_MAP_MASK;
- node = node->parent;
- }
-
- /* clear the root's tag bit */
- if (root_tag_get(root, tag))
- root_tag_clear(root, tag);
-
-out:
- return slot;
+ return node;
}
EXPORT_SYMBOL(radix_tree_tag_clear);
@@ -768,7 +796,7 @@
* radix_tree_tag_get - get a tag on a radix tree node
* @root: radix tree root
* @index: index key
- * @tag: tag index (< RADIX_TREE_MAX_TAGS)
+ * @tag: tag index (< RADIX_TREE_MAX_TAGS)
*
* Return values:
*
@@ -782,48 +810,44 @@
int radix_tree_tag_get(struct radix_tree_root *root,
unsigned long index, unsigned int tag)
{
- unsigned int height, shift;
- struct radix_tree_node *node;
+ struct radix_tree_node *node, *parent;
+ unsigned long maxindex;
- /* check the root's tag bit */
if (!root_tag_get(root, tag))
return 0;
- node = rcu_dereference_raw(root->rnode);
+ radix_tree_load_root(root, &node, &maxindex);
+ if (index > maxindex)
+ return 0;
if (node == NULL)
return 0;
- if (!radix_tree_is_indirect_ptr(node))
- return (index == 0);
- node = indirect_to_ptr(node);
+ while (radix_tree_is_internal_node(node)) {
+ unsigned offset;
- height = node->path & RADIX_TREE_HEIGHT_MASK;
- if (index > radix_tree_maxindex(height))
- return 0;
+ parent = entry_to_node(node);
+ offset = radix_tree_descend(parent, &node, index);
- shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-
- for ( ; ; ) {
- int offset;
-
- if (node == NULL)
+ if (!node)
return 0;
- node = indirect_to_ptr(node);
-
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- if (!tag_get(node, tag, offset))
+ if (!tag_get(parent, tag, offset))
return 0;
- if (height == 1)
- return 1;
- node = rcu_dereference_raw(node->slots[offset]);
- if (!radix_tree_is_indirect_ptr(node))
- return 1;
- shift -= RADIX_TREE_MAP_SHIFT;
- height--;
+ if (node == RADIX_TREE_RETRY)
+ break;
}
+
+ return 1;
}
EXPORT_SYMBOL(radix_tree_tag_get);
+static inline void __set_iter_shift(struct radix_tree_iter *iter,
+ unsigned int shift)
+{
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ iter->shift = shift;
+#endif
+}
+
/**
* radix_tree_next_chunk - find next chunk of slots for iteration
*
@@ -835,9 +859,9 @@
void **radix_tree_next_chunk(struct radix_tree_root *root,
struct radix_tree_iter *iter, unsigned flags)
{
- unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
- struct radix_tree_node *rnode, *node;
- unsigned long index, offset, height;
+ unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
+ struct radix_tree_node *node, *child;
+ unsigned long index, offset, maxindex;
if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
return NULL;
@@ -855,33 +879,28 @@
if (!index && iter->index)
return NULL;
- rnode = rcu_dereference_raw(root->rnode);
- if (radix_tree_is_indirect_ptr(rnode)) {
- rnode = indirect_to_ptr(rnode);
- } else if (rnode && !index) {
+ restart:
+ radix_tree_load_root(root, &child, &maxindex);
+ if (index > maxindex)
+ return NULL;
+ if (!child)
+ return NULL;
+
+ if (!radix_tree_is_internal_node(child)) {
/* Single-slot tree */
- iter->index = 0;
- iter->next_index = 1;
+ iter->index = index;
+ iter->next_index = maxindex + 1;
iter->tags = 1;
+ __set_iter_shift(iter, 0);
return (void **)&root->rnode;
- } else
- return NULL;
+ }
-restart:
- height = rnode->path & RADIX_TREE_HEIGHT_MASK;
- shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- offset = index >> shift;
+ do {
+ node = entry_to_node(child);
+ offset = radix_tree_descend(node, &child, index);
- /* Index outside of the tree */
- if (offset >= RADIX_TREE_MAP_SIZE)
- return NULL;
-
- node = rnode;
- while (1) {
- struct radix_tree_node *slot;
if ((flags & RADIX_TREE_ITER_TAGGED) ?
- !test_bit(offset, node->tags[tag]) :
- !node->slots[offset]) {
+ !tag_get(node, tag, offset) : !child) {
/* Hole detected */
if (flags & RADIX_TREE_ITER_CONTIG)
return NULL;
@@ -893,35 +912,30 @@
offset + 1);
else
while (++offset < RADIX_TREE_MAP_SIZE) {
- if (node->slots[offset])
+ void *slot = node->slots[offset];
+ if (is_sibling_entry(node, slot))
+ continue;
+ if (slot)
break;
}
- index &= ~((RADIX_TREE_MAP_SIZE << shift) - 1);
- index += offset << shift;
+ index &= ~node_maxindex(node);
+ index += offset << node->shift;
/* Overflow after ~0UL */
if (!index)
return NULL;
if (offset == RADIX_TREE_MAP_SIZE)
goto restart;
+ child = rcu_dereference_raw(node->slots[offset]);
}
- /* This is leaf-node */
- if (!shift)
- break;
-
- slot = rcu_dereference_raw(node->slots[offset]);
- if (slot == NULL)
+ if ((child == NULL) || (child == RADIX_TREE_RETRY))
goto restart;
- if (!radix_tree_is_indirect_ptr(slot))
- break;
- node = indirect_to_ptr(slot);
- shift -= RADIX_TREE_MAP_SHIFT;
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- }
+ } while (radix_tree_is_internal_node(child));
/* Update the iterator state */
- iter->index = index;
- iter->next_index = (index | RADIX_TREE_MAP_MASK) + 1;
+ iter->index = (index &~ node_maxindex(node)) | (offset << node->shift);
+ iter->next_index = (index | node_maxindex(node)) + 1;
+ __set_iter_shift(iter, node->shift);
/* Construct iter->tags bit-mask from node->tags[tag] array */
if (flags & RADIX_TREE_ITER_TAGGED) {
@@ -967,7 +981,7 @@
* set is outside the range we are scanning. This reults in dangling tags and
* can lead to problems with later tag operations (e.g. livelocks on lookups).
*
- * The function returns number of leaves where the tag was set and sets
+ * The function returns the number of leaves where the tag was set and sets
* *first_indexp to the first unscanned index.
* WARNING! *first_indexp can wrap if last_index is ULONG_MAX. Caller must
* be prepared to handle that.
@@ -977,14 +991,13 @@
unsigned long nr_to_tag,
unsigned int iftag, unsigned int settag)
{
- unsigned int height = root->height;
- struct radix_tree_node *node = NULL;
- struct radix_tree_node *slot;
- unsigned int shift;
+ struct radix_tree_node *parent, *node, *child;
+ unsigned long maxindex;
unsigned long tagged = 0;
unsigned long index = *first_indexp;
- last_index = min(last_index, radix_tree_maxindex(height));
+ radix_tree_load_root(root, &child, &maxindex);
+ last_index = min(last_index, maxindex);
if (index > last_index)
return 0;
if (!nr_to_tag)
@@ -993,80 +1006,62 @@
*first_indexp = last_index + 1;
return 0;
}
- if (height == 0) {
+ if (!radix_tree_is_internal_node(child)) {
*first_indexp = last_index + 1;
root_tag_set(root, settag);
return 1;
}
- shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
- slot = indirect_to_ptr(root->rnode);
+ node = entry_to_node(child);
for (;;) {
- unsigned long upindex;
- int offset;
-
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- if (!slot->slots[offset])
+ unsigned offset = radix_tree_descend(node, &child, index);
+ if (!child)
goto next;
- if (!tag_get(slot, iftag, offset))
+ if (!tag_get(node, iftag, offset))
goto next;
- if (shift) {
- node = slot;
- slot = slot->slots[offset];
- if (radix_tree_is_indirect_ptr(slot)) {
- slot = indirect_to_ptr(slot);
- shift -= RADIX_TREE_MAP_SHIFT;
- continue;
- } else {
- slot = node;
- node = node->parent;
- }
+ /* Sibling slots never have tags set on them */
+ if (radix_tree_is_internal_node(child)) {
+ node = entry_to_node(child);
+ continue;
}
/* tag the leaf */
- tagged += 1 << shift;
- tag_set(slot, settag, offset);
+ tagged++;
+ tag_set(node, settag, offset);
/* walk back up the path tagging interior nodes */
- upindex = index;
- while (node) {
- upindex >>= RADIX_TREE_MAP_SHIFT;
- offset = upindex & RADIX_TREE_MAP_MASK;
-
- /* stop if we find a node with the tag already set */
- if (tag_get(node, settag, offset))
+ parent = node;
+ for (;;) {
+ offset = parent->offset;
+ parent = parent->parent;
+ if (!parent)
break;
- tag_set(node, settag, offset);
- node = node->parent;
+ /* stop if we find a node with the tag already set */
+ if (tag_get(parent, settag, offset))
+ break;
+ tag_set(parent, settag, offset);
}
-
- /*
- * Small optimization: now clear that node pointer.
- * Since all of this slot's ancestors now have the tag set
- * from setting it above, we have no further need to walk
- * back up the tree setting tags, until we update slot to
- * point to another radix_tree_node.
- */
- node = NULL;
-
-next:
- /* Go to next item at level determined by 'shift' */
- index = ((index >> shift) + 1) << shift;
+ next:
+ /* Go to next entry in node */
+ index = ((index >> node->shift) + 1) << node->shift;
/* Overflow can happen when last_index is ~0UL... */
if (index > last_index || !index)
break;
- if (tagged >= nr_to_tag)
- break;
- while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
+ offset = (index >> node->shift) & RADIX_TREE_MAP_MASK;
+ while (offset == 0) {
/*
* We've fully scanned this node. Go up. Because
* last_index is guaranteed to be in the tree, what
* we do below cannot wander astray.
*/
- slot = slot->parent;
- shift += RADIX_TREE_MAP_SHIFT;
+ node = node->parent;
+ offset = (index >> node->shift) & RADIX_TREE_MAP_MASK;
}
+ if (is_sibling_entry(node, node->slots[offset]))
+ goto next;
+ if (tagged >= nr_to_tag)
+ break;
}
/*
* We need not to tag the root tag if there is no tag which is set with
@@ -1095,9 +1090,10 @@
*
* Like radix_tree_lookup, radix_tree_gang_lookup may be called under
* rcu_read_lock. In this case, rather than the returned results being
- * an atomic snapshot of the tree at a single point in time, the semantics
- * of an RCU protected gang lookup are as though multiple radix_tree_lookups
- * have been issued in individual locks, and results stored in 'results'.
+ * an atomic snapshot of the tree at a single point in time, the
+ * semantics of an RCU protected gang lookup are as though multiple
+ * radix_tree_lookups have been issued in individual locks, and results
+ * stored in 'results'.
*/
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
@@ -1114,7 +1110,7 @@
results[ret] = rcu_dereference_raw(*slot);
if (!results[ret])
continue;
- if (radix_tree_is_indirect_ptr(results[ret])) {
+ if (radix_tree_is_internal_node(results[ret])) {
slot = radix_tree_iter_retry(&iter);
continue;
}
@@ -1197,7 +1193,7 @@
results[ret] = rcu_dereference_raw(*slot);
if (!results[ret])
continue;
- if (radix_tree_is_indirect_ptr(results[ret])) {
+ if (radix_tree_is_internal_node(results[ret])) {
slot = radix_tree_iter_retry(&iter);
continue;
}
@@ -1247,58 +1243,48 @@
#if defined(CONFIG_SHMEM) && defined(CONFIG_SWAP)
#include <linux/sched.h> /* for cond_resched() */
+struct locate_info {
+ unsigned long found_index;
+ bool stop;
+};
+
/*
* This linear search is at present only useful to shmem_unuse_inode().
*/
static unsigned long __locate(struct radix_tree_node *slot, void *item,
- unsigned long index, unsigned long *found_index)
+ unsigned long index, struct locate_info *info)
{
- unsigned int shift, height;
unsigned long i;
- height = slot->path & RADIX_TREE_HEIGHT_MASK;
- shift = (height-1) * RADIX_TREE_MAP_SHIFT;
+ do {
+ unsigned int shift = slot->shift;
- for ( ; height > 1; height--) {
- i = (index >> shift) & RADIX_TREE_MAP_MASK;
- for (;;) {
- if (slot->slots[i] != NULL)
- break;
- index &= ~((1UL << shift) - 1);
- index += 1UL << shift;
- if (index == 0)
- goto out; /* 32-bit wraparound */
- i++;
- if (i == RADIX_TREE_MAP_SIZE)
+ for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
+ i < RADIX_TREE_MAP_SIZE;
+ i++, index += (1UL << shift)) {
+ struct radix_tree_node *node =
+ rcu_dereference_raw(slot->slots[i]);
+ if (node == RADIX_TREE_RETRY)
goto out;
- }
-
- slot = rcu_dereference_raw(slot->slots[i]);
- if (slot == NULL)
- goto out;
- if (!radix_tree_is_indirect_ptr(slot)) {
- if (slot == item) {
- *found_index = index + i;
- index = 0;
- } else {
- index += shift;
+ if (!radix_tree_is_internal_node(node)) {
+ if (node == item) {
+ info->found_index = index;
+ info->stop = true;
+ goto out;
+ }
+ continue;
}
- goto out;
+ node = entry_to_node(node);
+ if (is_sibling_entry(slot, node))
+ continue;
+ slot = node;
+ break;
}
- slot = indirect_to_ptr(slot);
- shift -= RADIX_TREE_MAP_SHIFT;
- }
+ } while (i < RADIX_TREE_MAP_SIZE);
- /* Bottom level: check items */
- for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
- if (slot->slots[i] == item) {
- *found_index = index + i;
- index = 0;
- goto out;
- }
- }
- index += RADIX_TREE_MAP_SIZE;
out:
+ if ((index == 0) && (i == RADIX_TREE_MAP_SIZE))
+ info->stop = true;
return index;
}
@@ -1316,32 +1302,35 @@
struct radix_tree_node *node;
unsigned long max_index;
unsigned long cur_index = 0;
- unsigned long found_index = -1;
+ struct locate_info info = {
+ .found_index = -1,
+ .stop = false,
+ };
do {
rcu_read_lock();
node = rcu_dereference_raw(root->rnode);
- if (!radix_tree_is_indirect_ptr(node)) {
+ if (!radix_tree_is_internal_node(node)) {
rcu_read_unlock();
if (node == item)
- found_index = 0;
+ info.found_index = 0;
break;
}
- node = indirect_to_ptr(node);
- max_index = radix_tree_maxindex(node->path &
- RADIX_TREE_HEIGHT_MASK);
+ node = entry_to_node(node);
+
+ max_index = node_maxindex(node);
if (cur_index > max_index) {
rcu_read_unlock();
break;
}
- cur_index = __locate(node, item, cur_index, &found_index);
+ cur_index = __locate(node, item, cur_index, &info);
rcu_read_unlock();
cond_resched();
- } while (cur_index != 0 && cur_index <= max_index);
+ } while (!info.stop && cur_index <= max_index);
- return found_index;
+ return info.found_index;
}
#else
unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
@@ -1351,47 +1340,45 @@
#endif /* CONFIG_SHMEM && CONFIG_SWAP */
/**
- * radix_tree_shrink - shrink height of a radix tree to minimal
+ * radix_tree_shrink - shrink radix tree to minimum height
* @root radix tree root
*/
-static inline void radix_tree_shrink(struct radix_tree_root *root)
+static inline bool radix_tree_shrink(struct radix_tree_root *root)
{
- /* try to shrink tree height */
- while (root->height > 0) {
- struct radix_tree_node *to_free = root->rnode;
- struct radix_tree_node *slot;
+ bool shrunk = false;
- BUG_ON(!radix_tree_is_indirect_ptr(to_free));
- to_free = indirect_to_ptr(to_free);
+ for (;;) {
+ struct radix_tree_node *node = root->rnode;
+ struct radix_tree_node *child;
+
+ if (!radix_tree_is_internal_node(node))
+ break;
+ node = entry_to_node(node);
/*
* The candidate node has more than one child, or its child
- * is not at the leftmost slot, or it is a multiorder entry,
- * we cannot shrink.
+ * is not at the leftmost slot, or the child is a multiorder
+ * entry, we cannot shrink.
*/
- if (to_free->count != 1)
+ if (node->count != 1)
break;
- slot = to_free->slots[0];
- if (!slot)
+ child = node->slots[0];
+ if (!child)
break;
+ if (!radix_tree_is_internal_node(child) && node->shift)
+ break;
+
+ if (radix_tree_is_internal_node(child))
+ entry_to_node(child)->parent = NULL;
/*
* We don't need rcu_assign_pointer(), since we are simply
* moving the node from one part of the tree to another: if it
* was safe to dereference the old pointer to it
- * (to_free->slots[0]), it will be safe to dereference the new
+ * (node->slots[0]), it will be safe to dereference the new
* one (root->rnode) as far as dependent read barriers go.
*/
- if (root->height > 1) {
- if (!radix_tree_is_indirect_ptr(slot))
- break;
-
- slot = indirect_to_ptr(slot);
- slot->parent = NULL;
- slot = ptr_to_indirect(slot);
- }
- root->rnode = slot;
- root->height--;
+ root->rnode = child;
/*
* We have a dilemma here. The node's slot[0] must not be
@@ -1403,7 +1390,7 @@
* their slot to become empty sooner or later.
*
* For example, lockless pagecache will look up a slot, deref
- * the page pointer, and if the page is 0 refcount it means it
+ * the page pointer, and if the page has 0 refcount it means it
* was concurrently deleted from pagecache so try the deref
* again. Fortunately there is already a requirement for logic
* to retry the entire slot lookup -- the indirect pointer
@@ -1411,12 +1398,14 @@
* also results in a stale slot). So tag the slot as indirect
* to force callers to retry.
*/
- if (root->height == 0)
- *((unsigned long *)&to_free->slots[0]) |=
- RADIX_TREE_INDIRECT_PTR;
+ if (!radix_tree_is_internal_node(child))
+ node->slots[0] = RADIX_TREE_RETRY;
- radix_tree_node_free(to_free);
+ radix_tree_node_free(node);
+ shrunk = true;
}
+
+ return shrunk;
}
/**
@@ -1439,24 +1428,17 @@
struct radix_tree_node *parent;
if (node->count) {
- if (node == indirect_to_ptr(root->rnode)) {
- radix_tree_shrink(root);
- if (root->height == 0)
- deleted = true;
- }
+ if (node == entry_to_node(root->rnode))
+ deleted |= radix_tree_shrink(root);
return deleted;
}
parent = node->parent;
if (parent) {
- unsigned int offset;
-
- offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
- parent->slots[offset] = NULL;
+ parent->slots[node->offset] = NULL;
parent->count--;
} else {
root_tag_clear_all(root);
- root->height = 0;
root->rnode = NULL;
}
@@ -1469,6 +1451,20 @@
return deleted;
}
+static inline void delete_sibling_entries(struct radix_tree_node *node,
+ void *ptr, unsigned offset)
+{
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ int i;
+ for (i = 1; offset + i < RADIX_TREE_MAP_SIZE; i++) {
+ if (node->slots[offset + i] != ptr)
+ break;
+ node->slots[offset + i] = NULL;
+ node->count--;
+ }
+#endif
+}
+
/**
* radix_tree_delete_item - delete an item from a radix tree
* @root: radix tree root
@@ -1484,7 +1480,7 @@
unsigned long index, void *item)
{
struct radix_tree_node *node;
- unsigned int offset, i;
+ unsigned int offset;
void **slot;
void *entry;
int tag;
@@ -1502,24 +1498,13 @@
return entry;
}
- offset = index & RADIX_TREE_MAP_MASK;
+ offset = get_slot_offset(node, slot);
- /*
- * Clear all tags associated with the item to be deleted.
- * This way of doing it would be inefficient, but seldom is any set.
- */
- for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (tag_get(node, tag, offset))
- radix_tree_tag_clear(root, index, tag);
- }
+ /* Clear all tags associated with the item to be deleted. */
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ node_tag_clear(root, node, tag, offset);
- /* Delete any sibling slots pointing to this slot */
- for (i = 1; offset + i < RADIX_TREE_MAP_SIZE; i++) {
- if (node->slots[offset + i] != ptr_to_indirect(slot))
- break;
- node->slots[offset + i] = NULL;
- node->count--;
- }
+ delete_sibling_entries(node, node_to_entry(slot), offset);
node->slots[offset] = NULL;
node->count--;
@@ -1544,6 +1529,28 @@
}
EXPORT_SYMBOL(radix_tree_delete);
+struct radix_tree_node *radix_tree_replace_clear_tags(
+ struct radix_tree_root *root,
+ unsigned long index, void *entry)
+{
+ struct radix_tree_node *node;
+ void **slot;
+
+ __radix_tree_lookup(root, index, &node, &slot);
+
+ if (node) {
+ unsigned int tag, offset = get_slot_offset(node, slot);
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+ node_tag_clear(root, node, tag, offset);
+ } else {
+ /* Clear root node tags */
+ root->gfp_mask &= __GFP_BITS_MASK;
+ }
+
+ radix_tree_replace_slot(slot, entry);
+ return node;
+}
+
/**
* radix_tree_tagged - test whether any items in the tree are tagged
* @root: radix tree root
@@ -1564,45 +1571,24 @@
INIT_LIST_HEAD(&node->private_list);
}
-static __init unsigned long __maxindex(unsigned int height)
-{
- unsigned int width = height * RADIX_TREE_MAP_SHIFT;
- int shift = RADIX_TREE_INDEX_BITS - width;
-
- if (shift < 0)
- return ~0UL;
- if (shift >= BITS_PER_LONG)
- return 0UL;
- return ~0UL >> shift;
-}
-
-static __init void radix_tree_init_maxindex(void)
-{
- unsigned int i;
-
- for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
- height_to_maxindex[i] = __maxindex(i);
-}
-
static int radix_tree_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
+ unsigned long action, void *hcpu)
{
- int cpu = (long)hcpu;
- struct radix_tree_preload *rtp;
- struct radix_tree_node *node;
+ int cpu = (long)hcpu;
+ struct radix_tree_preload *rtp;
+ struct radix_tree_node *node;
- /* Free per-cpu pool of perloaded nodes */
- if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
- rtp = &per_cpu(radix_tree_preloads, cpu);
- while (rtp->nr) {
+ /* Free per-cpu pool of preloaded nodes */
+ if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
+ rtp = &per_cpu(radix_tree_preloads, cpu);
+ while (rtp->nr) {
node = rtp->nodes;
rtp->nodes = node->private_data;
kmem_cache_free(radix_tree_node_cachep, node);
rtp->nr--;
- }
- }
- return NOTIFY_OK;
+ }
+ }
+ return NOTIFY_OK;
}
void __init radix_tree_init(void)
@@ -1611,6 +1597,5 @@
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
- radix_tree_init_maxindex();
hotcpu_notifier(radix_tree_callback, 0);
}
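
The radix_tree_replace_clear_tags() helper added above folds slot replacement and tag clearing into a single call, which is what the page-cache deletion rework in the mm/filemap.c hunk further down relies on. A minimal caller sketch, not part of the patch (the wrapper name is made up; the helper and the workingset accounting call are the ones appearing in this series):

	static void replace_slot_with_shadow(struct radix_tree_root *tree,
					     unsigned long index, void *shadow)
	{
		struct radix_tree_node *node;

		/* Clears all tags at @index and installs @shadow in one step;
		 * returns the containing node, or NULL if the entry sat in the
		 * root. */
		node = radix_tree_replace_clear_tags(tree, index, shadow);
		if (node && shadow)
			workingset_node_shadows_inc(node);
	}
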
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 3384032..33f655e 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -1,5 +1,6 @@
#include <linux/compiler.h>
#include <linux/export.h>
+#include <linux/kasan-checks.h>
#include <linux/uaccess.h>
#include <linux/kernel.h>
#include <linux/errno.h>
@@ -109,6 +110,7 @@
unsigned long max = max_addr - src_addr;
long retval;
+ kasan_check_write(dst, count);
user_access_begin();
retval = do_strncpy_from_user(dst, src, count, max);
user_access_end();
diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index 82169fb..5e51872b 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -12,9 +12,12 @@
#define pr_fmt(fmt) "kasan test: %s " fmt, __func__
#include <linux/kernel.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/string.h>
+#include <linux/uaccess.h>
#include <linux/module.h>
static noinline void __init kmalloc_oob_right(void)
@@ -344,6 +347,70 @@
*(volatile char *)p;
}
+static noinline void __init ksize_unpoisons_memory(void)
+{
+ char *ptr;
+ size_t size = 123, real_size = size;
+
+ pr_info("ksize() unpoisons the whole allocated chunk\n");
+ ptr = kmalloc(size, GFP_KERNEL);
+ if (!ptr) {
+ pr_err("Allocation failed\n");
+ return;
+ }
+ real_size = ksize(ptr);
+ /* This access doesn't trigger an error. */
+ ptr[size] = 'x';
+ /* This one does. */
+ ptr[real_size] = 'y';
+ kfree(ptr);
+}
+
+static noinline void __init copy_user_test(void)
+{
+ char *kmem;
+ char __user *usermem;
+ size_t size = 10;
+ int unused;
+
+ kmem = kmalloc(size, GFP_KERNEL);
+ if (!kmem)
+ return;
+
+ usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE,
+ PROT_READ | PROT_WRITE | PROT_EXEC,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0);
+ if (IS_ERR(usermem)) {
+ pr_err("Failed to allocate user memory\n");
+ kfree(kmem);
+ return;
+ }
+
+ pr_info("out-of-bounds in copy_from_user()\n");
+ unused = copy_from_user(kmem, usermem, size + 1);
+
+ pr_info("out-of-bounds in copy_to_user()\n");
+ unused = copy_to_user(usermem, kmem, size + 1);
+
+ pr_info("out-of-bounds in __copy_from_user()\n");
+ unused = __copy_from_user(kmem, usermem, size + 1);
+
+ pr_info("out-of-bounds in __copy_to_user()\n");
+ unused = __copy_to_user(usermem, kmem, size + 1);
+
+ pr_info("out-of-bounds in __copy_from_user_inatomic()\n");
+ unused = __copy_from_user_inatomic(kmem, usermem, size + 1);
+
+ pr_info("out-of-bounds in __copy_to_user_inatomic()\n");
+ unused = __copy_to_user_inatomic(usermem, kmem, size + 1);
+
+ pr_info("out-of-bounds in strncpy_from_user()\n");
+ unused = strncpy_from_user(kmem, usermem, size + 1);
+
+ vm_munmap((unsigned long)usermem, PAGE_SIZE);
+ kfree(kmem);
+}
+
static int __init kmalloc_tests_init(void)
{
kmalloc_oob_right();
@@ -367,6 +434,8 @@
kmem_cache_oob();
kasan_stack_oob();
kasan_global_oob();
+ ksize_unpoisons_memory();
+ copy_user_test();
return -EAGAIN;
}
diff --git a/lib/uuid.c b/lib/uuid.c
index 398821e..e116ae5 100644
--- a/lib/uuid.c
+++ b/lib/uuid.c
@@ -1,7 +1,7 @@
/*
* Unified UUID/GUID definition
*
- * Copyright (C) 2009, Intel Corp.
+ * Copyright (C) 2009, 2016 Intel Corp.
* Huang Ying <ying.huang@intel.com>
*
* This program is free software; you can redistribute it and/or
@@ -12,17 +12,40 @@
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
#include <linux/kernel.h>
+#include <linux/ctype.h>
+#include <linux/errno.h>
#include <linux/export.h>
#include <linux/uuid.h>
#include <linux/random.h>
+const u8 uuid_le_index[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
+EXPORT_SYMBOL(uuid_le_index);
+const u8 uuid_be_index[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
+EXPORT_SYMBOL(uuid_be_index);
+
+/***************************************************************
+ * Random UUID interface
+ *
+ * Used here for a Boot ID, but can be useful for other kernel
+ * drivers.
+ ***************************************************************/
+
+/*
+ * Generate random UUID
+ */
+void generate_random_uuid(unsigned char uuid[16])
+{
+ get_random_bytes(uuid, 16);
+ /* Set UUID version to 4 --- truly random generation */
+ uuid[6] = (uuid[6] & 0x0F) | 0x40;
+ /* Set the UUID variant to DCE */
+ uuid[8] = (uuid[8] & 0x3F) | 0x80;
+}
+EXPORT_SYMBOL(generate_random_uuid);
+
static void __uuid_gen_common(__u8 b[16])
{
prandom_bytes(b, 16);
@@ -45,3 +68,61 @@
bu->b[6] = (bu->b[6] & 0x0F) | 0x40;
}
EXPORT_SYMBOL_GPL(uuid_be_gen);
+
+/**
+ * uuid_is_valid - checks if a UUID string is valid
+ * @uuid: UUID string to check
+ *
+ * Description:
+ * It checks whether the UUID string follows the format:
+ * xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+ * where x is a hex digit.
+ *
+ * Return: true if input is valid UUID string.
+ */
+bool uuid_is_valid(const char *uuid)
+{
+ unsigned int i;
+
+ for (i = 0; i < UUID_STRING_LEN; i++) {
+ if (i == 8 || i == 13 || i == 18 || i == 23) {
+ if (uuid[i] != '-')
+ return false;
+ } else if (!isxdigit(uuid[i])) {
+ return false;
+ }
+ }
+
+ return true;
+}
+EXPORT_SYMBOL(uuid_is_valid);
+
+static int __uuid_to_bin(const char *uuid, __u8 b[16], const u8 ei[16])
+{
+ static const u8 si[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
+ unsigned int i;
+
+ if (!uuid_is_valid(uuid))
+ return -EINVAL;
+
+ for (i = 0; i < 16; i++) {
+ int hi = hex_to_bin(uuid[si[i] + 0]);
+ int lo = hex_to_bin(uuid[si[i] + 1]);
+
+ b[ei[i]] = (hi << 4) | lo;
+ }
+
+ return 0;
+}
+
+int uuid_le_to_bin(const char *uuid, uuid_le *u)
+{
+ return __uuid_to_bin(uuid, u->b, uuid_le_index);
+}
+EXPORT_SYMBOL(uuid_le_to_bin);
+
+int uuid_be_to_bin(const char *uuid, uuid_be *u)
+{
+ return __uuid_to_bin(uuid, u->b, uuid_be_index);
+}
+EXPORT_SYMBOL(uuid_be_to_bin);
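
A short usage sketch for the string-to-binary helpers added above; the wrapper function is hypothetical, while uuid_is_valid() and uuid_le_to_bin() are the interfaces from this hunk:

	static int parse_uuid_param(const char *str, uuid_le *out)
	{
		/* 36 characters: hex digits with dashes at positions 8, 13, 18, 23 */
		if (!uuid_is_valid(str))
			return -EINVAL;
		/* Fills out->b using the byte order from uuid_le_index;
		 * also returns -EINVAL itself on malformed input. */
		return uuid_le_to_bin(str, out);
	}
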
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index ccb664b..0967771 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -30,6 +30,7 @@
#include <linux/ioport.h>
#include <linux/dcache.h>
#include <linux/cred.h>
+#include <linux/uuid.h>
#include <net/addrconf.h>
#ifdef CONFIG_BLOCK
#include <linux/blkdev.h>
@@ -1304,19 +1305,17 @@
char *uuid_string(char *buf, char *end, const u8 *addr,
struct printf_spec spec, const char *fmt)
{
- char uuid[sizeof("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")];
+ char uuid[UUID_STRING_LEN + 1];
char *p = uuid;
int i;
- static const u8 be[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
- static const u8 le[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
- const u8 *index = be;
+ const u8 *index = uuid_be_index;
bool uc = false;
switch (*(++fmt)) {
case 'L':
uc = true; /* fall-through */
case 'l':
- index = le;
+ index = uuid_le_index;
break;
case 'B':
uc = true;
@@ -1324,7 +1323,10 @@
}
for (i = 0; i < 16; i++) {
- p = hex_byte_pack(p, addr[index[i]]);
+ if (uc)
+ p = hex_byte_pack_upper(p, addr[index[i]]);
+ else
+ p = hex_byte_pack(p, addr[index[i]]);
switch (i) {
case 3:
case 5:
@@ -1337,13 +1339,6 @@
*p = 0;
- if (uc) {
- p = uuid;
- do {
- *p = toupper(*p);
- } while (*(++p));
- }
-
return string(buf, end, uuid, spec);
}
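
The uuid_string() change above only affects how %pU picks its byte order and letter case; a hedged example of the four existing variants, using the generate_random_uuid() helper exported in the lib/uuid.c hunk earlier:

	u8 uuid[16];

	generate_random_uuid(uuid);
	pr_info("be %pUb BE %pUB le %pUl LE %pUL\n", uuid, uuid, uuid, uuid);
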
diff --git a/mm/Kconfig b/mm/Kconfig
index b0432b7..2664c11 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -404,6 +404,7 @@
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
select COMPACTION
+ select RADIX_TREE_MULTIORDER
help
Transparent Hugepages allows the kernel to use huge pages and
huge tlb transparently to the applications whenever possible.
@@ -567,7 +568,7 @@
zsmalloc.
config ZBUD
- tristate "Low density storage for compressed pages"
+ tristate "Low (Up to 2x) density storage for compressed pages"
default n
help
A special purpose allocator for storing compressed pages.
@@ -576,6 +577,16 @@
deterministic reclaim properties that make it preferable to a higher
density approach when reclaim will be used.
+config Z3FOLD
+ tristate "Up to 3x density storage for compressed pages"
+ depends on ZPOOL
+ default n
+ help
+ A special purpose allocator for storing compressed pages.
+ It is designed to store up to three compressed pages per physical
+ page. It is a ZBUD derivative so the simplicity and determinism are
+ still there.
+
config ZSMALLOC
tristate "Memory allocator for compressed pages"
depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index deb467e..78c6f7d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -89,6 +89,7 @@
obj-$(CONFIG_ZPOOL) += zpool.o
obj-$(CONFIG_ZBUD) += zbud.o
obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
+obj-$(CONFIG_Z3FOLD) += z3fold.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0c6317b..ed173b8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
@@ -979,20 +978,7 @@
*/
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
-
- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that a particular
- * WQ is congested if the worker thread is looping without
- * ever sleeping. Therefore we have to do a short sleep
- * here rather than calling cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
+ cond_resched();
/* In case we scheduled, work out time remaining */
ret = timeout - (jiffies - start);
if (ret < 0)
diff --git a/mm/compaction.c b/mm/compaction.c
index eda3c22..1427366 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1229,7 +1229,7 @@
return order == -1;
}
-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
@@ -1252,7 +1252,10 @@
if (cc->direct_compaction)
zone->compact_blockskip_flush = true;
- return COMPACT_COMPLETE;
+ if (cc->whole_zone)
+ return COMPACT_COMPLETE;
+ else
+ return COMPACT_PARTIAL_SKIPPED;
}
if (is_via_compact_memory(cc->order))
@@ -1292,8 +1295,9 @@
return COMPACT_NO_SUITABLE_PAGE;
}
-static int compact_finished(struct zone *zone, struct compact_control *cc,
- const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+ struct compact_control *cc,
+ const int migratetype)
{
int ret;
@@ -1312,9 +1316,10 @@
* COMPACT_PARTIAL - If the allocation would succeed without compaction
* COMPACT_CONTINUE - If compaction should run now
*/
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
- int classzone_idx)
+ int classzone_idx,
+ unsigned long wmark_target)
{
int fragindex;
unsigned long watermark;
@@ -1337,7 +1342,8 @@
* allocated and for a short time, the footprint is higher
*/
watermark += (2UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
+ if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
+ alloc_flags, wmark_target))
return COMPACT_SKIPPED;
/*
@@ -1358,13 +1364,14 @@
return COMPACT_CONTINUE;
}
-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
- ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
+ ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
+ zone_page_state(zone, NR_FREE_PAGES));
trace_mm_compaction_suitable(zone, order, ret);
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED;
@@ -1372,9 +1379,42 @@
return ret;
}
-static int compact_zone(struct zone *zone, struct compact_control *cc)
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags)
{
- int ret;
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure at least one zone would pass __compaction_suitable if we continue
+ * retrying the reclaim.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+ enum compact_result compact_result;
+
+ /*
+ * Do not consider all the reclaimable memory because we do not
+ * want to thrash just for a single high-order allocation which
+ * is not even guaranteed to appear even if __compaction_suitable
+ * is happy about the watermark check.
+ */
+ available = zone_reclaimable_pages(zone) / order;
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+ compact_result = __compaction_suitable(zone, order, alloc_flags,
+ ac_classzone_idx(ac), available);
+ if (compact_result != COMPACT_SKIPPED &&
+ compact_result != COMPACT_NOT_SUITABLE_ZONE)
+ return true;
+ }
+
+ return false;
+}
+
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
+{
+ enum compact_result ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1382,15 +1422,12 @@
ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
- switch (ret) {
- case COMPACT_PARTIAL:
- case COMPACT_SKIPPED:
- /* Compaction is likely to fail */
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
return ret;
- case COMPACT_CONTINUE:
- /* Fall through to compaction */
- ;
- }
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);
/*
* Clear pageblock skip if there were failures recently and compaction
@@ -1415,6 +1452,10 @@
zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
+
+ if (cc->migrate_pfn == start_pfn)
+ cc->whole_zone = true;
+
cc->last_migrated_pfn = 0;
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
@@ -1530,11 +1571,11 @@
return ret;
}
-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum migrate_mode mode, int *contended,
unsigned int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -1572,7 +1613,7 @@
*
* This is the main entry point for direct page compaction.
*/
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
{
@@ -1580,7 +1621,7 @@
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_SKIPPED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
*contended = COMPACT_CONTENDED_NONE;
@@ -1594,11 +1635,13 @@
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
- int status;
+ enum compact_result status;
int zone_contended;
- if (compaction_deferred(zone, order))
+ if (compaction_deferred(zone, order)) {
+ rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
+ }
status = compact_zone_order(zone, order, gfp_mask, mode,
&zone_contended, alloc_flags,
@@ -1634,7 +1677,8 @@
goto break_loop;
}
- if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
+ if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+ status == COMPACT_PARTIAL_SKIPPED)) {
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
@@ -1669,7 +1713,7 @@
* If at least one zone wasn't deferred or skipped, we report if all
* zones that were tried were lock contended.
*/
- if (rc > COMPACT_SKIPPED && all_zones_contended)
+ if (rc > COMPACT_INACTIVE && all_zones_contended)
*contended = COMPACT_CONTENDED_LOCK;
return rc;
@@ -1818,7 +1862,7 @@
struct zone *zone;
enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
- for (zoneid = 0; zoneid < classzone_idx; zoneid++) {
+ for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
zone = &pgdat->node_zones[zoneid];
if (!populated_zone(zone))
@@ -1853,7 +1897,7 @@
cc.classzone_idx);
count_vm_event(KCOMPACTD_WAKE);
- for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) {
+ for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
int status;
zone = &pgdat->node_zones[zoneid];
@@ -1881,7 +1925,7 @@
cc.classzone_idx, 0)) {
success = true;
compaction_defer_reset(zone, cc.order, false);
- } else if (status == COMPACT_COMPLETE) {
+ } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
/*
* We use sync migration mode here, so we defer like
* sync direct compaction does.
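
To make the compaction_zonelist_suitable() heuristic above concrete with made-up numbers: for an order-3 request against a zone with 40000 reclaimable and 1000 free pages, the watermark target handed to __compaction_suitable() is 40000 / 3 + 1000 = 14333 pages, so only a fraction of the reclaimable memory is assumed to become available for compaction rather than all of it.
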
diff --git a/mm/filemap.c b/mm/filemap.c
index 0169033..9665b1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -114,14 +114,11 @@
struct page *page, void *shadow)
{
struct radix_tree_node *node;
- unsigned long index;
- unsigned int offset;
- unsigned int tag;
- void **slot;
VM_BUG_ON(!PageLocked(page));
- __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+ node = radix_tree_replace_clear_tags(&mapping->page_tree, page->index,
+ shadow);
if (shadow) {
mapping->nrexceptional++;
@@ -135,23 +132,9 @@
}
mapping->nrpages--;
- if (!node) {
- /* Clear direct pointer tags in root node */
- mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
- radix_tree_replace_slot(slot, shadow);
+ if (!node)
return;
- }
- /* Clear tree tags for the removed page */
- index = page->index;
- offset = index & RADIX_TREE_MAP_MASK;
- for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (test_bit(offset, node->tags[tag]))
- radix_tree_tag_clear(&mapping->page_tree, index, tag);
- }
-
- /* Delete page, swap shadow entry */
- radix_tree_replace_slot(slot, shadow);
workingset_node_pages_dec(node);
if (shadow)
workingset_node_shadows_inc(node);
@@ -713,8 +696,12 @@
* The page might have been evicted from cache only
* recently, in which case it should be activated like
* any other repeatedly accessed page.
+ * The exception is pages getting rewritten; evicting other
+ * data from the working set, only to cache data that will
+ * get overwritten with something else, is a waste of memory.
*/
- if (shadow && workingset_refault(shadow)) {
+ if (!(gfp_mask & __GFP_WRITE) &&
+ shadow && workingset_refault(shadow)) {
SetPageActive(page);
workingset_activation(page);
} else
@@ -2187,7 +2174,7 @@
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
- do_set_pte(vma, addr, page, pte, false, false);
+ do_set_pte(vma, addr, page, pte, false, false, true);
unlock_page(page);
goto next;
unlock:
@@ -2574,7 +2561,7 @@
pgoff_t index, unsigned flags)
{
struct page *page;
- int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
+ int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
if (flags & AOP_FLAG_NOFS)
fgp_flags |= FGP_NOFS;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 66675ee..41ef754 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -89,6 +89,7 @@
static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static unsigned long khugepaged_sleep_expire;
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
@@ -467,6 +468,7 @@
return -EINVAL;
khugepaged_scan_sleep_millisecs = msecs;
+ khugepaged_sleep_expire = 0;
wake_up_interruptible(&khugepaged_wait);
return count;
@@ -494,6 +496,7 @@
return -EINVAL;
khugepaged_alloc_sleep_millisecs = msecs;
+ khugepaged_sleep_expire = 0;
wake_up_interruptible(&khugepaged_wait);
return count;
@@ -764,10 +767,7 @@
static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
+ return pmd_mkhuge(mk_pmd(page, prot));
}
static inline struct list_head *page_deferred_list(struct page *page)
@@ -2794,15 +2794,25 @@
put_page(hpage);
}
+static bool khugepaged_should_wakeup(void)
+{
+ return kthread_should_stop() ||
+ time_after_eq(jiffies, khugepaged_sleep_expire);
+}
+
static void khugepaged_wait_work(void)
{
if (khugepaged_has_work()) {
- if (!khugepaged_scan_sleep_millisecs)
+ const unsigned long scan_sleep_jiffies =
+ msecs_to_jiffies(khugepaged_scan_sleep_millisecs);
+
+ if (!scan_sleep_jiffies)
return;
+ khugepaged_sleep_expire = jiffies + scan_sleep_jiffies;
wait_event_freezable_timeout(khugepaged_wait,
- kthread_should_stop(),
- msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
+ khugepaged_should_wakeup(),
+ scan_sleep_jiffies);
return;
}
@@ -3026,8 +3036,10 @@
return;
/*
- * Caller holds the mmap_sem write mode, so a huge pmd cannot
- * materialize from under us.
+ * Caller holds the mmap_sem write mode or the anon_vma lock,
+ * so a huge pmd cannot materialize from under us (khugepaged
+ * holds both the mmap_sem write mode and the anon_vma lock
+ * write mode).
*/
__split_huge_pmd(vma, pmd, address, freeze);
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index d8fb10d..eec1150 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -67,26 +67,42 @@
return false;
}
+static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
+ struct hugetlb_cgroup *parent_h_cgroup)
+{
+ int idx;
+
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+ struct page_counter *counter = &h_cgroup->hugepage[idx];
+ struct page_counter *parent = NULL;
+ unsigned long limit;
+ int ret;
+
+ if (parent_h_cgroup)
+ parent = &parent_h_cgroup->hugepage[idx];
+ page_counter_init(counter, parent);
+
+ limit = round_down(PAGE_COUNTER_MAX,
+ 1 << huge_page_order(&hstates[idx]));
+ ret = page_counter_limit(counter, limit);
+ VM_BUG_ON(ret);
+ }
+}
+
static struct cgroup_subsys_state *
hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
struct hugetlb_cgroup *h_cgroup;
- int idx;
h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
if (!h_cgroup)
return ERR_PTR(-ENOMEM);
- if (parent_h_cgroup) {
- for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
- page_counter_init(&h_cgroup->hugepage[idx],
- &parent_h_cgroup->hugepage[idx]);
- } else {
+ if (!parent_h_cgroup)
root_h_cgroup = h_cgroup;
- for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
- page_counter_init(&h_cgroup->hugepage[idx], NULL);
- }
+
+ hugetlb_cgroup_init(h_cgroup, parent_h_cgroup);
return &h_cgroup->css;
}
@@ -285,6 +301,7 @@
return ret;
idx = MEMFILE_IDX(of_cft(of)->private);
+ nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_LIMIT:
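
Worked example for the rounding added above (numbers assume 4 KiB base pages and 2 MiB hugepages, so huge_page_order() is 9): writing a limit of 1000 pages to the RES_LIMIT file is now rounded down to round_down(1000, 1 << 9) = 512 pages, and the initial limit set in hugetlb_cgroup_init() is likewise a multiple of the hugepage size.
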
diff --git a/mm/internal.h b/mm/internal.h
index 3ac544f..f6f3353 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -174,6 +174,7 @@
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool direct_compaction; /* False from kcompactd or /proc/... */
+ bool whole_zone; /* Whole zone has been scanned */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const unsigned int alloc_flags; /* alloc flags of a direct compactor */
diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile
index 131daad..1548749 100644
--- a/mm/kasan/Makefile
+++ b/mm/kasan/Makefile
@@ -8,3 +8,4 @@
CFLAGS_kasan.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
obj-y := kasan.o report.o kasan_init.o
+obj-$(CONFIG_SLAB) += quarantine.o
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 38f1dd7..18b6a2b 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -273,32 +273,48 @@
return memory_is_poisoned_n(addr, size);
}
-
-static __always_inline void check_memory_region(unsigned long addr,
- size_t size, bool write)
+static __always_inline void check_memory_region_inline(unsigned long addr,
+ size_t size, bool write,
+ unsigned long ret_ip)
{
if (unlikely(size == 0))
return;
if (unlikely((void *)addr <
kasan_shadow_to_mem((void *)KASAN_SHADOW_START))) {
- kasan_report(addr, size, write, _RET_IP_);
+ kasan_report(addr, size, write, ret_ip);
return;
}
if (likely(!memory_is_poisoned(addr, size)))
return;
- kasan_report(addr, size, write, _RET_IP_);
+ kasan_report(addr, size, write, ret_ip);
}
-void __asan_loadN(unsigned long addr, size_t size);
-void __asan_storeN(unsigned long addr, size_t size);
+static void check_memory_region(unsigned long addr,
+ size_t size, bool write,
+ unsigned long ret_ip)
+{
+ check_memory_region_inline(addr, size, write, ret_ip);
+}
+
+void kasan_check_read(const void *p, unsigned int size)
+{
+ check_memory_region((unsigned long)p, size, false, _RET_IP_);
+}
+EXPORT_SYMBOL(kasan_check_read);
+
+void kasan_check_write(const void *p, unsigned int size)
+{
+ check_memory_region((unsigned long)p, size, true, _RET_IP_);
+}
+EXPORT_SYMBOL(kasan_check_write);
#undef memset
void *memset(void *addr, int c, size_t len)
{
- __asan_storeN((unsigned long)addr, len);
+ check_memory_region((unsigned long)addr, len, true, _RET_IP_);
return __memset(addr, c, len);
}
@@ -306,8 +322,8 @@
#undef memmove
void *memmove(void *dest, const void *src, size_t len)
{
- __asan_loadN((unsigned long)src, len);
- __asan_storeN((unsigned long)dest, len);
+ check_memory_region((unsigned long)src, len, false, _RET_IP_);
+ check_memory_region((unsigned long)dest, len, true, _RET_IP_);
return __memmove(dest, src, len);
}
@@ -315,8 +331,8 @@
#undef memcpy
void *memcpy(void *dest, const void *src, size_t len)
{
- __asan_loadN((unsigned long)src, len);
- __asan_storeN((unsigned long)dest, len);
+ check_memory_region((unsigned long)src, len, false, _RET_IP_);
+ check_memory_region((unsigned long)dest, len, true, _RET_IP_);
return __memcpy(dest, src, len);
}
@@ -388,6 +404,16 @@
}
#endif
+void kasan_cache_shrink(struct kmem_cache *cache)
+{
+ quarantine_remove_cache(cache);
+}
+
+void kasan_cache_destroy(struct kmem_cache *cache)
+{
+ quarantine_remove_cache(cache);
+}
+
void kasan_poison_slab(struct page *page)
{
kasan_poison_shadow(page_address(page),
@@ -482,7 +508,7 @@
kasan_kmalloc(cache, object, cache->object_size, flags);
}
-void kasan_slab_free(struct kmem_cache *cache, void *object)
+void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
{
unsigned long size = cache->object_size;
unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE);
@@ -491,18 +517,43 @@
if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
return;
+ kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
+}
+
+bool kasan_slab_free(struct kmem_cache *cache, void *object)
+{
#ifdef CONFIG_SLAB
- if (cache->flags & SLAB_KASAN) {
- struct kasan_free_meta *free_info =
- get_free_info(cache, object);
+ /* RCU slabs could be legally used after free within the RCU period */
+ if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ return false;
+
+ if (likely(cache->flags & SLAB_KASAN)) {
struct kasan_alloc_meta *alloc_info =
get_alloc_info(cache, object);
- alloc_info->state = KASAN_STATE_FREE;
- set_track(&free_info->track, GFP_NOWAIT);
- }
-#endif
+ struct kasan_free_meta *free_info =
+ get_free_info(cache, object);
- kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
+ switch (alloc_info->state) {
+ case KASAN_STATE_ALLOC:
+ alloc_info->state = KASAN_STATE_QUARANTINE;
+ quarantine_put(free_info, cache);
+ set_track(&free_info->track, GFP_NOWAIT);
+ kasan_poison_slab_free(cache, object);
+ return true;
+ case KASAN_STATE_QUARANTINE:
+ case KASAN_STATE_FREE:
+ pr_err("Double free");
+ dump_stack();
+ break;
+ default:
+ break;
+ }
+ }
+ return false;
+#else
+ kasan_poison_slab_free(cache, object);
+ return false;
+#endif
}
void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size,
@@ -511,6 +562,9 @@
unsigned long redzone_start;
unsigned long redzone_end;
+ if (flags & __GFP_RECLAIM)
+ quarantine_reduce();
+
if (unlikely(object == NULL))
return;
@@ -541,6 +595,9 @@
unsigned long redzone_start;
unsigned long redzone_end;
+ if (flags & __GFP_RECLAIM)
+ quarantine_reduce();
+
if (unlikely(ptr == NULL))
return;
@@ -649,22 +706,22 @@
}
EXPORT_SYMBOL(__asan_unregister_globals);
-#define DEFINE_ASAN_LOAD_STORE(size) \
- void __asan_load##size(unsigned long addr) \
- { \
- check_memory_region(addr, size, false); \
- } \
- EXPORT_SYMBOL(__asan_load##size); \
- __alias(__asan_load##size) \
- void __asan_load##size##_noabort(unsigned long); \
- EXPORT_SYMBOL(__asan_load##size##_noabort); \
- void __asan_store##size(unsigned long addr) \
- { \
- check_memory_region(addr, size, true); \
- } \
- EXPORT_SYMBOL(__asan_store##size); \
- __alias(__asan_store##size) \
- void __asan_store##size##_noabort(unsigned long); \
+#define DEFINE_ASAN_LOAD_STORE(size) \
+ void __asan_load##size(unsigned long addr) \
+ { \
+ check_memory_region_inline(addr, size, false, _RET_IP_);\
+ } \
+ EXPORT_SYMBOL(__asan_load##size); \
+ __alias(__asan_load##size) \
+ void __asan_load##size##_noabort(unsigned long); \
+ EXPORT_SYMBOL(__asan_load##size##_noabort); \
+ void __asan_store##size(unsigned long addr) \
+ { \
+ check_memory_region_inline(addr, size, true, _RET_IP_); \
+ } \
+ EXPORT_SYMBOL(__asan_store##size); \
+ __alias(__asan_store##size) \
+ void __asan_store##size##_noabort(unsigned long); \
EXPORT_SYMBOL(__asan_store##size##_noabort)
DEFINE_ASAN_LOAD_STORE(1);
@@ -675,7 +732,7 @@
void __asan_loadN(unsigned long addr, size_t size)
{
- check_memory_region(addr, size, false);
+ check_memory_region(addr, size, false, _RET_IP_);
}
EXPORT_SYMBOL(__asan_loadN);
@@ -685,7 +742,7 @@
void __asan_storeN(unsigned long addr, size_t size)
{
- check_memory_region(addr, size, true);
+ check_memory_region(addr, size, true, _RET_IP_);
}
EXPORT_SYMBOL(__asan_storeN);
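
kasan_check_read()/kasan_check_write() exported above are what the lib/strncpy_from_user.c hunk earlier in this patch starts calling; a minimal sketch of annotating a helper whose stores KASAN does not see on its own (the function below is hypothetical):

	#include <linux/kasan-checks.h>

	static void fill_words(void *dst, size_t bytes)
	{
		unsigned long *p = dst;
		size_t i;

		/* Report up front if any byte of the destination is poisoned;
		 * useful when the stores happen in code KASAN does not
		 * instrument, e.g. assembler or out-of-line __mem* routines. */
		kasan_check_write(dst, bytes);
		for (i = 0; i < bytes / sizeof(*p); i++)
			p[i] = 0;
	}
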
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index 30a2f0b..7f7ac51 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -62,6 +62,7 @@
enum kasan_state {
KASAN_STATE_INIT,
KASAN_STATE_ALLOC,
+ KASAN_STATE_QUARANTINE,
KASAN_STATE_FREE
};
@@ -79,9 +80,14 @@
u32 reserved;
};
+struct qlist_node {
+ struct qlist_node *next;
+};
struct kasan_free_meta {
- /* Allocator freelist pointer, unused by KASAN. */
- void **freelist;
+ /* This field is used while the object is in the quarantine.
+ * Otherwise it might be used for the allocator freelist.
+ */
+ struct qlist_node quarantine_link;
struct kasan_track track;
};
@@ -105,4 +111,15 @@
void kasan_report(unsigned long addr, size_t size,
bool is_write, unsigned long ip);
+#ifdef CONFIG_SLAB
+void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache);
+void quarantine_reduce(void);
+void quarantine_remove_cache(struct kmem_cache *cache);
+#else
+static inline void quarantine_put(struct kasan_free_meta *info,
+ struct kmem_cache *cache) { }
+static inline void quarantine_reduce(void) { }
+static inline void quarantine_remove_cache(struct kmem_cache *cache) { }
+#endif
+
#endif
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
new file mode 100644
index 0000000..4973505
--- /dev/null
+++ b/mm/kasan/quarantine.c
@@ -0,0 +1,291 @@
+/*
+ * KASAN quarantine.
+ *
+ * Author: Alexander Potapenko <glider@google.com>
+ * Copyright (C) 2016 Google, Inc.
+ *
+ * Based on code by Dmitry Chernenkov.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ */
+
+#include <linux/gfp.h>
+#include <linux/hash.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/percpu.h>
+#include <linux/printk.h>
+#include <linux/shrinker.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include "../slab.h"
+#include "kasan.h"
+
+/* Data structure and operations for quarantine queues. */
+
+/*
+ * Each queue is a singly linked list, which also stores the total size of
+ * objects inside of it.
+ */
+struct qlist_head {
+ struct qlist_node *head;
+ struct qlist_node *tail;
+ size_t bytes;
+};
+
+#define QLIST_INIT { NULL, NULL, 0 }
+
+static bool qlist_empty(struct qlist_head *q)
+{
+ return !q->head;
+}
+
+static void qlist_init(struct qlist_head *q)
+{
+ q->head = q->tail = NULL;
+ q->bytes = 0;
+}
+
+static void qlist_put(struct qlist_head *q, struct qlist_node *qlink,
+ size_t size)
+{
+ if (unlikely(qlist_empty(q)))
+ q->head = qlink;
+ else
+ q->tail->next = qlink;
+ q->tail = qlink;
+ qlink->next = NULL;
+ q->bytes += size;
+}
+
+static void qlist_move_all(struct qlist_head *from, struct qlist_head *to)
+{
+ if (unlikely(qlist_empty(from)))
+ return;
+
+ if (qlist_empty(to)) {
+ *to = *from;
+ qlist_init(from);
+ return;
+ }
+
+ to->tail->next = from->head;
+ to->tail = from->tail;
+ to->bytes += from->bytes;
+
+ qlist_init(from);
+}
+
+static void qlist_move(struct qlist_head *from, struct qlist_node *last,
+ struct qlist_head *to, size_t size)
+{
+ if (unlikely(last == from->tail)) {
+ qlist_move_all(from, to);
+ return;
+ }
+ if (qlist_empty(to))
+ to->head = from->head;
+ else
+ to->tail->next = from->head;
+ to->tail = last;
+ from->head = last->next;
+ last->next = NULL;
+ from->bytes -= size;
+ to->bytes += size;
+}
+
+
+/*
+ * The object quarantine consists of per-cpu queues and a global queue,
+ * guarded by quarantine_lock.
+ */
+static DEFINE_PER_CPU(struct qlist_head, cpu_quarantine);
+
+static struct qlist_head global_quarantine;
+static DEFINE_SPINLOCK(quarantine_lock);
+
+/* Maximum size of the global queue. */
+static unsigned long quarantine_size;
+
+/*
+ * The fraction of physical memory the quarantine is allowed to occupy.
+ * Quarantine doesn't support memory shrinker with SLAB allocator, so we keep
+ * the ratio low to avoid OOM.
+ */
+#define QUARANTINE_FRACTION 32
+
+#define QUARANTINE_LOW_SIZE (READ_ONCE(quarantine_size) * 3 / 4)
+#define QUARANTINE_PERCPU_SIZE (1 << 20)
+
+static struct kmem_cache *qlink_to_cache(struct qlist_node *qlink)
+{
+ return virt_to_head_page(qlink)->slab_cache;
+}
+
+static void *qlink_to_object(struct qlist_node *qlink, struct kmem_cache *cache)
+{
+ struct kasan_free_meta *free_info =
+ container_of(qlink, struct kasan_free_meta,
+ quarantine_link);
+
+ return ((void *)free_info) - cache->kasan_info.free_meta_offset;
+}
+
+static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache)
+{
+ void *object = qlink_to_object(qlink, cache);
+ struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object);
+ unsigned long flags;
+
+ local_irq_save(flags);
+ alloc_info->state = KASAN_STATE_FREE;
+ ___cache_free(cache, object, _THIS_IP_);
+ local_irq_restore(flags);
+}
+
+static void qlist_free_all(struct qlist_head *q, struct kmem_cache *cache)
+{
+ struct qlist_node *qlink;
+
+ if (unlikely(qlist_empty(q)))
+ return;
+
+ qlink = q->head;
+ while (qlink) {
+ struct kmem_cache *obj_cache =
+ cache ? cache : qlink_to_cache(qlink);
+ struct qlist_node *next = qlink->next;
+
+ qlink_free(qlink, obj_cache);
+ qlink = next;
+ }
+ qlist_init(q);
+}
+
+void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache)
+{
+ unsigned long flags;
+ struct qlist_head *q;
+ struct qlist_head temp = QLIST_INIT;
+
+ local_irq_save(flags);
+
+ q = this_cpu_ptr(&cpu_quarantine);
+ qlist_put(q, &info->quarantine_link, cache->size);
+ if (unlikely(q->bytes > QUARANTINE_PERCPU_SIZE))
+ qlist_move_all(q, &temp);
+
+ local_irq_restore(flags);
+
+ if (unlikely(!qlist_empty(&temp))) {
+ spin_lock_irqsave(&quarantine_lock, flags);
+ qlist_move_all(&temp, &global_quarantine);
+ spin_unlock_irqrestore(&quarantine_lock, flags);
+ }
+}
+
+void quarantine_reduce(void)
+{
+ size_t new_quarantine_size;
+ unsigned long flags;
+ struct qlist_head to_free = QLIST_INIT;
+ size_t size_to_free = 0;
+ struct qlist_node *last;
+
+ if (likely(READ_ONCE(global_quarantine.bytes) <=
+ READ_ONCE(quarantine_size)))
+ return;
+
+ spin_lock_irqsave(&quarantine_lock, flags);
+
+ /*
+ * Update quarantine size in case of hotplug. Allocate a fraction of
+ * the installed memory to quarantine minus per-cpu queue limits.
+ */
+ new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) /
+ QUARANTINE_FRACTION;
+ new_quarantine_size -= QUARANTINE_PERCPU_SIZE * num_online_cpus();
+ WRITE_ONCE(quarantine_size, new_quarantine_size);
+
+ last = global_quarantine.head;
+ while (last) {
+ struct kmem_cache *cache = qlink_to_cache(last);
+
+ size_to_free += cache->size;
+ if (!last->next || size_to_free >
+ global_quarantine.bytes - QUARANTINE_LOW_SIZE)
+ break;
+ last = last->next;
+ }
+ qlist_move(&global_quarantine, last, &to_free, size_to_free);
+
+ spin_unlock_irqrestore(&quarantine_lock, flags);
+
+ qlist_free_all(&to_free, NULL);
+}
+
+static void qlist_move_cache(struct qlist_head *from,
+ struct qlist_head *to,
+ struct kmem_cache *cache)
+{
+ struct qlist_node *prev = NULL, *curr;
+
+ if (unlikely(qlist_empty(from)))
+ return;
+
+ curr = from->head;
+ while (curr) {
+ struct qlist_node *qlink = curr;
+ struct kmem_cache *obj_cache = qlink_to_cache(qlink);
+
+ if (obj_cache == cache) {
+ if (unlikely(from->head == qlink)) {
+ from->head = curr->next;
+ prev = curr;
+ } else
+ prev->next = curr->next;
+ if (unlikely(from->tail == qlink))
+ from->tail = curr->next;
+ from->bytes -= cache->size;
+ qlist_put(to, qlink, cache->size);
+ } else {
+ prev = curr;
+ }
+ curr = curr->next;
+ }
+}
+
+static void per_cpu_remove_cache(void *arg)
+{
+ struct kmem_cache *cache = arg;
+ struct qlist_head to_free = QLIST_INIT;
+ struct qlist_head *q;
+
+ q = this_cpu_ptr(&cpu_quarantine);
+ qlist_move_cache(q, &to_free, cache);
+ qlist_free_all(&to_free, cache);
+}
+
+void quarantine_remove_cache(struct kmem_cache *cache)
+{
+ unsigned long flags;
+ struct qlist_head to_free = QLIST_INIT;
+
+ on_each_cpu(per_cpu_remove_cache, cache, 1);
+
+ spin_lock_irqsave(&quarantine_lock, flags);
+ qlist_move_cache(&global_quarantine, &to_free, cache);
+ spin_unlock_irqrestore(&quarantine_lock, flags);
+
+ qlist_free_all(&to_free, cache);
+}
diff --git a/mm/kasan/report.c b/mm/kasan/report.c
index 60869a5a..b3c122d 100644
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -151,6 +151,7 @@
print_track(&alloc_info->track);
break;
case KASAN_STATE_FREE:
+ case KASAN_STATE_QUARANTINE:
pr_err("Object freed, allocated with size %u bytes\n",
alloc_info->alloc_size);
free_info = get_free_info(cache, object);
diff --git a/mm/memblock.c b/mm/memblock.c
index b570ddd..ac12489 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -606,22 +606,14 @@
return memblock_add_range(&memblock.memory, base, size, nid, 0);
}
-static int __init_memblock memblock_add_region(phys_addr_t base,
- phys_addr_t size,
- int nid,
- unsigned long flags)
+int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
memblock_dbg("memblock_add: [%#016llx-%#016llx] flags %#02lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size - 1,
- flags, (void *)_RET_IP_);
+ 0UL, (void *)_RET_IP_);
- return memblock_add_range(&memblock.memory, base, size, nid, flags);
-}
-
-int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
-{
- return memblock_add_region(base, size, MAX_NUMNODES, 0);
+ return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
}
/**
@@ -732,22 +724,14 @@
return memblock_remove_range(&memblock.reserved, base, size);
}
-static int __init_memblock memblock_reserve_region(phys_addr_t base,
- phys_addr_t size,
- int nid,
- unsigned long flags)
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size - 1,
- flags, (void *)_RET_IP_);
+ 0UL, (void *)_RET_IP_);
- return memblock_add_range(&memblock.reserved, base, size, nid, flags);
-}
-
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
-{
- return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
+ return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
}
/**
@@ -840,7 +824,7 @@
{
struct memblock_type *type = &memblock.reserved;
- if (*idx >= 0 && *idx < type->cnt) {
+ if (*idx < type->cnt) {
struct memblock_region *r = &type->regions[*idx];
phys_addr_t base = r->base;
phys_addr_t size = r->size;
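
The simplification above removes the unused nid/flags indirection but leaves the public entry points untouched; for reference, a typical early-boot caller looks like this (addresses and sizes are made up):

	/* Register a bank of RAM with the early allocator and keep a
	 * firmware region out of the page allocator's hands. */
	memblock_add(0x100000000ULL, SZ_2G);
	memblock_reserve(0x80000000UL, SZ_1M);
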
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d71d387..b3f16ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2652,8 +2652,7 @@
}
/*
- * Reclaims as many pages from the given memcg as possible and moves
- * the rest to the parent.
+ * Reclaims as many pages from the given memcg as possible.
*
* Caller is responsible for holding css reference for memcg.
*/
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ca5acee..2fcca6b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -184,8 +184,8 @@
struct siginfo si;
int ret;
- pr_err("MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
- pfn, t->comm, t->pid);
+ pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
si.si_signo = SIGBUS;
si.si_errno = 0;
si.si_addr = (void *)addr;
@@ -208,7 +208,7 @@
ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
}
if (ret < 0)
- pr_info("MCE: Error sending signal to %s:%d: %d\n",
+ pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
t->comm, t->pid, ret);
return ret;
}
@@ -289,7 +289,7 @@
} else {
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
- pr_err("MCE: Out of memory while machine check handling\n");
+ pr_err("Memory failure: Out of memory while machine check handling\n");
return;
}
}
@@ -303,7 +303,7 @@
* a SIGKILL because the error is not contained anymore.
*/
if (tk->addr == -EFAULT) {
- pr_info("MCE: Unable to find user space address %lx in %s\n",
+ pr_info("Memory failure: Unable to find user space address %lx in %s\n",
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
}
@@ -334,7 +334,7 @@
* signal and then access the memory. Just kill it.
*/
if (fail || tk->addr_valid == 0) {
- pr_err("MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+ pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
pfn, tk->tsk->comm, tk->tsk->pid);
force_sig(SIGKILL, tk->tsk);
}
@@ -347,7 +347,7 @@
*/
else if (kill_proc(tk->tsk, tk->addr, trapno,
pfn, page, flags) < 0)
- pr_err("MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+ pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
pfn, tk->tsk->comm, tk->tsk->pid);
}
put_task_struct(tk->tsk);
@@ -559,7 +559,7 @@
*/
static int me_unknown(struct page *p, unsigned long pfn)
{
- pr_err("MCE %#lx: Unknown page state\n", pfn);
+ pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
return MF_FAILED;
}
@@ -604,11 +604,12 @@
if (mapping->a_ops->error_remove_page) {
err = mapping->a_ops->error_remove_page(mapping, p);
if (err != 0) {
- pr_info("MCE %#lx: Failed to punch page: %d\n",
+ pr_info("Memory failure: %#lx: Failed to punch page: %d\n",
pfn, err);
} else if (page_has_private(p) &&
!try_to_release_page(p, GFP_NOIO)) {
- pr_info("MCE %#lx: failed to release buffers\n", pfn);
+ pr_info("Memory failure: %#lx: failed to release buffers\n",
+ pfn);
} else {
ret = MF_RECOVERED;
}
@@ -620,7 +621,8 @@
if (invalidate_inode_page(p))
ret = MF_RECOVERED;
else
- pr_info("MCE %#lx: Failed to invalidate\n", pfn);
+ pr_info("Memory failure: %#lx: Failed to invalidate\n",
+ pfn);
}
return ret;
}
@@ -833,7 +835,7 @@
{
trace_memory_failure_event(pfn, type, result);
- pr_err("MCE %#lx: recovery action for %s: %s\n",
+ pr_err("Memory failure: %#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
}
@@ -849,7 +851,7 @@
if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
count--;
if (count != 0) {
- pr_err("MCE %#lx: %s still referenced by %d users\n",
+ pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
pfn, action_page_types[ps->type], count);
result = MF_FAILED;
}
@@ -882,7 +884,7 @@
* tries to touch the "partially handled" page.
*/
if (!PageAnon(head)) {
- pr_err("MCE: %#lx: non anonymous thp\n",
+ pr_err("Memory failure: %#lx: non anonymous thp\n",
page_to_pfn(page));
return 0;
}
@@ -892,7 +894,8 @@
if (head == compound_head(page))
return 1;
- pr_info("MCE: %#lx cannot catch tail\n", page_to_pfn(page));
+ pr_info("Memory failure: %#lx cannot catch tail\n",
+ page_to_pfn(page));
put_page(head);
}
@@ -931,12 +934,13 @@
return SWAP_SUCCESS;
if (PageKsm(p)) {
- pr_err("MCE %#lx: can't handle KSM pages.\n", pfn);
+ pr_err("Memory failure: %#lx: can't handle KSM pages.\n", pfn);
return SWAP_FAIL;
}
if (PageSwapCache(p)) {
- pr_err("MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+ pr_err("Memory failure: %#lx: keeping poisoned page in swap cache\n",
+ pfn);
ttu |= TTU_IGNORE_HWPOISON;
}
@@ -954,7 +958,7 @@
} else {
kill = 0;
ttu |= TTU_IGNORE_HWPOISON;
- pr_info("MCE %#lx: corrupted page was clean: dropped without side effects\n",
+ pr_info("Memory failure: %#lx: corrupted page was clean: dropped without side effects\n",
pfn);
}
}
@@ -972,7 +976,7 @@
ret = try_to_unmap(hpage, ttu);
if (ret != SWAP_SUCCESS)
- pr_err("MCE %#lx: failed to unmap page (mapcount=%d)\n",
+ pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
/*
@@ -1040,14 +1044,16 @@
panic("Memory failure from trap %d on page %lx", trapno, pfn);
if (!pfn_valid(pfn)) {
- pr_err("MCE %#lx: memory outside kernel control\n", pfn);
+ pr_err("Memory failure: %#lx: memory outside kernel control\n",
+ pfn);
return -ENXIO;
}
p = pfn_to_page(pfn);
orig_head = hpage = compound_head(p);
if (TestSetPageHWPoison(p)) {
- pr_err("MCE %#lx: already hardware poisoned\n", pfn);
+ pr_err("Memory failure: %#lx: already hardware poisoned\n",
+ pfn);
return 0;
}
@@ -1112,9 +1118,11 @@
if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
unlock_page(hpage);
if (!PageAnon(hpage))
- pr_err("MCE: %#lx: non anonymous thp\n", pfn);
+ pr_err("Memory failure: %#lx: non anonymous thp\n",
+ pfn);
else
- pr_err("MCE: %#lx: thp split failed\n", pfn);
+ pr_err("Memory failure: %#lx: thp split failed\n",
+ pfn);
if (TestClearPageHWPoison(p))
num_poisoned_pages_sub(nr_pages);
put_hwpoison_page(p);
@@ -1178,7 +1186,7 @@
* unpoison always clear PG_hwpoison inside page lock
*/
if (!PageHWPoison(p)) {
- pr_err("MCE %#lx: just unpoisoned\n", pfn);
+ pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
num_poisoned_pages_sub(nr_pages);
unlock_page(hpage);
put_hwpoison_page(hpage);
@@ -1395,25 +1403,25 @@
page = compound_head(p);
if (!PageHWPoison(p)) {
- unpoison_pr_info("MCE: Page was already unpoisoned %#lx\n",
+ unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
pfn, &unpoison_rs);
return 0;
}
if (page_count(page) > 1) {
- unpoison_pr_info("MCE: Someone grabs the hwpoison page %#lx\n",
+ unpoison_pr_info("Unpoison: Someone grabs the hwpoison page %#lx\n",
pfn, &unpoison_rs);
return 0;
}
if (page_mapped(page)) {
- unpoison_pr_info("MCE: Someone maps the hwpoison page %#lx\n",
+ unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n",
pfn, &unpoison_rs);
return 0;
}
if (page_mapping(page)) {
- unpoison_pr_info("MCE: the hwpoison page has non-NULL mapping %#lx\n",
+ unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
pfn, &unpoison_rs);
return 0;
}
@@ -1424,7 +1432,7 @@
* In such case, we yield to memory_failure() and make unpoison fail.
*/
if (!PageHuge(page) && PageTransHuge(page)) {
- unpoison_pr_info("MCE: Memory failure is now running on %#lx\n",
+ unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
pfn, &unpoison_rs);
return 0;
}
@@ -1439,13 +1447,13 @@
* to the end.
*/
if (PageHuge(page)) {
- unpoison_pr_info("MCE: Memory failure is now running on free hugepage %#lx\n",
+ unpoison_pr_info("Unpoison: Memory failure is now running on free hugepage %#lx\n",
pfn, &unpoison_rs);
return 0;
}
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
- unpoison_pr_info("MCE: Software-unpoisoned free page %#lx\n",
+ unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
pfn, &unpoison_rs);
return 0;
}
@@ -1458,7 +1466,7 @@
* the free buddy page pool.
*/
if (TestClearPageHWPoison(page)) {
- unpoison_pr_info("MCE: Software-unpoisoned page %#lx\n",
+ unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
pfn, &unpoison_rs);
num_poisoned_pages_sub(nr_pages);
freeit = 1;
diff --git a/mm/memory.c b/mm/memory.c
index 07493e3..a1b93d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1744,6 +1744,7 @@
unsigned long next;
unsigned long end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
+ unsigned long remap_pfn = pfn;
int err;
/*
@@ -1770,7 +1771,7 @@
vma->vm_pgoff = pfn;
}
- err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
+ err = track_pfn_remap(vma, &prot, remap_pfn, addr, PAGE_ALIGN(size));
if (err)
return -EINVAL;
@@ -1789,7 +1790,7 @@
} while (pgd++, addr = next, addr != end);
if (err)
- untrack_pfn(vma, pfn, PAGE_ALIGN(size));
+ untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size));
return err;
}
@@ -2875,7 +2876,7 @@
* vm_ops->map_pages.
*/
void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon)
+ struct page *page, pte_t *pte, bool write, bool anon, bool old)
{
pte_t entry;
@@ -2883,6 +2884,8 @@
entry = mk_pte(page, vma->vm_page_prot);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (old)
+ entry = pte_mkold(entry);
if (anon) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address, false);
@@ -2896,8 +2899,16 @@
update_mmu_cache(vma, address, pte);
}
+/*
+ * If the architecture emulates the "accessed" or "young" bit without HW
+ * support, there is not much gain with fault_around.
+ */
static unsigned long fault_around_bytes __read_mostly =
+#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+ PAGE_SIZE;
+#else
rounddown_pow_of_two(65536);
+#endif
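+/*
+ * rounddown_pow_of_two(65536) is 64KB, i.e. 16 pages with 4K pages, while
+ * the PAGE_SIZE default effectively limits fault-around to the faulting page.
+ */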
#ifdef CONFIG_DEBUG_FS
static int fault_around_bytes_get(void *data, u64 *val)
@@ -3020,9 +3031,20 @@
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
if (!pte_same(*pte, orig_pte))
goto unlock_out;
+ do_fault_around(vma, address, pte, pgoff, flags);
+ /* Check if the fault is handled by faultaround */
+ if (!pte_same(*pte, orig_pte)) {
+ /*
+ * Faultaround produces old ptes, but the pte we've
+ * handled the fault for should be young.
+ */
+ pte_t entry = pte_mkyoung(*pte);
+ if (ptep_set_access_flags(vma, address, pte, entry, 0))
+ update_mmu_cache(vma, address, pte);
+ goto unlock_out;
+ }
pte_unmap_unlock(pte, ptl);
}
@@ -3037,7 +3059,7 @@
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, false, false);
+ do_set_pte(vma, address, fault_page, pte, false, false, false);
unlock_page(fault_page);
unlock_out:
pte_unmap_unlock(pte, ptl);
@@ -3089,7 +3111,7 @@
}
goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
+ do_set_pte(vma, address, new_page, pte, true, true, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
@@ -3146,7 +3168,7 @@
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, true, false);
+ do_set_pte(vma, address, fault_page, pte, true, false, false);
pte_unmap_unlock(pte, ptl);
if (set_page_dirty(fault_page))
diff --git a/mm/mempool.c b/mm/mempool.c
index 9b7a14a..9e075f8 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -105,7 +105,7 @@
static void kasan_poison_element(mempool_t *pool, void *element)
{
if (pool->alloc == mempool_alloc_slab)
- kasan_slab_free(pool->pool_data, element);
+ kasan_poison_slab_free(pool->pool_data, element);
if (pool->alloc == mempool_kmalloc)
kasan_kfree(element);
if (pool->alloc == mempool_alloc_pages)
diff --git a/mm/migrate.c b/mm/migrate.c
index 53ab639..9baf41c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1171,6 +1171,7 @@
switch(rc) {
case -ENOMEM:
+ nr_failed++;
goto out;
case -EAGAIN:
retry++;
diff --git a/mm/mmap.c b/mm/mmap.c
index fba246b..b9274a0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -66,7 +66,7 @@
int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
#endif
-static bool ignore_rlimit_data = true;
+static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
static void unmap_region(struct mm_struct *mm,
@@ -2886,13 +2886,17 @@
if (is_data_mapping(flags) &&
mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) {
- if (ignore_rlimit_data)
- pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Will be forbidden soon.\n",
+ /* Workaround for Valgrind */
+ if (rlimit(RLIMIT_DATA) == 0 &&
+ mm->data_vm + npages <= rlimit_max(RLIMIT_DATA) >> PAGE_SHIFT)
+ return true;
+ if (!ignore_rlimit_data) {
+ pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Update limits or use boot option ignore_rlimit_data.\n",
current->comm, current->pid,
(mm->data_vm + npages) << PAGE_SHIFT,
rlimit(RLIMIT_DATA));
- else
return false;
+ }
}
return true;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 415f7eb..5bb2f76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -174,8 +174,13 @@
if (!p)
return 0;
+ /*
+ * Do not even consider tasks which are explicitly marked oom
+ * unkillable or have already been oom reaped.
+ */
adj = (long)p->signal->oom_score_adj;
- if (adj == OOM_SCORE_ADJ_MIN) {
+ if (adj == OOM_SCORE_ADJ_MIN ||
+ test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
task_unlock(p);
return 0;
}
@@ -278,12 +283,8 @@
* This task already has access to memory reserves and is being killed.
* Don't allow any other task to have access to the reserves.
*/
- if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
- if (!is_sysrq_oom(oc))
- return OOM_SCAN_ABORT;
- }
- if (!task->mm)
- return OOM_SCAN_CONTINUE;
+ if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
+ return OOM_SCAN_ABORT;
/*
* If task is allocating a lot of memory and has been marked to be
@@ -302,12 +303,12 @@
static struct task_struct *select_bad_process(struct oom_control *oc,
unsigned int *ppoints, unsigned long totalpages)
{
- struct task_struct *g, *p;
+ struct task_struct *p;
struct task_struct *chosen = NULL;
unsigned long chosen_points = 0;
rcu_read_lock();
- for_each_process_thread(g, p) {
+ for_each_process(p) {
unsigned int points;
switch (oom_scan_process_thread(oc, p, totalpages)) {
@@ -326,9 +327,6 @@
points = oom_badness(p, NULL, oc->nodemask, totalpages);
if (!points || points < chosen_points)
continue;
- /* Prefer thread group leaders for display purposes */
- if (points == chosen_points && thread_group_leader(chosen))
- continue;
chosen = p;
chosen_points = points;
@@ -441,7 +439,6 @@
static struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);
-
static bool __oom_reap_task(struct task_struct *tsk)
{
struct mmu_gather tlb;
@@ -513,9 +510,14 @@
* This task can be safely ignored because we cannot do much more
* to release its memory.
*/
- tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+ set_bit(MMF_OOM_REAPED, &mm->flags);
out:
- mmput(mm);
+ /*
+ * Drop our reference but make sure the mmput slow path is called from a
+ * different context because we shouldn't risk getting stuck there and
+ * putting the oom_reaper out of the way.
+ */
+ mmput_async(mm);
return ret;
}
@@ -664,6 +666,7 @@
/* OOM killer might race with memcg OOM */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
return;
+ atomic_inc(&tsk->signal->oom_victims);
/*
* Make sure that the task is woken up from uninterruptible sleep
* if it is frozen because OOM killer wouldn't be able to free
@@ -681,6 +684,7 @@
{
if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
return;
+ atomic_dec(&tsk->signal->oom_victims);
if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3b88795..b9956fd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -411,8 +411,8 @@
bg_thresh = thresh / 2;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- bg_thresh += bg_thresh / 4;
- thresh += thresh / 4;
+ bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
+ thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
dtc->thresh = thresh;
dtc->bg_thresh = bg_thresh;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c469c1..f8f3bfc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -522,12 +522,6 @@
static unsigned long nr_shown;
static unsigned long nr_unshown;
- /* Don't complain about poisoned pages */
- if (PageHWPoison(page)) {
- page_mapcount_reset(page); /* remove PageBuddy */
- return;
- }
-
/*
* Allow a burst of 60 reports, then keep quiet for that minute;
* or allow a steady drip of one report per second.
@@ -613,14 +607,7 @@
{
if (!buf)
return -EINVAL;
-
- if (strcmp(buf, "on") == 0)
- _debug_pagealloc_enabled = true;
-
- if (strcmp(buf, "off") == 0)
- _debug_pagealloc_enabled = false;
-
- return 0;
+ return kstrtobool(buf, &_debug_pagealloc_enabled);
}
early_param("debug_pagealloc", early_debug_pagealloc);
@@ -1000,7 +987,6 @@
trace_mm_page_free(page, order);
kmemcheck_free_shadow(page, order);
- kasan_free_pages(page, order);
/*
* Check tail pages before head page information is cleared to
@@ -1042,6 +1028,7 @@
arch_free_page(page, order);
kernel_poison_pages(page, 1 << order, 0);
kernel_map_pages(page, 1 << order, 0);
+ kasan_free_pages(page, order);
return true;
}
@@ -1212,7 +1199,7 @@
* marks the pages PageReserved. The remaining valid pages are later
* sent to the buddy page allocator.
*/
-void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
+void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);
@@ -1661,6 +1648,9 @@
if (unlikely(page->flags & __PG_HWPOISON)) {
bad_reason = "HWPoisoned (hardware-corrupted)";
bad_flags = __PG_HWPOISON;
+ /* Don't complain about hwpoisoned pages */
+ page_mapcount_reset(page); /* remove PageBuddy */
+ return;
}
if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set";
@@ -2750,10 +2740,9 @@
* one free page of a suitable size. Checking now avoids taking the zone lock
* to check in the allocation paths if no pages are free.
*/
-static bool __zone_watermark_ok(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx,
- unsigned int alloc_flags,
- long free_pages)
+bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
+ int classzone_idx, unsigned int alloc_flags,
+ long free_pages)
{
long min = mark;
int o;
@@ -3180,34 +3169,33 @@
return page;
}
+
+/*
+ * Maximum number of compaction retries with progress before the OOM
+ * killer is considered the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
#ifdef CONFIG_COMPACTION
/* Try memory compaction for high-order allocations before reclaim */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
- unsigned long compact_result;
struct page *page;
+ int contended_compaction;
if (!order)
return NULL;
current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- mode, contended_compaction);
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ mode, &contended_compaction);
current->flags &= ~PF_MEMALLOC;
- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_INACTIVE)
return NULL;
- default:
- break;
- }
/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -3233,19 +3221,112 @@
*/
count_vm_event(COMPACTFAIL);
+ /*
+ * In all zones where compaction was attempted (and not
+ * deferred or skipped), lock contention has been detected.
+ * For THP allocation we do not want to disrupt the others
+ * so we fall back to base pages instead.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ *compact_result = COMPACT_CONTENDED;
+
+ /*
+ * If compaction was aborted due to need_resched(), we do not
+ * want to further increase allocation latency, unless it is
+ * khugepaged trying to collapse.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_SCHED
+ && !(current->flags & PF_KTHREAD))
+ *compact_result = COMPACT_CONTENDED;
+
cond_resched();
return NULL;
}
+
+static inline bool
+should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+ enum compact_result compact_result, enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ int max_retries = MAX_COMPACT_RETRIES;
+
+ if (!order)
+ return false;
+
+ /*
+ * compaction considers all the zones as desperately out of memory
+ * so it doesn't really make much sense to retry except when the
+ * failure could be caused by weak migration mode.
+ */
+ if (compaction_failed(compact_result)) {
+ if (*migrate_mode == MIGRATE_ASYNC) {
+ *migrate_mode = MIGRATE_SYNC_LIGHT;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * Make sure the compaction wasn't deferred or didn't bail out early
+ * due to lock contention before we declare that we should give up.
+ * But do not retry if the given zonelist is not suitable for
+ * compaction.
+ */
+ if (compaction_withdrawn(compact_result))
+ return compaction_zonelist_suitable(ac, order, alloc_flags);
+
+ /*
+ * !costly requests are much more important than __GFP_REPEAT
+ * costly ones because they are de facto nofail and invoke OOM
+ * killer to move on while costly can fail and users are ready
+ * to cope with that. 1/4 retries is rather arbitrary but we
+ * would need much more detailed feedback from compaction to
+ * make a better decision.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_retries /= 4;
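+ /* i.e. at most MAX_COMPACT_RETRIES / 4 == 4 attempts for costly orders */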
+ if (compaction_retries <= max_retries)
+ return true;
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
+ *compact_result = COMPACT_SKIPPED;
return NULL;
}
+
+static inline bool
+should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
+ enum compact_result compact_result,
+ enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
+ return false;
+
+ /*
+ * There are setups with compaction disabled which would prefer to loop
+ * inside the allocator rather than hit the oom killer prematurely.
+ * Let's give them a good hope and keep retrying while the order-0
+ * watermarks are OK.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
+ ac_classzone_idx(ac), alloc_flags))
+ return true;
+ }
+ return false;
+}
#endif /* CONFIG_COMPACTION */
/* Perform direct synchronous page reclaim */
@@ -3377,6 +3458,101 @@
return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
}
+/*
+ * Maximum number of reclaim retries without any progress before the OOM
+ * killer is considered the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+ struct alloc_context *ac, int alloc_flags,
+ bool did_some_progress, int no_progress_loops)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure we converge to OOM if we cannot make any progress
+ * several times in a row.
+ */
+ if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;
+
+ /*
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
+ * If none of the target zones can satisfy our allocation request even
+ * if all reclaimable pages are considered then we are screwed and have
+ * to go OOM.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+ unsigned long reclaimable;
+
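+ /*
+ * Back the reclaimable estimate off in proportion to
+ * no_progress_loops / MAX_RECLAIM_RETRIES: e.g. after 8 of 16
+ * fruitless rounds only half of the reclaimable pages still
+ * count, after 16 rounds only the free pages do.
+ */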
+ available = reclaimable = zone_reclaimable_pages(zone);
+ available -= DIV_ROUND_UP(no_progress_loops * available,
+ MAX_RECLAIM_RETRIES);
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ /*
+ * Would the allocation succeed if we reclaimed the whole
+ * available?
+ */
+ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ ac_classzone_idx(ac), alloc_flags, available)) {
+ /*
+ * If we didn't make any progress and have a lot of
+ * dirty + writeback pages then we should wait for
+ * an IO to complete to slow down the reclaim and
+ * prevent a premature OOM.
+ */
+ if (!did_some_progress) {
+ unsigned long writeback;
+ unsigned long dirty;
+
+ writeback = zone_page_state_snapshot(zone,
+ NR_WRITEBACK);
+ dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+ if (2*(writeback + dirty) > reclaimable) {
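+ /* more than half of the reclaimable pages are dirty or under writeback */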
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ return true;
+ }
+ }
+
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+
+ return true;
+ }
+ }
+
+ return false;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
@@ -3384,11 +3560,11 @@
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
struct page *page = NULL;
unsigned int alloc_flags;
- unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
- int contended_compaction = COMPACT_CONTENDED_NONE;
+ enum compact_result compact_result;
+ int compaction_retries = 0;
+ int no_progress_loops = 0;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3475,8 +3651,7 @@
*/
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
@@ -3489,35 +3664,19 @@
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;
/*
- * In all zones where compaction was attempted (and not
- * deferred or skipped), lock contention has been detected.
- * For THP allocation we do not want to disrupt the others
- * so we fallback to base pages instead.
+ * Compaction is contended so back off rather than cause
+ * excessive stalls.
*/
- if (contended_compaction == COMPACT_CONTENDED_LOCK)
- goto nopage;
-
- /*
- * If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
- */
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
+ if (compact_result == COMPACT_CONTENDED)
goto nopage;
}
- /*
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse.
- */
- if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_SYNC_LIGHT;
+ if (order && compaction_made_progress(compact_result))
+ compaction_retries++;
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
@@ -3529,14 +3688,38 @@
if (gfp_mask & __GFP_NORETRY)
goto noretry;
- /* Keep reclaiming pages as long as there is reasonable progress */
- pages_reclaimed += did_some_progress;
- if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
- ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zoneref->zone, BLK_RW_ASYNC, HZ/50);
+ /*
+ * Do not retry costly high order allocations unless they are
+ * __GFP_REPEAT
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+ goto noretry;
+
+ /*
+ * Costly allocations might have made some progress but this doesn't mean
+ * their order will become available due to high fragmentation so
+ * always increment the no-progress counter for them.
+ */
+ if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+ no_progress_loops = 0;
+ else
+ no_progress_loops++;
+
+ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, no_progress_loops))
goto retry;
- }
+
+ /*
+ * It doesn't make any sense to retry the compaction if the order-0
+ * reclaim is not able to make any progress because the current
+ * implementation of compaction depends on a sufficient amount
+ * of free memory (see __compaction_suitable).
+ */
+ if (did_some_progress > 0 &&
+ should_compact_retry(ac, order, alloc_flags,
+ compact_result, &migration_mode,
+ compaction_retries))
+ goto retry;
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -3544,19 +3727,28 @@
goto got_pg;
/* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ no_progress_loops = 0;
goto retry;
+ }
noretry:
/*
- * High-order allocations do not necessarily loop after
- * direct reclaim and reclaim/compaction depends on compaction
- * being called after reclaim so call directly if necessary
+ * High-order allocations do not necessarily loop after direct reclaim
+ * and reclaim/compaction depends on compaction being called after
+ * reclaim so call directly if necessary.
+ * It can become very expensive to allocate transparent hugepages at
+ * fault, so use asynchronous memory compaction for THP unless it is
+ * khugepaged trying to collapse. All other requests should tolerate
+ * at least light sync migration.
*/
+ if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ else
+ migration_mode = MIGRATE_SYNC_LIGHT;
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
@@ -6671,49 +6863,6 @@
}
/*
- * The inactive anon list should be small enough that the VM never has to
- * do too much work, but large enough that each inactive page has a chance
- * to be referenced again before it is swapped out.
- *
- * The inactive_anon ratio is the target ratio of ACTIVE_ANON to
- * INACTIVE_ANON pages on this zone's LRU, maintained by the
- * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of
- * the anonymous pages are kept on the inactive list.
- *
- * total target max
- * memory ratio inactive anon
- * -------------------------------------
- * 10MB 1 5MB
- * 100MB 1 50MB
- * 1GB 3 250MB
- * 10GB 10 0.9GB
- * 100GB 31 3GB
- * 1TB 101 10GB
- * 10TB 320 32GB
- */
-static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
-{
- unsigned int gb, ratio;
-
- /* Zone size in gigabytes */
- gb = zone->managed_pages >> (30 - PAGE_SHIFT);
- if (gb)
- ratio = int_sqrt(10 * gb);
- else
- ratio = 1;
-
- zone->inactive_ratio = ratio;
-}
-
-static void __meminit setup_per_zone_inactive_ratio(void)
-{
- struct zone *zone;
-
- for_each_zone(zone)
- calculate_zone_inactive_ratio(zone);
-}
-
-/*
* Initialise min_free_kbytes.
*
* For small machines we want it small (128k min). For large machines
@@ -6758,7 +6907,6 @@
setup_per_zone_wmarks();
refresh_zone_stat_thresholds();
setup_per_zone_lowmem_reserve();
- setup_per_zone_inactive_ratio();
return 0;
}
core_initcall(init_per_zone_wmark_min)
diff --git a/mm/page_poison.c b/mm/page_poison.c
index 479e7ea..1eae5fa 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -13,13 +13,7 @@
{
if (!buf)
return -EINVAL;
-
- if (strcmp(buf, "on") == 0)
- want_page_poisoning = true;
- else if (strcmp(buf, "off") == 0)
- want_page_poisoning = false;
-
- return 0;
+ return strtobool(buf, &want_page_poisoning);
}
early_param("page_poison", early_page_poison_param);
diff --git a/mm/slab.c b/mm/slab.c
index c11bf50..cc8bbc1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3547,9 +3547,17 @@
static inline void __cache_free(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
- struct array_cache *ac = cpu_cache_get(cachep);
+ /* Put the object into the quarantine, don't touch it for now. */
+ if (kasan_slab_free(cachep, objp))
+ return;
- kasan_slab_free(cachep, objp);
+ ___cache_free(cachep, objp, caller);
+}
+
+void ___cache_free(struct kmem_cache *cachep, void *objp,
+ unsigned long caller)
+{
+ struct array_cache *ac = cpu_cache_get(cachep);
check_irq_off();
kmemleak_free_recursive(objp, cachep->flags);
@@ -4493,7 +4501,7 @@
/* We assume that ksize callers could use the whole allocated area,
* so we need to unpoison this area.
*/
- kasan_krealloc(objp, size, GFP_NOWAIT);
+ kasan_unpoison_shadow(objp, size);
return size;
}
diff --git a/mm/slab.h b/mm/slab.h
index 5969769..dedb1a9 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -462,4 +462,6 @@
void slab_stop(struct seq_file *m, void *p);
int memcg_slab_show(struct seq_file *m, void *p);
+void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
+
#endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3239bfd..a65dad7 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -715,6 +715,7 @@
get_online_cpus();
get_online_mems();
+ kasan_cache_destroy(s);
mutex_lock(&slab_mutex);
s->refcount--;
@@ -753,6 +754,7 @@
get_online_cpus();
get_online_mems();
+ kasan_cache_shrink(cachep);
ret = __kmem_cache_shrink(cachep, false);
put_online_mems();
put_online_cpus();
diff --git a/mm/slub.c b/mm/slub.c
index cf1faa4..825ff45 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3635,8 +3635,9 @@
{
size_t size = __ksize(object);
/* We assume that ksize callers could use whole allocated area,
- so we need unpoison this area. */
- kasan_krealloc(object, size, GFP_NOWAIT);
+ * so we need to unpoison this area.
+ */
+ kasan_unpoison_shadow(object, size);
return size;
}
EXPORT_SYMBOL(ksize);
diff --git a/mm/swap.c b/mm/swap.c
index 03aacbc..9591614 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,9 @@
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+#ifdef CONFIG_SMP
+static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
+#endif
/*
* This path almost never happens for VM activity - pages are normally
@@ -274,8 +277,6 @@
}
#ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
-
static void activate_page_drain(int cpu)
{
struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae7d20b..6e32918 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -274,13 +274,12 @@
/*** Global kva allocator ***/
-#define VM_LAZY_FREE 0x01
-#define VM_LAZY_FREEING 0x02
#define VM_VM_AREA 0x04
static DEFINE_SPINLOCK(vmap_area_lock);
/* Export for kexec only */
LIST_HEAD(vmap_area_list);
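+/* Lazily freed vmap areas are queued here and drained by the purge path */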
+static LLIST_HEAD(vmap_purge_list);
static struct rb_root vmap_area_root = RB_ROOT;
/* The vmap cache globals are protected by vmap_area_lock */
@@ -601,7 +600,7 @@
int sync, int force_flush)
{
static DEFINE_SPINLOCK(purge_lock);
- LIST_HEAD(valist);
+ struct llist_node *valist;
struct vmap_area *va;
struct vmap_area *n_va;
int nr = 0;
@@ -620,20 +619,14 @@
if (sync)
purge_fragmented_blocks_allcpus();
- rcu_read_lock();
- list_for_each_entry_rcu(va, &vmap_area_list, list) {
- if (va->flags & VM_LAZY_FREE) {
- if (va->va_start < *start)
- *start = va->va_start;
- if (va->va_end > *end)
- *end = va->va_end;
- nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
- list_add_tail(&va->purge_list, &valist);
- va->flags |= VM_LAZY_FREEING;
- va->flags &= ~VM_LAZY_FREE;
- }
+ valist = llist_del_all(&vmap_purge_list);
+ llist_for_each_entry(va, valist, purge_list) {
+ if (va->va_start < *start)
+ *start = va->va_start;
+ if (va->va_end > *end)
+ *end = va->va_end;
+ nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
}
- rcu_read_unlock();
if (nr)
atomic_sub(nr, &vmap_lazy_nr);
@@ -643,7 +636,7 @@
if (nr) {
spin_lock(&vmap_area_lock);
- list_for_each_entry_safe(va, n_va, &valist, purge_list)
+ llist_for_each_entry_safe(va, n_va, valist, purge_list)
__free_vmap_area(va);
spin_unlock(&vmap_area_lock);
}
@@ -678,9 +671,15 @@
*/
static void free_vmap_area_noflush(struct vmap_area *va)
{
- va->flags |= VM_LAZY_FREE;
- atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
- if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
+ int nr_lazy;
+
+ nr_lazy = atomic_add_return((va->va_end - va->va_start) >> PAGE_SHIFT,
+ &vmap_lazy_nr);
+
+ /* After this point, we may free va at any time */
+ llist_add(&va->purge_list, &vmap_purge_list);
+
+ if (unlikely(nr_lazy > lazy_max_pages()))
try_purge_vmap_area_lazy();
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcfdfc1..c4a2f451 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,7 +191,7 @@
}
#endif
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
{
unsigned long nr;
@@ -1862,83 +1862,63 @@
free_hot_cold_page_list(&l_hold, true);
}
-#ifdef CONFIG_SWAP
-static bool inactive_anon_is_low_global(struct zone *zone)
-{
- unsigned long active, inactive;
-
- active = zone_page_state(zone, NR_ACTIVE_ANON);
- inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-
- return inactive * zone->inactive_ratio < active;
-}
-
-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @lruvec: LRU vector to check
+/*
+ * The inactive anon list should be small enough that the VM never has
+ * to do too much work.
*
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
+ * The inactive file list should be small enough to leave most memory
+ * to the established workingset on the scan-resistant active list,
+ * but large enough to avoid thrashing the aggregate readahead window.
+ *
+ * Both inactive lists should also be large enough that each inactive
+ * page has a chance to be referenced again before it is reclaimed.
+ *
+ * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
+ * on this LRU, maintained by the pageout code. An inactive_ratio
+ * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
+ *
+ * total target max
+ * memory ratio inactive
+ * -------------------------------------
+ * 10MB 1 5MB
+ * 100MB 1 50MB
+ * 1GB 3 250MB
+ * 10GB 10 0.9GB
+ * 100GB 31 3GB
+ * 1TB 101 10GB
+ * 10TB 320 32GB
*/
-static bool inactive_anon_is_low(struct lruvec *lruvec)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
{
+ unsigned long inactive_ratio;
+ unsigned long inactive;
+ unsigned long active;
+ unsigned long gb;
+
/*
* If we don't have swap space, anonymous page deactivation
* is pointless.
*/
- if (!total_swap_pages)
+ if (!file && !total_swap_pages)
return false;
- if (!mem_cgroup_disabled())
- return mem_cgroup_inactive_anon_is_low(lruvec);
+ inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+ active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
- return inactive_anon_is_low_global(lruvec_zone(lruvec));
-}
-#else
-static inline bool inactive_anon_is_low(struct lruvec *lruvec)
-{
- return false;
-}
-#endif
-
-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @lruvec: LRU vector to check
- *
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
- *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
- *
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
- */
-static bool inactive_file_is_low(struct lruvec *lruvec)
-{
- unsigned long inactive;
- unsigned long active;
-
- inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
- active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
-
- return active > inactive;
-}
-
-static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
-{
- if (is_file_lru(lru))
- return inactive_file_is_low(lruvec);
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
else
- return inactive_anon_is_low(lruvec);
+ inactive_ratio = 1;
+
+ return inactive * inactive_ratio < active;
}
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, lru))
+ if (inactive_list_is_low(lruvec, is_file_lru(lru)))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
@@ -2059,7 +2039,7 @@
* lruvec even if it has plenty of old anonymous pages unless the
* system is under heavy pressure.
*/
- if (!inactive_file_is_low(lruvec) &&
+ if (!inactive_list_is_low(lruvec, true) &&
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
@@ -2301,7 +2281,7 @@
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (inactive_anon_is_low(lruvec))
+ if (inactive_list_is_low(lruvec, false))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
@@ -2479,7 +2459,7 @@
* Returns true if compaction should go ahead for a high-order request, or
* the high-order allocation would succeed without compaction.
*/
-static inline bool compaction_ready(struct zone *zone, int order)
+static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
{
unsigned long balance_gap, watermark;
bool watermark_ok;
@@ -2493,7 +2473,7 @@
balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
- watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
/*
* If compaction is deferred, reclaim up to a point where
@@ -2506,7 +2486,7 @@
* If compaction is not ready to start and allocation is not likely
* to succeed without it, then keep reclaiming.
*/
- if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
+ if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
return false;
return watermark_ok;
@@ -2527,10 +2507,8 @@
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
@@ -2538,7 +2516,6 @@
unsigned long nr_soft_scanned;
gfp_t orig_mask;
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
- bool reclaimable = false;
/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2586,7 +2563,7 @@
if (IS_ENABLED(CONFIG_COMPACTION) &&
sc->order > PAGE_ALLOC_COSTLY_ORDER &&
zonelist_zone_idx(z) <= requested_highidx &&
- compaction_ready(zone, sc->order)) {
+ compaction_ready(zone, sc->order, requested_highidx)) {
sc->compaction_ready = true;
continue;
}
@@ -2603,17 +2580,10 @@
&nr_soft_scanned);
sc->nr_reclaimed += nr_soft_reclaimed;
sc->nr_scanned += nr_soft_scanned;
- if (nr_soft_reclaimed)
- reclaimable = true;
/* need some check for avoid more shrink_zone() */
}
- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
- reclaimable = true;
-
- if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
- reclaimable = true;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
}
/*
@@ -2621,8 +2591,6 @@
* promoted it to __GFP_HIGHMEM.
*/
sc->gfp_mask = orig_mask;
-
- return reclaimable;
}
/*
@@ -2647,7 +2615,6 @@
int initial_priority = sc->priority;
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
- bool zones_reclaimable;
retry:
delayacct_freepages_start();
@@ -2658,7 +2625,7 @@
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ shrink_zones(zonelist, sc);
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2705,10 +2672,6 @@
goto retry;
}
- /* Any of the zones still reclaimable? Don't OOM. */
- if (zones_reclaimable)
- return 1;
-
return 0;
}
@@ -2962,7 +2925,7 @@
do {
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
- if (inactive_anon_is_low(lruvec))
+ if (inactive_list_is_low(lruvec, false))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5b72a8a..77e42ef 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1352,7 +1352,6 @@
static struct workqueue_struct *vmstat_wq;
static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
int sysctl_stat_interval __read_mostly = HZ;
-static cpumask_var_t cpu_stat_off;
#ifdef CONFIG_PROC_FS
static void refresh_vm_stats(struct work_struct *work)
@@ -1421,24 +1420,10 @@
* Counters were updated so we expect more updates
* to occur in the future. Keep on running the
* update worker thread.
- * If we were marked on cpu_stat_off clear the flag
- * so that vmstat_shepherd doesn't schedule us again.
*/
- if (!cpumask_test_and_clear_cpu(smp_processor_id(),
- cpu_stat_off)) {
- queue_delayed_work_on(smp_processor_id(), vmstat_wq,
+ queue_delayed_work_on(smp_processor_id(), vmstat_wq,
this_cpu_ptr(&vmstat_work),
round_jiffies_relative(sysctl_stat_interval));
- }
- } else {
- /*
- * We did not update any counters so the app may be in
- * a mode where it does not cause counter updates.
- * We may be uselessly running vmstat_update.
- * Defer the checking for differentials to the
- * shepherd thread on a different processor.
- */
- cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
}
}
@@ -1470,16 +1455,17 @@
return false;
}
+/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
void quiet_vmstat(void)
{
if (system_state != SYSTEM_RUNNING)
return;
- /*
- * If we are already in hands of the shepherd then there
- * is nothing for us to do here.
- */
- if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
return;
if (!need_update(smp_processor_id()))
@@ -1494,7 +1480,6 @@
refresh_cpu_vm_stats(false);
}
-
/*
* Shepherd worker thread that checks the
* differentials of processors that have their worker
@@ -1511,20 +1496,11 @@
get_online_cpus();
/* Check processors whose vmstat worker threads have been disabled */
- for_each_cpu(cpu, cpu_stat_off) {
+ for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
- if (need_update(cpu)) {
- if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
- queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
- } else {
- /*
- * Cancel the work if quiet_vmstat has put this
- * cpu on cpu_stat_off because the work item might
- * be still scheduled
- */
- cancel_delayed_work(dw);
- }
+ if (!delayed_work_pending(dw) && need_update(cpu))
+ queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
}
put_online_cpus();
@@ -1540,10 +1516,6 @@
INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
vmstat_update);
- if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL))
- BUG();
- cpumask_copy(cpu_stat_off, cpu_online_mask);
-
vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
schedule_delayed_work(&shepherd,
round_jiffies_relative(sysctl_stat_interval));
@@ -1578,16 +1550,13 @@
case CPU_ONLINE_FROZEN:
refresh_zone_stat_thresholds();
node_set_state(cpu_to_node(cpu), N_CPU);
- cpumask_set_cpu(cpu, cpu_stat_off);
break;
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
- cpumask_clear_cpu(cpu, cpu_stat_off);
break;
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
- cpumask_set_cpu(cpu, cpu_stat_off);
break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
diff --git a/mm/z3fold.c b/mm/z3fold.c
new file mode 100644
index 0000000..34917d5
--- /dev/null
+++ b/mm/z3fold.c
@@ -0,0 +1,792 @@
+/*
+ * z3fold.c
+ *
+ * Author: Vitaly Wool <vitaly.wool@konsulko.com>
+ * Copyright (C) 2016, Sony Mobile Communications Inc.
+ *
+ * This implementation is based on zbud written by Seth Jennings.
+ *
+ * z3fold is a special purpose allocator for storing compressed pages. It
+ * can store up to three compressed pages per page which improves the
+ * compression ratio of zbud while retaining its main concepts (e.g. always
+ * storing an integral number of objects per page) and simplicity.
+ * It still has simple and deterministic reclaim properties that make it
+ * preferable to a higher density approach (with no requirement on integral
+ * number of objects per page) when reclaim is used.
+ *
+ * As in zbud, pages are divided into "chunks". The size of the chunks is
+ * fixed at compile time and is determined by NCHUNKS_ORDER below.
+ *
+ * z3fold doesn't export any API and is meant to be used via zpool API.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/preempt.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/zpool.h>
+
+/*****************
+ * Structures
+*****************/
+/*
+ * NCHUNKS_ORDER determines the internal allocation granularity, effectively
+ * adjusting internal fragmentation. It also determines the number of
+ * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
+ * allocation granularity will be in chunks of size PAGE_SIZE/64. As one chunk
+ * in allocated page is occupied by z3fold header, NCHUNKS will be calculated
+ * to 63 which shows the max number of free chunks in z3fold page, also there
+ * will be 63 freelists per pool.
+ */
+#define NCHUNKS_ORDER 6
+
+#define CHUNK_SHIFT (PAGE_SHIFT - NCHUNKS_ORDER)
+#define CHUNK_SIZE (1 << CHUNK_SHIFT)
+#define ZHDR_SIZE_ALIGNED CHUNK_SIZE
+#define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
+
+#define BUDDY_MASK ((1 << NCHUNKS_ORDER) - 1)
+
+struct z3fold_pool;
+struct z3fold_ops {
+ int (*evict)(struct z3fold_pool *pool, unsigned long handle);
+};
+
+/**
+ * struct z3fold_pool - stores metadata for each z3fold pool
+ * @lock: protects all pool fields and first|last_chunk fields of any
+ * z3fold page in the pool
+ * @unbuddied: array of lists tracking z3fold pages that contain fewer than
+ * three buddies; the list each z3fold page is added to depends on the size of
+ * its free region.
+ * @buddied: list tracking the z3fold pages that contain 3 buddies;
+ * these z3fold pages are full
+ * @lru: list tracking the z3fold pages in LRU order by most recently
+ * added buddy.
+ * @pages_nr: number of z3fold pages in the pool.
+ * @ops: pointer to a structure of user defined operations specified at
+ * pool creation time.
+ *
+ * This structure is allocated at pool creation time and maintains metadata
+ * pertaining to a particular z3fold pool.
+ */
+struct z3fold_pool {
+ spinlock_t lock;
+ struct list_head unbuddied[NCHUNKS];
+ struct list_head buddied;
+ struct list_head lru;
+ u64 pages_nr;
+ const struct z3fold_ops *ops;
+ struct zpool *zpool;
+ const struct zpool_ops *zpool_ops;
+};
+
+enum buddy {
+ HEADLESS = 0,
+ FIRST,
+ MIDDLE,
+ LAST,
+ BUDDIES_MAX
+};
+
+/*
+ * struct z3fold_header - z3fold page metadata occupying the first chunk of each
+ * z3fold page, except for HEADLESS pages
+ * @buddy: links the z3fold page into the relevant list in the pool
+ * @first_chunks: the size of the first buddy in chunks, 0 if free
+ * @middle_chunks: the size of the middle buddy in chunks, 0 if free
+ * @last_chunks: the size of the last buddy in chunks, 0 if free
+ * @first_num: the starting number (for the first handle)
+ */
+struct z3fold_header {
+ struct list_head buddy;
+ unsigned short first_chunks;
+ unsigned short middle_chunks;
+ unsigned short last_chunks;
+ unsigned short start_middle;
+ unsigned short first_num:NCHUNKS_ORDER;
+};
+
+/*
+ * Internal z3fold page flags
+ */
+enum z3fold_page_flags {
+ UNDER_RECLAIM = 0,
+ PAGE_HEADLESS,
+ MIDDLE_CHUNK_MAPPED,
+};
+
+/*****************
+ * Helpers
+*****************/
+
+/* Converts an allocation size in bytes to size in z3fold chunks */
+static int size_to_chunks(size_t size)
+{
+ return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
+}
+
+#define for_each_unbuddied_list(_iter, _begin) \
+ for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++)
+
+/* Initializes the z3fold header of a newly allocated z3fold page */
+static struct z3fold_header *init_z3fold_page(struct page *page)
+{
+ struct z3fold_header *zhdr = page_address(page);
+
+ INIT_LIST_HEAD(&page->lru);
+ clear_bit(UNDER_RECLAIM, &page->private);
+ clear_bit(PAGE_HEADLESS, &page->private);
+ clear_bit(MIDDLE_CHUNK_MAPPED, &page->private);
+
+ zhdr->first_chunks = 0;
+ zhdr->middle_chunks = 0;
+ zhdr->last_chunks = 0;
+ zhdr->first_num = 0;
+ zhdr->start_middle = 0;
+ INIT_LIST_HEAD(&zhdr->buddy);
+ return zhdr;
+}
+
+/* Resets the struct page fields and frees the page */
+static void free_z3fold_page(struct z3fold_header *zhdr)
+{
+ __free_page(virt_to_page(zhdr));
+}
+
+/*
+ * Encodes the handle of a particular buddy within a z3fold page
+ * Pool lock should be held as this function accesses first_num
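+ *
+ * The z3fold header is page-aligned, so the low bits of its address are
+ * zero and can carry (bud + first_num) & BUDDY_MASK; masking the handle
+ * with PAGE_MASK recovers the header and (handle - first_num) & BUDDY_MASK
+ * recovers the buddy number.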
+ */
+static unsigned long encode_handle(struct z3fold_header *zhdr, enum buddy bud)
+{
+ unsigned long handle;
+
+ handle = (unsigned long)zhdr;
+ if (bud != HEADLESS)
+ handle += (bud + zhdr->first_num) & BUDDY_MASK;
+ return handle;
+}
+
+/* Returns the z3fold page where a given handle is stored */
+static struct z3fold_header *handle_to_z3fold_header(unsigned long handle)
+{
+ return (struct z3fold_header *)(handle & PAGE_MASK);
+}
+
+/* Returns buddy number */
+static enum buddy handle_to_buddy(unsigned long handle)
+{
+ struct z3fold_header *zhdr = handle_to_z3fold_header(handle);
+ return (handle - zhdr->first_num) & BUDDY_MASK;
+}
+
+/*
+ * Returns the number of free chunks in a z3fold page.
+ * NB: can't be used with HEADLESS pages.
+ */
+static int num_free_chunks(struct z3fold_header *zhdr)
+{
+ int nfree;
+ /*
+ * If there is a middle object, pick up the bigger free space
+ * either before or after it. Otherwise just subtract the number
+ * of chunks occupied by the first and the last objects.
+ */
+ if (zhdr->middle_chunks != 0) {
+ int nfree_before = zhdr->first_chunks ?
+ 0 : zhdr->start_middle - 1;
+ int nfree_after = zhdr->last_chunks ?
+ 0 : NCHUNKS - zhdr->start_middle - zhdr->middle_chunks;
+ nfree = max(nfree_before, nfree_after);
+ } else
+ nfree = NCHUNKS - zhdr->first_chunks - zhdr->last_chunks;
+ return nfree;
+}
+
+/*****************
+ * API Functions
+*****************/
+/**
+ * z3fold_create_pool() - create a new z3fold pool
+ * @gfp: gfp flags when allocating the z3fold pool structure
+ * @ops: user-defined operations for the z3fold pool
+ *
+ * Return: pointer to the new z3fold pool or NULL if the metadata allocation
+ * failed.
+ */
+static struct z3fold_pool *z3fold_create_pool(gfp_t gfp,
+ const struct z3fold_ops *ops)
+{
+ struct z3fold_pool *pool;
+ int i;
+
+ pool = kzalloc(sizeof(struct z3fold_pool), gfp);
+ if (!pool)
+ return NULL;
+ spin_lock_init(&pool->lock);
+ for_each_unbuddied_list(i, 0)
+ INIT_LIST_HEAD(&pool->unbuddied[i]);
+ INIT_LIST_HEAD(&pool->buddied);
+ INIT_LIST_HEAD(&pool->lru);
+ pool->pages_nr = 0;
+ pool->ops = ops;
+ return pool;
+}
+
+/**
+ * z3fold_destroy_pool() - destroys an existing z3fold pool
+ * @pool: the z3fold pool to be destroyed
+ *
+ * The pool should be emptied before this function is called.
+ */
+static void z3fold_destroy_pool(struct z3fold_pool *pool)
+{
+ kfree(pool);
+}
+
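+/*
+ * If the middle buddy is the only one in use and is not currently mapped,
+ * move it to the first slot so the page again has a single contiguous free
+ * region. Bumping first_num keeps the existing handle valid: the object
+ * that was encoded as MIDDLE now decodes as FIRST.
+ */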
+/* Has to be called with lock held */
+static int z3fold_compact_page(struct z3fold_header *zhdr)
+{
+ struct page *page = virt_to_page(zhdr);
+ void *beg = zhdr;
+
+ if (!test_bit(MIDDLE_CHUNK_MAPPED, &page->private) &&
+ zhdr->middle_chunks != 0 &&
+ zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
+ memmove(beg + ZHDR_SIZE_ALIGNED,
+ beg + (zhdr->start_middle << CHUNK_SHIFT),
+ zhdr->middle_chunks << CHUNK_SHIFT);
+ zhdr->first_chunks = zhdr->middle_chunks;
+ zhdr->middle_chunks = 0;
+ zhdr->start_middle = 0;
+ zhdr->first_num++;
+ return 1;
+ }
+ return 0;
+}
+
+/**
+ * z3fold_alloc() - allocates a region of a given size
+ * @pool: z3fold pool from which to allocate
+ * @size: size in bytes of the desired allocation
+ * @gfp: gfp flags used if the pool needs to grow
+ * @handle: handle of the new allocation
+ *
+ * This function will attempt to find a free region in the pool large enough to
+ * satisfy the allocation request. A search of the unbuddied lists is
+ * performed first. If no suitable free region is found, then a new page is
+ * allocated and added to the pool to satisfy the request.
+ *
+ * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used
+ * as z3fold pool pages.
+ *
+ * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
+ * gfp arguments are invalid, -ENOSPC if the size is larger than a page, or
+ * -ENOMEM if the pool was unable to allocate a new page.
+ */
+static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp,
+ unsigned long *handle)
+{
+ int chunks = 0, i, freechunks;
+ struct z3fold_header *zhdr = NULL;
+ enum buddy bud;
+ struct page *page;
+
+ if (!size || (gfp & __GFP_HIGHMEM))
+ return -EINVAL;
+
+ if (size > PAGE_SIZE)
+ return -ENOSPC;
+
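+ /*
+ * Requests too big to share a page with the header and at least one
+ * more chunk are stored headless, taking a whole page of their own.
+ */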
+ if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE)
+ bud = HEADLESS;
+ else {
+ chunks = size_to_chunks(size);
+ spin_lock(&pool->lock);
+
+ /* First, try to find an unbuddied z3fold page. */
+ zhdr = NULL;
+ for_each_unbuddied_list(i, chunks) {
+ if (!list_empty(&pool->unbuddied[i])) {
+ zhdr = list_first_entry(&pool->unbuddied[i],
+ struct z3fold_header, buddy);
+ page = virt_to_page(zhdr);
+ if (zhdr->first_chunks == 0) {
+ if (zhdr->middle_chunks != 0 &&
+ chunks >= zhdr->start_middle)
+ bud = LAST;
+ else
+ bud = FIRST;
+ } else if (zhdr->last_chunks == 0)
+ bud = LAST;
+ else if (zhdr->middle_chunks == 0)
+ bud = MIDDLE;
+ else {
+ pr_err("No free chunks in unbuddied\n");
+ WARN_ON(1);
+ continue;
+ }
+ list_del(&zhdr->buddy);
+ goto found;
+ }
+ }
+ bud = FIRST;
+ spin_unlock(&pool->lock);
+ }
+
+ /* Couldn't find unbuddied z3fold page, create new one */
+ page = alloc_page(gfp);
+ if (!page)
+ return -ENOMEM;
+ spin_lock(&pool->lock);
+ pool->pages_nr++;
+ zhdr = init_z3fold_page(page);
+
+ if (bud == HEADLESS) {
+ set_bit(PAGE_HEADLESS, &page->private);
+ goto headless;
+ }
+
+found:
+ if (bud == FIRST)
+ zhdr->first_chunks = chunks;
+ else if (bud == LAST)
+ zhdr->last_chunks = chunks;
+ else {
+ zhdr->middle_chunks = chunks;
+ zhdr->start_middle = zhdr->first_chunks + 1;
+ }
+
+ if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
+ zhdr->middle_chunks == 0) {
+ /* Add to unbuddied list */
+ freechunks = num_free_chunks(zhdr);
+ list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ } else {
+ /* Add to buddied list */
+ list_add(&zhdr->buddy, &pool->buddied);
+ }
+
+headless:
+ /* Add/move z3fold page to beginning of LRU */
+ if (!list_empty(&page->lru))
+ list_del(&page->lru);
+
+ list_add(&page->lru, &pool->lru);
+
+ *handle = encode_handle(zhdr, bud);
+ spin_unlock(&pool->lock);
+
+ return 0;
+}
+
+/**
+ * z3fold_free() - frees the allocation associated with the given handle
+ * @pool: pool in which the allocation resided
+ * @handle: handle associated with the allocation returned by z3fold_alloc()
+ *
+ * In the case that the z3fold page in which the allocation resides is under
+ * reclaim, as indicated by the UNDER_RECLAIM flag being set, this function
+ * only sets the first|middle|last_chunks to 0. The page is actually freed
+ * once all buddies are evicted (see z3fold_reclaim_page() below).
+ */
+static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
+{
+ struct z3fold_header *zhdr;
+ int freechunks;
+ struct page *page;
+ enum buddy bud;
+
+ spin_lock(&pool->lock);
+ zhdr = handle_to_z3fold_header(handle);
+ page = virt_to_page(zhdr);
+
+ if (test_bit(PAGE_HEADLESS, &page->private)) {
+ /* HEADLESS page stored */
+ bud = HEADLESS;
+ } else {
+ bud = (handle - zhdr->first_num) & BUDDY_MASK;
+
+ switch (bud) {
+ case FIRST:
+ zhdr->first_chunks = 0;
+ break;
+ case MIDDLE:
+ zhdr->middle_chunks = 0;
+ zhdr->start_middle = 0;
+ break;
+ case LAST:
+ zhdr->last_chunks = 0;
+ break;
+ default:
+ pr_err("%s: unknown bud %d\n", __func__, bud);
+ WARN_ON(1);
+ spin_unlock(&pool->lock);
+ return;
+ }
+ }
+
+ if (test_bit(UNDER_RECLAIM, &page->private)) {
+ /* z3fold page is under reclaim, reclaim will free */
+ spin_unlock(&pool->lock);
+ return;
+ }
+
+ if (bud != HEADLESS) {
+ /* Remove from existing buddy list */
+ list_del(&zhdr->buddy);
+ }
+
+ if (bud == HEADLESS ||
+ (zhdr->first_chunks == 0 && zhdr->middle_chunks == 0 &&
+ zhdr->last_chunks == 0)) {
+ /* z3fold page is empty, free */
+ list_del(&page->lru);
+ clear_bit(PAGE_HEADLESS, &page->private);
+ free_z3fold_page(zhdr);
+ pool->pages_nr--;
+ } else {
+ z3fold_compact_page(zhdr);
+ /* Add to the unbuddied list */
+ freechunks = num_free_chunks(zhdr);
+ list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ }
+
+ spin_unlock(&pool->lock);
+}
+
+/**
+ * z3fold_reclaim_page() - evicts allocations from a pool page and frees it
+ * @pool: pool from which a page will attempt to be evicted
+ * @retries: number of pages on the LRU list for which eviction will
+ * be attempted before failing
+ *
+ * z3fold reclaim is different from normal system reclaim in that it is done
+ * from the bottom, up. This is because only the bottom layer, z3fold, has
+ * information on how the allocations are organized within each z3fold page.
+ * This has the potential to create interesting locking situations between
+ * z3fold and the user, however.
+ *
+ * To avoid these, this is how z3fold_reclaim_page() should be called:
+ *
+ * The user detects a page should be reclaimed and calls z3fold_reclaim_page().
+ * z3fold_reclaim_page() will remove a z3fold page from the pool LRU list and
+ * call the user-defined eviction handler with the pool and handle as
+ * arguments.
+ *
+ * If the handle cannot be evicted, the eviction handler should return
+ * non-zero. z3fold_reclaim_page() will add the z3fold page back to the
+ * appropriate list and try the next z3fold page on the LRU up to
+ * a user-defined number of retries.
+ *
+ * If the handle is successfully evicted, the eviction handler should
+ * return 0 _and_ should have called z3fold_free() on the handle. z3fold_free()
+ * contains logic to delay freeing the page if the page is under reclaim,
+ * as indicated by the UNDER_RECLAIM flag being set on the underlying page.
+ *
+ * If all buddies in the z3fold page are successfully evicted, then the
+ * z3fold page can be freed.
+ *
+ * Returns: 0 if a page is successfully freed, -EINVAL if there are no
+ * pages to evict or no eviction handler is registered, or -EAGAIN if
+ * the retry limit was hit.
+ */
+static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries)
+{
+ int i, ret = 0, freechunks;
+ struct z3fold_header *zhdr;
+ struct page *page;
+ unsigned long first_handle = 0, middle_handle = 0, last_handle = 0;
+
+ spin_lock(&pool->lock);
+ if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
+ retries == 0) {
+ spin_unlock(&pool->lock);
+ return -EINVAL;
+ }
+ for (i = 0; i < retries; i++) {
+ page = list_last_entry(&pool->lru, struct page, lru);
+ list_del(&page->lru);
+
+ /* Protect z3fold page against free */
+ set_bit(UNDER_RECLAIM, &page->private);
+ zhdr = page_address(page);
+ if (!test_bit(PAGE_HEADLESS, &page->private)) {
+ list_del(&zhdr->buddy);
+ /*
+ * We need to encode the handles before unlocking, since
+ * we can race with free, which will set
+ * (first|middle|last)_chunks to 0
+ */
+ first_handle = 0;
+ last_handle = 0;
+ middle_handle = 0;
+ if (zhdr->first_chunks)
+ first_handle = encode_handle(zhdr, FIRST);
+ if (zhdr->middle_chunks)
+ middle_handle = encode_handle(zhdr, MIDDLE);
+ if (zhdr->last_chunks)
+ last_handle = encode_handle(zhdr, LAST);
+ } else {
+ first_handle = encode_handle(zhdr, HEADLESS);
+ last_handle = middle_handle = 0;
+ }
+
+ spin_unlock(&pool->lock);
+
+ /* Issue the eviction callback(s) */
+ if (middle_handle) {
+ ret = pool->ops->evict(pool, middle_handle);
+ if (ret)
+ goto next;
+ }
+ if (first_handle) {
+ ret = pool->ops->evict(pool, first_handle);
+ if (ret)
+ goto next;
+ }
+ if (last_handle) {
+ ret = pool->ops->evict(pool, last_handle);
+ if (ret)
+ goto next;
+ }
+next:
+ spin_lock(&pool->lock);
+ clear_bit(UNDER_RECLAIM, &page->private);
+ if ((test_bit(PAGE_HEADLESS, &page->private) && ret == 0) ||
+ (zhdr->first_chunks == 0 && zhdr->last_chunks == 0 &&
+ zhdr->middle_chunks == 0)) {
+ /*
+ * All buddies are now free, free the z3fold page and
+ * return success.
+ */
+ clear_bit(PAGE_HEADLESS, &page->private);
+ free_z3fold_page(zhdr);
+ pool->pages_nr--;
+ spin_unlock(&pool->lock);
+ return 0;
+ } else if (zhdr->first_chunks != 0 &&
+ zhdr->last_chunks != 0 && zhdr->middle_chunks != 0) {
+ /* Full, add to buddied list */
+ list_add(&zhdr->buddy, &pool->buddied);
+ } else if (!test_bit(PAGE_HEADLESS, &page->private)) {
+ z3fold_compact_page(zhdr);
+ /* add to unbuddied list */
+ freechunks = num_free_chunks(zhdr);
+ list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
+ }
+
+ /* add to beginning of LRU */
+ list_add(&page->lru, &pool->lru);
+ }
+ spin_unlock(&pool->lock);
+ return -EAGAIN;
+}
+
+/**
+ * z3fold_map() - maps the allocation associated with the given handle
+ * @pool: pool in which the allocation resides
+ * @handle: handle associated with the allocation to be mapped
+ *
+ * Extracts the buddy number from handle and constructs the pointer to the
+ * correct starting chunk within the page.
+ *
+ * Returns: a pointer to the mapped allocation
+ */
+static void *z3fold_map(struct z3fold_pool *pool, unsigned long handle)
+{
+ struct z3fold_header *zhdr;
+ struct page *page;
+ void *addr;
+ enum buddy buddy;
+
+ spin_lock(&pool->lock);
+ zhdr = handle_to_z3fold_header(handle);
+ addr = zhdr;
+ page = virt_to_page(zhdr);
+
+ if (test_bit(PAGE_HEADLESS, &page->private))
+ goto out;
+
+ buddy = handle_to_buddy(handle);
+ switch (buddy) {
+ case FIRST:
+ addr += ZHDR_SIZE_ALIGNED;
+ break;
+ case MIDDLE:
+ addr += zhdr->start_middle << CHUNK_SHIFT;
+ set_bit(MIDDLE_CHUNK_MAPPED, &page->private);
+ break;
+ case LAST:
+ addr += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT);
+ break;
+ default:
+ pr_err("unknown buddy id %d\n", buddy);
+ WARN_ON(1);
+ addr = NULL;
+ break;
+ }
+out:
+ spin_unlock(&pool->lock);
+ return addr;
+}
+
+/**
+ * z3fold_unmap() - unmaps the allocation associated with the given handle
+ * @pool: pool in which the allocation resides
+ * @handle: handle associated with the allocation to be unmapped
+ */
+static void z3fold_unmap(struct z3fold_pool *pool, unsigned long handle)
+{
+ struct z3fold_header *zhdr;
+ struct page *page;
+ enum buddy buddy;
+
+ spin_lock(&pool->lock);
+ zhdr = handle_to_z3fold_header(handle);
+ page = virt_to_page(zhdr);
+
+ if (test_bit(PAGE_HEADLESS, &page->private)) {
+ spin_unlock(&pool->lock);
+ return;
+ }
+
+ buddy = handle_to_buddy(handle);
+ if (buddy == MIDDLE)
+ clear_bit(MIDDLE_CHUNK_MAPPED, &page->private);
+ spin_unlock(&pool->lock);
+}
+
+/**
+ * z3fold_get_pool_size() - gets the z3fold pool size in pages
+ * @pool: pool whose size is being queried
+ *
+ * Returns: size in pages of the given pool. The pool lock need not be
+ * taken to access pages_nr.
+ */
+static u64 z3fold_get_pool_size(struct z3fold_pool *pool)
+{
+ return pool->pages_nr;
+}
+
+/*****************
+ * zpool
+ ****************/
+
+static int z3fold_zpool_evict(struct z3fold_pool *pool, unsigned long handle)
+{
+ if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
+ return pool->zpool_ops->evict(pool->zpool, handle);
+ else
+ return -ENOENT;
+}
+
+static const struct z3fold_ops z3fold_zpool_ops = {
+ .evict = z3fold_zpool_evict
+};
+
+static void *z3fold_zpool_create(const char *name, gfp_t gfp,
+ const struct zpool_ops *zpool_ops,
+ struct zpool *zpool)
+{
+ struct z3fold_pool *pool;
+
+ pool = z3fold_create_pool(gfp, zpool_ops ? &z3fold_zpool_ops : NULL);
+ if (pool) {
+ pool->zpool = zpool;
+ pool->zpool_ops = zpool_ops;
+ }
+ return pool;
+}
+
+static void z3fold_zpool_destroy(void *pool)
+{
+ z3fold_destroy_pool(pool);
+}
+
+static int z3fold_zpool_malloc(void *pool, size_t size, gfp_t gfp,
+ unsigned long *handle)
+{
+ return z3fold_alloc(pool, size, gfp, handle);
+}
+static void z3fold_zpool_free(void *pool, unsigned long handle)
+{
+ z3fold_free(pool, handle);
+}
+
+static int z3fold_zpool_shrink(void *pool, unsigned int pages,
+ unsigned int *reclaimed)
+{
+ unsigned int total = 0;
+ int ret = -EINVAL;
+
+ while (total < pages) {
+ ret = z3fold_reclaim_page(pool, 8);
+ if (ret < 0)
+ break;
+ total++;
+ }
+
+ if (reclaimed)
+ *reclaimed = total;
+
+ return ret;
+}
+
+static void *z3fold_zpool_map(void *pool, unsigned long handle,
+ enum zpool_mapmode mm)
+{
+ return z3fold_map(pool, handle);
+}
+static void z3fold_zpool_unmap(void *pool, unsigned long handle)
+{
+ z3fold_unmap(pool, handle);
+}
+
+static u64 z3fold_zpool_total_size(void *pool)
+{
+ return z3fold_get_pool_size(pool) * PAGE_SIZE;
+}
+
+static struct zpool_driver z3fold_zpool_driver = {
+ .type = "z3fold",
+ .owner = THIS_MODULE,
+ .create = z3fold_zpool_create,
+ .destroy = z3fold_zpool_destroy,
+ .malloc = z3fold_zpool_malloc,
+ .free = z3fold_zpool_free,
+ .shrink = z3fold_zpool_shrink,
+ .map = z3fold_zpool_map,
+ .unmap = z3fold_zpool_unmap,
+ .total_size = z3fold_zpool_total_size,
+};
+
+MODULE_ALIAS("zpool-z3fold");
+
+static int __init init_z3fold(void)
+{
+ /* Make sure the z3fold header will fit in one chunk */
+ BUILD_BUG_ON(sizeof(struct z3fold_header) > ZHDR_SIZE_ALIGNED);
+ zpool_register_driver(&z3fold_zpool_driver);
+
+ return 0;
+}
+
+static void __exit exit_z3fold(void)
+{
+ zpool_unregister_driver(&z3fold_zpool_driver);
+}
+
+module_init(init_z3fold);
+module_exit(exit_z3fold);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vitaly Wool <vitalywool@gmail.com>");
+MODULE_DESCRIPTION("3-Fold Allocator for Compressed Pages");
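
For context, here is a hypothetical usage sketch (not part of the patch) of driving the new "z3fold" backend through the generic zpool API, with an eviction callback that follows the protocol documented above for z3fold_reclaim_page(). my_evict(), my_zpool_ops and my_store() are invented names; a real client such as zswap keeps the pool long-lived instead of creating and destroying it per request.

#include <linux/zpool.h>
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/errno.h>

static int my_evict(struct zpool *pool, unsigned long handle)
{
	/*
	 * Write the object back (or otherwise drop it), then free the
	 * handle and return 0 so reclaim can release the z3fold page
	 * once all of its buddies are gone.
	 */
	zpool_free(pool, handle);
	return 0;
}

static const struct zpool_ops my_zpool_ops = {
	.evict = my_evict,
};

static int my_store(const void *data, size_t len)
{
	struct zpool *pool;
	unsigned long handle;
	void *dst;
	int err;

	pool = zpool_create_pool("z3fold", "example", GFP_KERNEL,
				 &my_zpool_ops);
	if (!pool)
		return -ENOMEM;

	err = zpool_malloc(pool, len, GFP_KERNEL, &handle);
	if (err) {
		/* Try to reclaim one LRU page before giving up. */
		zpool_shrink(pool, 1, NULL);
		zpool_destroy_pool(pool);
		return err;
	}

	dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);
	memcpy(dst, data, len);
	zpool_unmap_handle(pool, handle);

	zpool_free(pool, handle);
	zpool_destroy_pool(pool);
	return 0;
}
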
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index fe47fbb..72698db 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -247,7 +247,6 @@
struct size_class **size_class;
struct kmem_cache *handle_cachep;
- gfp_t flags; /* allocation flags used when growing pool */
atomic_long_t pages_allocated;
struct zs_pool_stats stats;
@@ -295,10 +294,10 @@
kmem_cache_destroy(pool->handle_cachep);
}
-static unsigned long alloc_handle(struct zs_pool *pool)
+static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp)
{
return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
- pool->flags & ~__GFP_HIGHMEM);
+ gfp & ~__GFP_HIGHMEM);
}
static void free_handle(struct zs_pool *pool, unsigned long handle)
@@ -324,7 +323,12 @@
const struct zpool_ops *zpool_ops,
struct zpool *zpool)
{
- return zs_create_pool(name, gfp);
+ /*
+ * Ignore global gfp flags: zs_malloc() may be invoked from
+ * different contexts and its caller must provide a valid
+ * gfp mask.
+ */
+ return zs_create_pool(name);
}
static void zs_zpool_destroy(void *pool)
@@ -335,7 +339,7 @@
static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
unsigned long *handle)
{
- *handle = zs_malloc(pool, size);
+ *handle = zs_malloc(pool, size, gfp);
return *handle ? 0 : -1;
}
static void zs_zpool_free(void *pool, unsigned long handle)
@@ -413,26 +417,28 @@
return PagePrivate2(page);
}
-static void get_zspage_mapping(struct page *page, unsigned int *class_idx,
+static void get_zspage_mapping(struct page *first_page,
+ unsigned int *class_idx,
enum fullness_group *fullness)
{
unsigned long m;
- BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
- m = (unsigned long)page->mapping;
+ m = (unsigned long)first_page->mapping;
*fullness = m & FULLNESS_MASK;
*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
}
-static void set_zspage_mapping(struct page *page, unsigned int class_idx,
+static void set_zspage_mapping(struct page *first_page,
+ unsigned int class_idx,
enum fullness_group fullness)
{
unsigned long m;
- BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
(fullness & FULLNESS_MASK);
- page->mapping = (struct address_space *)m;
+ first_page->mapping = (struct address_space *)m;
}
/*
@@ -567,17 +573,17 @@
.release = single_release,
};
-static int zs_pool_stat_create(const char *name, struct zs_pool *pool)
+static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
{
struct dentry *entry;
if (!zs_stat_root)
- return -ENODEV;
+ return;
entry = debugfs_create_dir(name, zs_stat_root);
if (!entry) {
pr_warn("debugfs dir <%s> creation failed\n", name);
- return -ENOMEM;
+ return;
}
pool->stat_dentry = entry;
@@ -586,10 +592,8 @@
if (!entry) {
pr_warn("%s: debugfs file entry <%s> creation failed\n",
name, "classes");
- return -ENOMEM;
+ return;
}
-
- return 0;
}
static void zs_pool_stat_destroy(struct zs_pool *pool)
@@ -607,9 +611,8 @@
{
}
-static inline int zs_pool_stat_create(const char *name, struct zs_pool *pool)
+static inline void zs_pool_stat_create(struct zs_pool *pool, const char *name)
{
- return 0;
}
static inline void zs_pool_stat_destroy(struct zs_pool *pool)
@@ -617,7 +620,6 @@
}
#endif
-
/*
* For each size class, zspages are divided into different groups
* depending on how "full" they are. This was done so that we could
@@ -625,14 +627,15 @@
* the pool (not yet implemented). This function returns fullness
* status of the given page.
*/
-static enum fullness_group get_fullness_group(struct page *page)
+static enum fullness_group get_fullness_group(struct page *first_page)
{
int inuse, max_objects;
enum fullness_group fg;
- BUG_ON(!is_first_page(page));
- inuse = page->inuse;
- max_objects = page->objects;
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+ inuse = first_page->inuse;
+ max_objects = first_page->objects;
if (inuse == 0)
fg = ZS_EMPTY;
@@ -652,12 +655,13 @@
* have. This functions inserts the given zspage into the freelist
* identified by <class, fullness_group>.
*/
-static void insert_zspage(struct page *page, struct size_class *class,
- enum fullness_group fullness)
+static void insert_zspage(struct size_class *class,
+ enum fullness_group fullness,
+ struct page *first_page)
{
struct page **head;
- BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
if (fullness >= _ZS_NR_FULLNESS_GROUPS)
return;
@@ -667,7 +671,7 @@
head = &class->fullness_list[fullness];
if (!*head) {
- *head = page;
+ *head = first_page;
return;
}
@@ -675,34 +679,35 @@
* We want to see more ZS_FULL pages and less almost
* empty/full. Put pages with higher ->inuse first.
*/
- list_add_tail(&page->lru, &(*head)->lru);
- if (page->inuse >= (*head)->inuse)
- *head = page;
+ list_add_tail(&first_page->lru, &(*head)->lru);
+ if (first_page->inuse >= (*head)->inuse)
+ *head = first_page;
}
/*
* This function removes the given zspage from the freelist identified
* by <class, fullness_group>.
*/
-static void remove_zspage(struct page *page, struct size_class *class,
- enum fullness_group fullness)
+static void remove_zspage(struct size_class *class,
+ enum fullness_group fullness,
+ struct page *first_page)
{
struct page **head;
- BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
if (fullness >= _ZS_NR_FULLNESS_GROUPS)
return;
head = &class->fullness_list[fullness];
- BUG_ON(!*head);
+ VM_BUG_ON_PAGE(!*head, first_page);
if (list_empty(&(*head)->lru))
*head = NULL;
- else if (*head == page)
+ else if (*head == first_page)
*head = (struct page *)list_entry((*head)->lru.next,
struct page, lru);
- list_del_init(&page->lru);
+ list_del_init(&first_page->lru);
zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
}
@@ -717,21 +722,19 @@
* fullness group.
*/
static enum fullness_group fix_fullness_group(struct size_class *class,
- struct page *page)
+ struct page *first_page)
{
int class_idx;
enum fullness_group currfg, newfg;
- BUG_ON(!is_first_page(page));
-
- get_zspage_mapping(page, &class_idx, &currfg);
- newfg = get_fullness_group(page);
+ get_zspage_mapping(first_page, &class_idx, &currfg);
+ newfg = get_fullness_group(first_page);
if (newfg == currfg)
goto out;
- remove_zspage(page, class, currfg);
- insert_zspage(page, class, newfg);
- set_zspage_mapping(page, class_idx, newfg);
+ remove_zspage(class, currfg, first_page);
+ insert_zspage(class, newfg, first_page);
+ set_zspage_mapping(first_page, class_idx, newfg);
out:
return newfg;
@@ -809,7 +812,7 @@
unsigned long obj;
if (!page) {
- BUG_ON(obj_idx);
+ VM_BUG_ON(obj_idx);
return NULL;
}
@@ -842,7 +845,7 @@
void *obj)
{
if (class->huge) {
- VM_BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(page), page);
return page_private(page);
} else
return *(unsigned long *)obj;
@@ -892,8 +895,8 @@
{
struct page *nextp, *tmp, *head_extra;
- BUG_ON(!is_first_page(first_page));
- BUG_ON(first_page->inuse);
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+ VM_BUG_ON_PAGE(first_page->inuse, first_page);
head_extra = (struct page *)page_private(first_page);
@@ -914,12 +917,13 @@
}
/* Initialize a newly allocated zspage */
-static void init_zspage(struct page *first_page, struct size_class *class)
+static void init_zspage(struct size_class *class, struct page *first_page)
{
unsigned long off = 0;
struct page *page = first_page;
- BUG_ON(!is_first_page(first_page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
while (page) {
struct page *next_page;
struct link_free *link;
@@ -1001,7 +1005,7 @@
prev_page = page;
}
- init_zspage(first_page, class);
+ init_zspage(class, first_page);
first_page->freelist = location_to_obj(first_page, 0);
/* Maximum number of objects we can store in this zspage */
@@ -1234,11 +1238,11 @@
return true;
}
-static bool zspage_full(struct page *page)
+static bool zspage_full(struct page *first_page)
{
- BUG_ON(!is_first_page(page));
+ VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
- return page->inuse == page->objects;
+ return first_page->inuse == first_page->objects;
}
unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1274,14 +1278,12 @@
struct page *pages[2];
void *ret;
- BUG_ON(!handle);
-
/*
* Because we use per-cpu mapping areas shared among the
* pools/users, we can't allow mapping in interrupt context
* because it can corrupt another users mappings.
*/
- BUG_ON(in_interrupt());
+ WARN_ON_ONCE(in_interrupt());
/* From now on, migration cannot move the object */
pin_tag(handle);
@@ -1325,8 +1327,6 @@
struct size_class *class;
struct mapping_area *area;
- BUG_ON(!handle);
-
obj = handle_to_obj(handle);
obj_to_location(obj, &page, &obj_idx);
get_zspage_mapping(get_first_page(page), &class_idx, &fg);
@@ -1350,8 +1350,8 @@
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
-static unsigned long obj_malloc(struct page *first_page,
- struct size_class *class, unsigned long handle)
+static unsigned long obj_malloc(struct size_class *class,
+ struct page *first_page, unsigned long handle)
{
unsigned long obj;
struct link_free *link;
@@ -1391,7 +1391,7 @@
* otherwise 0.
* Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
*/
-unsigned long zs_malloc(struct zs_pool *pool, size_t size)
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
{
unsigned long handle, obj;
struct size_class *class;
@@ -1400,7 +1400,7 @@
if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
return 0;
- handle = alloc_handle(pool);
+ handle = alloc_handle(pool, gfp);
if (!handle)
return 0;
@@ -1413,7 +1413,7 @@
if (!first_page) {
spin_unlock(&class->lock);
- first_page = alloc_zspage(class, pool->flags);
+ first_page = alloc_zspage(class, gfp);
if (unlikely(!first_page)) {
free_handle(pool, handle);
return 0;
@@ -1428,7 +1428,7 @@
class->size, class->pages_per_zspage));
}
- obj = obj_malloc(first_page, class, handle);
+ obj = obj_malloc(class, first_page, handle);
/* Now move the zspage to another fullness group, if required */
fix_fullness_group(class, first_page);
record_obj(handle, obj);
@@ -1438,16 +1438,13 @@
}
EXPORT_SYMBOL_GPL(zs_malloc);
-static void obj_free(struct zs_pool *pool, struct size_class *class,
- unsigned long obj)
+static void obj_free(struct size_class *class, unsigned long obj)
{
struct link_free *link;
struct page *first_page, *f_page;
unsigned long f_objidx, f_offset;
void *vaddr;
- BUG_ON(!obj);
-
obj &= ~OBJ_ALLOCATED_TAG;
obj_to_location(obj, &f_page, &f_objidx);
first_page = get_first_page(f_page);
@@ -1487,7 +1484,7 @@
class = pool->size_class[class_idx];
spin_lock(&class->lock);
- obj_free(pool, class, obj);
+ obj_free(class, obj);
fullness = fix_fullness_group(class, first_page);
if (fullness == ZS_EMPTY) {
zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
@@ -1503,8 +1500,8 @@
}
EXPORT_SYMBOL_GPL(zs_free);
-static void zs_object_copy(unsigned long dst, unsigned long src,
- struct size_class *class)
+static void zs_object_copy(struct size_class *class, unsigned long dst,
+ unsigned long src)
{
struct page *s_page, *d_page;
unsigned long s_objidx, d_objidx;
@@ -1547,7 +1544,6 @@
kunmap_atomic(d_addr);
kunmap_atomic(s_addr);
s_page = get_next_page(s_page);
- BUG_ON(!s_page);
s_addr = kmap_atomic(s_page);
d_addr = kmap_atomic(d_page);
s_size = class->size - written;
@@ -1557,7 +1553,6 @@
if (d_off >= PAGE_SIZE) {
kunmap_atomic(d_addr);
d_page = get_next_page(d_page);
- BUG_ON(!d_page);
d_addr = kmap_atomic(d_page);
d_size = class->size - written;
d_off = 0;
@@ -1572,8 +1567,8 @@
* Find alloced object in zspage from index object and
* return handle.
*/
-static unsigned long find_alloced_obj(struct page *page, int index,
- struct size_class *class)
+static unsigned long find_alloced_obj(struct size_class *class,
+ struct page *page, int index)
{
unsigned long head;
int offset = 0;
@@ -1623,7 +1618,7 @@
int ret = 0;
while (1) {
- handle = find_alloced_obj(s_page, index, class);
+ handle = find_alloced_obj(class, s_page, index);
if (!handle) {
s_page = get_next_page(s_page);
if (!s_page)
@@ -1640,8 +1635,8 @@
}
used_obj = handle_to_obj(handle);
- free_obj = obj_malloc(d_page, class, handle);
- zs_object_copy(free_obj, used_obj, class);
+ free_obj = obj_malloc(class, d_page, handle);
+ zs_object_copy(class, free_obj, used_obj);
index++;
/*
* record_obj updates handle's value to free_obj and it will
@@ -1652,7 +1647,7 @@
free_obj |= BIT(HANDLE_PIN_BIT);
record_obj(handle, free_obj);
unpin_tag(handle);
- obj_free(pool, class, used_obj);
+ obj_free(class, used_obj);
}
/* Remember last position in this iteration */
@@ -1670,7 +1665,7 @@
for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
page = class->fullness_list[i];
if (page) {
- remove_zspage(page, class, i);
+ remove_zspage(class, i, page);
break;
}
}
@@ -1692,10 +1687,8 @@
{
enum fullness_group fullness;
- BUG_ON(!is_first_page(first_page));
-
fullness = get_fullness_group(first_page);
- insert_zspage(first_page, class, fullness);
+ insert_zspage(class, fullness, first_page);
set_zspage_mapping(first_page, class->index, fullness);
if (fullness == ZS_EMPTY) {
@@ -1720,7 +1713,7 @@
if (!page)
continue;
- remove_zspage(page, class, i);
+ remove_zspage(class, i, page);
break;
}
@@ -1757,8 +1750,6 @@
spin_lock(&class->lock);
while ((src_page = isolate_source_page(class))) {
- BUG_ON(!is_first_page(src_page));
-
if (!zs_can_compact(class))
break;
@@ -1887,7 +1878,7 @@
* On success, a pointer to the newly created pool is returned,
* otherwise NULL.
*/
-struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
+struct zs_pool *zs_create_pool(const char *name)
{
int i;
struct zs_pool *pool;
@@ -1957,10 +1948,8 @@
prev_class = class;
}
- pool->flags = flags;
-
- if (zs_pool_stat_create(name, pool))
- goto err;
+ /* debug only, don't abort if it fails */
+ zs_pool_stat_create(pool, name);
/*
* Not critical, we still can use the pool
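
To make the zsmalloc interface change concrete, the following minimal sketch (hypothetical, not part of the patch) shows the new calling convention: the pool no longer carries a gfp mask, and every zs_malloc() call supplies its own. my_zs_example() is an invented name.

#include <linux/zsmalloc.h>
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/errno.h>

static int my_zs_example(const void *src, size_t len)
{
	struct zs_pool *pool;
	unsigned long handle;
	void *dst;

	pool = zs_create_pool("example");	/* was zs_create_pool(name, flags) */
	if (!pool)
		return -ENOMEM;

	handle = zs_malloc(pool, len, GFP_KERNEL);	/* gfp passed per call */
	if (!handle) {
		zs_destroy_pool(pool);
		return -ENOMEM;
	}

	dst = zs_map_object(pool, handle, ZS_MM_WO);
	memcpy(dst, src, len);
	zs_unmap_object(pool, handle);

	zs_free(pool, handle);
	zs_destroy_pool(pool);
	return 0;
}
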
diff --git a/mm/zswap.c b/mm/zswap.c
index de0f119b..275b22c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -117,7 +117,7 @@
struct crypto_comp * __percpu *tfm;
struct kref kref;
struct list_head list;
- struct rcu_head rcu_head;
+ struct work_struct work;
struct notifier_block notifier;
char tfm_name[CRYPTO_MAX_ALG_NAME];
};
@@ -658,9 +658,11 @@
return kref_get_unless_zero(&pool->kref);
}
-static void __zswap_pool_release(struct rcu_head *head)
+static void __zswap_pool_release(struct work_struct *work)
{
- struct zswap_pool *pool = container_of(head, typeof(*pool), rcu_head);
+ struct zswap_pool *pool = container_of(work, typeof(*pool), work);
+
+ synchronize_rcu();
/* nobody should have been able to get a kref... */
WARN_ON(kref_get_unless_zero(&pool->kref));
@@ -680,7 +682,9 @@
WARN_ON(pool == zswap_pool_current());
list_del_rcu(&pool->list);
- call_rcu(&pool->rcu_head, __zswap_pool_release);
+
+ INIT_WORK(&pool->work, __zswap_pool_release);
+ schedule_work(&pool->work);
spin_unlock(&zswap_pools_lock);
}
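
The zswap hunk above replaces call_rcu() with a work item because the pool release path may sleep (it tears down the per-cpu compressors and the zpool), which is not allowed from an RCU callback; calling synchronize_rcu() from the work handler, which runs in process context, preserves the grace-period guarantee. A generic, hypothetical sketch of that pattern (not taken from zswap; struct my_obj and its helpers are invented):

#include <linux/workqueue.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/slab.h>

struct my_obj {
	struct list_head list;
	struct work_struct work;
};

static void my_obj_release(struct work_struct *work)
{
	struct my_obj *obj = container_of(work, struct my_obj, work);

	/* Runs in process context, so waiting for RCU readers is allowed. */
	synchronize_rcu();
	kfree(obj);	/* teardown that may sleep goes here */
}

static void my_obj_put(struct my_obj *obj)
{
	list_del_rcu(&obj->list);
	INIT_WORK(&obj->work, my_obj_release);
	schedule_work(&obj->work);
}
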
diff --git a/samples/kprobes/jprobe_example.c b/samples/kprobes/jprobe_example.c
index c285a3b..c3108bb 100644
--- a/samples/kprobes/jprobe_example.c
+++ b/samples/kprobes/jprobe_example.c
@@ -25,7 +25,7 @@
/* Proxy routine having the same arguments as actual _do_fork() routine */
static long j_do_fork(unsigned long clone_flags, unsigned long stack_start,
unsigned long stack_size, int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr, unsigned long tls)
{
pr_info("jprobe: clone_flags = 0x%lx, stack_start = 0x%lx "
"stack_size = 0x%lx\n", clone_flags, stack_start, stack_size);
diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c
index 727eb21..ed0ca0c 100644
--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -14,33 +14,37 @@
#include <linux/module.h>
#include <linux/kprobes.h>
+#define MAX_SYMBOL_LEN 64
+static char symbol[MAX_SYMBOL_LEN] = "_do_fork";
+module_param_string(symbol, symbol, sizeof(symbol), 0644);
+
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
- .symbol_name = "_do_fork",
+ .symbol_name = symbol,
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
#ifdef CONFIG_X86
- printk(KERN_INFO "pre_handler: p->addr = 0x%p, ip = %lx,"
+ printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, ip = %lx,"
" flags = 0x%lx\n",
- p->addr, regs->ip, regs->flags);
+ p->symbol_name, p->addr, regs->ip, regs->flags);
#endif
#ifdef CONFIG_PPC
- printk(KERN_INFO "pre_handler: p->addr = 0x%p, nip = 0x%lx,"
+ printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, nip = 0x%lx,"
" msr = 0x%lx\n",
- p->addr, regs->nip, regs->msr);
+ p->symbol_name, p->addr, regs->nip, regs->msr);
#endif
#ifdef CONFIG_MIPS
- printk(KERN_INFO "pre_handler: p->addr = 0x%p, epc = 0x%lx,"
+ printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, epc = 0x%lx,"
" status = 0x%lx\n",
- p->addr, regs->cp0_epc, regs->cp0_status);
+ p->symbol_name, p->addr, regs->cp0_epc, regs->cp0_status);
#endif
#ifdef CONFIG_TILEGX
- printk(KERN_INFO "pre_handler: p->addr = 0x%p, pc = 0x%lx,"
+ printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, pc = 0x%lx,"
" ex1 = 0x%lx\n",
- p->addr, regs->pc, regs->ex1);
+ p->symbol_name, p->addr, regs->pc, regs->ex1);
#endif
/* A dump_stack() here will give a stack backtrace */
@@ -52,20 +56,20 @@
unsigned long flags)
{
#ifdef CONFIG_X86
- printk(KERN_INFO "post_handler: p->addr = 0x%p, flags = 0x%lx\n",
- p->addr, regs->flags);
+ printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, flags = 0x%lx\n",
+ p->symbol_name, p->addr, regs->flags);
#endif
#ifdef CONFIG_PPC
- printk(KERN_INFO "post_handler: p->addr = 0x%p, msr = 0x%lx\n",
- p->addr, regs->msr);
+ printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, msr = 0x%lx\n",
+ p->symbol_name, p->addr, regs->msr);
#endif
#ifdef CONFIG_MIPS
- printk(KERN_INFO "post_handler: p->addr = 0x%p, status = 0x%lx\n",
- p->addr, regs->cp0_status);
+ printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, status = 0x%lx\n",
+ p->symbol_name, p->addr, regs->cp0_status);
#endif
#ifdef CONFIG_TILEGX
- printk(KERN_INFO "post_handler: p->addr = 0x%p, ex1 = 0x%lx\n",
- p->addr, regs->ex1);
+ printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, ex1 = 0x%lx\n",
+ p->symbol_name, p->addr, regs->ex1);
#endif
}
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index d574d13..6750595 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -27,12 +27,15 @@
my $terse = 0;
my $showfile = 0;
my $file = 0;
+my $git = 0;
+my %git_commits = ();
my $check = 0;
my $check_orig = 0;
my $summary = 1;
my $mailback = 0;
my $summary_file = 0;
my $show_types = 0;
+my $list_types = 0;
my $fix = 0;
my $fix_inplace = 0;
my $root;
@@ -68,13 +71,24 @@
--emacs emacs compile window format
--terse one line per report
--showfile emit diffed file position, not input file position
+ -g, --git treat FILE as a single commit or git revision range
+ single git commit with:
+ <rev>
+ <rev>^
+ <rev>~n
+ multiple git commits with:
+ <rev1>..<rev2>
+ <rev1>...<rev2>
+ <rev>-<count>
+ git merges are ignored
-f, --file treat FILE as regular source file
--subjective, --strict enable more subjective tests
+ --list-types list the possible message types
--types TYPE(,TYPE2...) show only these comma separated message types
--ignore TYPE(,TYPE2...) ignore various comma separated message types
+ --show-types show the specific message type in the output
--max-line-length=n set the maximum line length, if exceeded, warn
--min-conf-desc-length=n set the min description length, if shorter, warn
- --show-types show the message "types" in the output
--root=PATH PATH to the kernel tree root
--no-summary suppress the per-file summary
--mailback only produce a report in case of warnings/errors
@@ -106,6 +120,37 @@
exit($exitcode);
}
+sub uniq {
+ my %seen;
+ return grep { !$seen{$_}++ } @_;
+}
+
+sub list_types {
+ my ($exitcode) = @_;
+
+ my $count = 0;
+
+ local $/ = undef;
+
+ open(my $script, '<', abs_path($P)) or
+ die "$P: Can't read '$P' $!\n";
+
+ my $text = <$script>;
+ close($script);
+
+ my @types = ();
+ for ($text =~ /\b(?:(?:CHK|WARN|ERROR)\s*\(\s*"([^"]+)")/g) {
+ push (@types, $_);
+ }
+ @types = sort(uniq(@types));
+ print("#\tMessage type\n\n");
+ foreach my $type (@types) {
+ print(++$count . "\t" . $type . "\n");
+ }
+
+ exit($exitcode);
+}
+
my $conf = which_conf($configuration_file);
if (-f $conf) {
my @conf_args;
@@ -141,11 +186,13 @@
'terse!' => \$terse,
'showfile!' => \$showfile,
'f|file!' => \$file,
+ 'g|git!' => \$git,
'subjective!' => \$check,
'strict!' => \$check,
'ignore=s' => \@ignore,
'types=s' => \@use,
'show-types!' => \$show_types,
+ 'list-types!' => \$list_types,
'max-line-length=i' => \$max_line_length,
'min-conf-desc-length=i' => \$min_conf_desc_length,
'root=s' => \$root,
@@ -166,6 +213,8 @@
help(0) if ($help);
+list_types(0) if ($list_types);
+
$fix = 1 if ($fix_inplace);
$check_orig = $check;
@@ -752,10 +801,42 @@
my @fixed_deleted = ();
my $fixlinenr = -1;
+# If the input is git commits, extract all commits from the commit expressions.
+# For example, HEAD-3 means we need to check 'HEAD, HEAD~1, HEAD~2'.
+die "$P: No git repository found\n" if ($git && !-e ".git");
+
+if ($git) {
+ my @commits = ();
+ foreach my $commit_expr (@ARGV) {
+ my $git_range;
+ if ($commit_expr =~ m/^(.*)-(\d+)$/) {
+ $git_range = "-$2 $1";
+ } elsif ($commit_expr =~ m/\.\./) {
+ $git_range = "$commit_expr";
+ } else {
+ $git_range = "-1 $commit_expr";
+ }
+ my $lines = `git log --no-color --no-merges --pretty=format:'%H %s' $git_range`;
+ foreach my $line (split(/\n/, $lines)) {
+ $line =~ /^([0-9a-fA-F]{40,40}) (.*)$/;
+ next if (!defined($1) || !defined($2));
+ my $sha1 = $1;
+ my $subject = $2;
+ unshift(@commits, $sha1);
+ $git_commits{$sha1} = $subject;
+ }
+ }
+ die "$P: no git commits after extraction!\n" if (@commits == 0);
+ @ARGV = @commits;
+}
+
my $vname;
for my $filename (@ARGV) {
my $FILE;
- if ($file) {
+ if ($git) {
+ open($FILE, '-|', "git format-patch -M --stdout -1 $filename") ||
+ die "$P: $filename: git format-patch failed - $!\n";
+ } elsif ($file) {
open($FILE, '-|', "diff -u /dev/null $filename") ||
die "$P: $filename: diff failed - $!\n";
} elsif ($filename eq '-') {
@@ -766,6 +847,8 @@
}
if ($filename eq '-') {
$vname = 'Your patch';
+ } elsif ($git) {
+ $vname = "Commit " . substr($filename, 0, 12) . ' ("' . $git_commits{$filename} . '")';
} else {
$vname = $filename;
}
@@ -2755,6 +2838,19 @@
"Logical continuations should be on the previous line\n" . $hereprev);
}
+# check indentation starts on a tab stop
+ if ($^V && $^V ge 5.10.0 &&
+ $sline =~ /^\+\t+( +)(?:$c90_Keywords\b|\{\s*$|\}\s*(?:else\b|while\b|\s*$))/) {
+ my $indent = length($1);
+ if ($indent % 8) {
+ if (WARN("TABSTOP",
+ "Statements should start on a tabstop\n" . $herecurr) &&
+ $fix) {
+ $fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/8)@e;
+ }
+ }
+ }
+
# check multi-line statement indentation matches previous line
if ($^V && $^V ge 5.10.0 &&
$prevline =~ /^\+([ \t]*)((?:$c90_Keywords(?:\s+if)\s*)|(?:$Declare\s*)?(?:$Ident|\(\s*\*\s*$Ident\s*\))\s*|$Ident\s*=\s*$Ident\s*)\(.*(\&\&|\|\||,)\s*$/) {
@@ -4291,7 +4387,7 @@
my $comp = $3;
my $to = $4;
my $newcomp = $comp;
- if ($lead !~ /$Operators\s*$/ &&
+ if ($lead !~ /(?:$Operators|\.)\s*$/ &&
$to !~ /^(?:Constant|[A-Z_][A-Z0-9_]*)$/ &&
WARN("CONSTANT_COMPARISON",
"Comparisons should place the constant on the right side of the test\n" . $herecurr) &&
@@ -5637,6 +5733,16 @@
}
}
+# check for #if defined CONFIG_<FOO> || defined CONFIG_<FOO>_MODULE
+ if ($line =~ /^\+\s*#\s*if\s+defined(?:\s*\(?\s*|\s+)(CONFIG_[A-Z_]+)\s*\)?\s*\|\|\s*defined(?:\s*\(?\s*|\s+)\1_MODULE\s*\)?\s*$/) {
+ my $config = $1;
+ if (WARN("PREFER_IS_ENABLED",
+ "Prefer IS_ENABLED(<FOO>) to CONFIG_<FOO> || CONFIG_<FOO>_MODULE\n" . $herecurr) &&
+ $fix) {
+ $fixed[$fixlinenr] = "\+#if IS_ENABLED($config)";
+ }
+ }
+
# check for case / default statements not preceded by break/fallthrough/switch
if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) {
my $has_break = 0;
@@ -5827,6 +5933,28 @@
}
}
+# whine about ACCESS_ONCE
+ if ($^V && $^V ge 5.10.0 &&
+ $line =~ /\bACCESS_ONCE\s*$balanced_parens\s*(=(?!=))?\s*($FuncArg)?/) {
+ my $par = $1;
+ my $eq = $2;
+ my $fun = $3;
+ $par =~ s/^\(\s*(.*)\s*\)$/$1/;
+ if (defined($eq)) {
+ if (WARN("PREFER_WRITE_ONCE",
+ "Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR>\n" . $herecurr) &&
+ $fix) {
+ $fixed[$fixlinenr] =~ s/\bACCESS_ONCE\s*\(\s*\Q$par\E\s*\)\s*$eq\s*\Q$fun\E/WRITE_ONCE($par, $fun)/;
+ }
+ } else {
+ if (WARN("PREFER_READ_ONCE",
+ "Prefer READ_ONCE(<FOO>) over ACCESS_ONCE(<FOO>)\n" . $herecurr) &&
+ $fix) {
+ $fixed[$fixlinenr] =~ s/\bACCESS_ONCE\s*\(\s*\Q$par\E\s*\)/READ_ONCE($par)/;
+ }
+ }
+ }
+
# check for lockdep_set_novalidate_class
if ($line =~ /^.\s*lockdep_set_novalidate_class\s*\(/ ||
$line =~ /__lockdep_no_validate__\s*\)/ ) {
@@ -5930,6 +6058,14 @@
}
if ($quiet == 0) {
+ # If any defects were found and we are not already fixing them
+ if (!$clean and !$fix) {
+ print << "EOM"
+
+NOTE: For some of the reported defects, checkpatch may be able to
+ mechanically convert to the typical style using --fix or --fix-inplace.
+EOM
+ }
# If there were whitespace errors which cleanpatch can fix
# then suggest that.
if ($rpt_cleaners) {
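
To illustrate the two new checkpatch messages added above, the following hypothetical C fragment shows the patterns they flag and the rewrites they suggest (CONFIG_FOO and the function names are made up for the example):

#include <linux/kconfig.h>
#include <linux/compiler.h>

/* PREFER_IS_ENABLED: checkpatch now warns about this form ...          */
#if defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)
static void foo_setup_old(void) { }
#endif

/* ... and (with --fix) rewrites it to:                                  */
#if IS_ENABLED(CONFIG_FOO)
static void foo_setup_new(void) { }
#endif

/* PREFER_READ_ONCE / PREFER_WRITE_ONCE: instead of ACCESS_ONCE() ...    */
static int get_flag_old(int *flag)  { return ACCESS_ONCE(*flag); }
static void set_flag_old(int *flag) { ACCESS_ONCE(*flag) = 1; }

/* ... the suggested replacements are:                                   */
static int get_flag_new(int *flag)  { return READ_ONCE(*flag); }
static void set_flag_new(int *flag) { WRITE_ONCE(*flag, 1); }
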
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 3cd0a58..0f887a5 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -972,7 +972,7 @@
int ima_policy_show(struct seq_file *m, void *v)
{
struct ima_rule_entry *entry = v;
- int i = 0;
+ int i;
char tbuf[64] = {0,};
rcu_read_lock();
@@ -1012,17 +1012,7 @@
}
if (entry->flags & IMA_FSUUID) {
- seq_puts(m, "fsuuid=");
- for (i = 0; i < ARRAY_SIZE(entry->fsuuid); ++i) {
- switch (i) {
- case 4:
- case 6:
- case 8:
- case 10:
- seq_puts(m, "-");
- }
- seq_printf(m, "%x", entry->fsuuid[i]);
- }
+ seq_printf(m, "fsuuid=%pU", entry->fsuuid);
seq_puts(m, " ");
}
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 604212d..3b53046 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -3,7 +3,7 @@
LDFLAGS += -lpthread -lurcu
TARGETS = main
OFILES = main.o radix-tree.o linux.o test.o tag_check.o find_next_bit.o \
- regression1.o regression2.o regression3.o
+ regression1.o regression2.o regression3.o multiorder.o
targets: $(TARGETS)
@@ -13,7 +13,7 @@
clean:
$(RM) -f $(TARGETS) *.o radix-tree.c
-$(OFILES): *.h */*.h
+$(OFILES): *.h */*.h ../../../include/linux/radix-tree.h ../../include/linux/*.h
radix-tree.c: ../../../lib/radix-tree.c
sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
diff --git a/tools/testing/radix-tree/generated/autoconf.h b/tools/testing/radix-tree/generated/autoconf.h
new file mode 100644
index 0000000..ad18cf5
--- /dev/null
+++ b/tools/testing/radix-tree/generated/autoconf.h
@@ -0,0 +1,3 @@
+#define CONFIG_RADIX_TREE_MULTIORDER 1
+#define CONFIG_SHMEM 1
+#define CONFIG_SWAP 1
diff --git a/tools/testing/radix-tree/linux/init.h b/tools/testing/radix-tree/linux/init.h
new file mode 100644
index 0000000..360cabb
--- /dev/null
+++ b/tools/testing/radix-tree/linux/init.h
@@ -0,0 +1 @@
+/* An empty file stub that allows radix-tree.c to compile. */
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index ae013b0..be98a47b 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -7,19 +7,28 @@
#include <stddef.h>
#include <limits.h>
+#include "../../include/linux/compiler.h"
+#include "../../../include/linux/kconfig.h"
+
+#define RADIX_TREE_MAP_SHIFT 3
+
#ifndef NULL
#define NULL 0
#endif
#define BUG_ON(expr) assert(!(expr))
+#define WARN_ON(expr) assert(!(expr))
#define __init
#define __must_check
#define panic(expr)
#define printk printf
#define __force
-#define likely(c) (c)
-#define unlikely(c) (c)
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
+#define pr_debug printk
+
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#define cpu_relax() barrier()
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
@@ -28,6 +37,8 @@
(type *)( (char *)__mptr - offsetof(type, member) );})
#define min(a, b) ((a) < (b) ? (a) : (b))
+#define cond_resched() sched_yield()
+
static inline int in_interrupt(void)
{
return 0;
diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
index 57282506..6d5a347 100644
--- a/tools/testing/radix-tree/linux/slab.h
+++ b/tools/testing/radix-tree/linux/slab.h
@@ -3,7 +3,6 @@
#include <linux/types.h>
-#define GFP_KERNEL 1
#define SLAB_HWCACHE_ALIGN 1
#define SLAB_PANIC 2
#define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */
diff --git a/tools/testing/radix-tree/linux/types.h b/tools/testing/radix-tree/linux/types.h
index 72a9d85..faa0b6f 100644
--- a/tools/testing/radix-tree/linux/types.h
+++ b/tools/testing/radix-tree/linux/types.h
@@ -1,15 +1,13 @@
#ifndef _TYPES_H
#define _TYPES_H
+#include "../../include/linux/types.h"
+
#define __rcu
#define __read_mostly
#define BITS_PER_LONG (sizeof(long) * 8)
-struct list_head {
- struct list_head *next, *prev;
-};
-
static inline void INIT_LIST_HEAD(struct list_head *list)
{
list->next = list;
@@ -22,7 +20,6 @@
#define uninitialized_var(x) x = x
-typedef unsigned gfp_t;
#include <linux/gfp.h>
#endif
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index 0e83cad..b7619ff 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -61,11 +61,11 @@
} while (!wrapped);
}
-void big_gang_check(void)
+void big_gang_check(bool long_run)
{
int i;
- for (i = 0; i < 1000; i++) {
+ for (i = 0; i < (long_run ? 1000 : 3); i++) {
__big_gang_check();
srand(time(0));
printf("%d ", i);
@@ -232,10 +232,72 @@
item_kill_tree(&tree);
}
-static void single_thread_tests(void)
+static void __locate_check(struct radix_tree_root *tree, unsigned long index,
+ unsigned order)
+{
+ struct item *item;
+ unsigned long index2;
+
+ item_insert_order(tree, index, order);
+ item = item_lookup(tree, index);
+ index2 = radix_tree_locate_item(tree, item);
+ if (index != index2) {
+ printf("index %ld order %d inserted; found %ld\n",
+ index, order, index2);
+ abort();
+ }
+}
+
+static void __order_0_locate_check(void)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+ int i;
+
+ for (i = 0; i < 50; i++)
+ __locate_check(&tree, rand() % INT_MAX, 0);
+
+ item_kill_tree(&tree);
+}
+
+static void locate_check(void)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+ unsigned order;
+ unsigned long offset, index;
+
+ __order_0_locate_check();
+
+ for (order = 0; order < 20; order++) {
+ for (offset = 0; offset < (1 << (order + 3));
+ offset += (1UL << order)) {
+ for (index = 0; index < (1UL << (order + 5));
+ index += (1UL << order)) {
+ __locate_check(&tree, index + offset, order);
+ }
+ if (radix_tree_locate_item(&tree, &tree) != -1)
+ abort();
+
+ item_kill_tree(&tree);
+ }
+ }
+
+ if (radix_tree_locate_item(&tree, &tree) != -1)
+ abort();
+ __locate_check(&tree, -1, 0);
+ if (radix_tree_locate_item(&tree, &tree) != -1)
+ abort();
+ item_kill_tree(&tree);
+}
+
+static void single_thread_tests(bool long_run)
{
int i;
+ printf("starting single_thread_tests: %d allocated\n", nr_allocated);
+ multiorder_checks();
+ printf("after multiorder_check: %d allocated\n", nr_allocated);
+ locate_check();
+ printf("after locate_check: %d allocated\n", nr_allocated);
tag_check();
printf("after tag_check: %d allocated\n", nr_allocated);
gang_check();
@@ -244,9 +306,9 @@
printf("after add_and_check: %d allocated\n", nr_allocated);
dynamic_height_check();
printf("after dynamic_height_check: %d allocated\n", nr_allocated);
- big_gang_check();
+ big_gang_check(long_run);
printf("after big_gang_check: %d allocated\n", nr_allocated);
- for (i = 0; i < 2000; i++) {
+ for (i = 0; i < (long_run ? 2000 : 3); i++) {
copy_tag_check();
printf("%d ", i);
fflush(stdout);
@@ -254,15 +316,23 @@
printf("after copy_tag_check: %d allocated\n", nr_allocated);
}
-int main(void)
+int main(int argc, char **argv)
{
+ bool long_run = false;
+ int opt;
+
+ while ((opt = getopt(argc, argv, "l")) != -1) {
+ if (opt == 'l')
+ long_run = true;
+ }
+
rcu_register_thread();
radix_tree_init();
regression1_test();
regression2_test();
regression3_test();
- single_thread_tests();
+ single_thread_tests(long_run);
sleep(1);
printf("after sleep(1): %d allocated\n", nr_allocated);
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
new file mode 100644
index 0000000..39d9b95
--- /dev/null
+++ b/tools/testing/radix-tree/multiorder.c
@@ -0,0 +1,337 @@
+/*
+ * multiorder.c: Multi-order radix tree entry testing
+ * Copyright (c) 2016 Intel Corporation
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/radix-tree.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+#include "test.h"
+
+#define for_each_index(i, base, order) \
+ for (i = base; i < base + (1 << order); i++)
+
+static void __multiorder_tag_test(int index, int order)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+ int base, err, i;
+ unsigned long first = 0;
+
+ /* our canonical entry */
+ base = index & ~((1 << order) - 1);
+
+ printf("Multiorder tag test with index %d, canonical entry %d\n",
+ index, base);
+
+ err = item_insert_order(&tree, index, order);
+ assert(!err);
+
+ /*
+ * Verify we get collisions for covered indices. We try and fail to
+ * insert an exceptional entry so we don't leak memory via
+ * item_insert_order().
+ */
+ for_each_index(i, base, order) {
+ err = __radix_tree_insert(&tree, i, order,
+ (void *)(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY));
+ assert(err == -EEXIST);
+ }
+
+ for_each_index(i, base, order) {
+ assert(!radix_tree_tag_get(&tree, i, 0));
+ assert(!radix_tree_tag_get(&tree, i, 1));
+ }
+
+ assert(radix_tree_tag_set(&tree, index, 0));
+
+ for_each_index(i, base, order) {
+ assert(radix_tree_tag_get(&tree, i, 0));
+ assert(!radix_tree_tag_get(&tree, i, 1));
+ }
+
+ assert(radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, 10, 0, 1) == 1);
+ assert(radix_tree_tag_clear(&tree, index, 0));
+
+ for_each_index(i, base, order) {
+ assert(!radix_tree_tag_get(&tree, i, 0));
+ assert(radix_tree_tag_get(&tree, i, 1));
+ }
+
+ assert(radix_tree_tag_clear(&tree, index, 1));
+
+ assert(!radix_tree_tagged(&tree, 0));
+ assert(!radix_tree_tagged(&tree, 1));
+
+ item_kill_tree(&tree);
+}
+
+static void multiorder_tag_tests(void)
+{
+ /* test multi-order entry for indices 0-7 with no sibling pointers */
+ __multiorder_tag_test(0, 3);
+ __multiorder_tag_test(5, 3);
+
+ /* test multi-order entry for indices 8-15 with no sibling pointers */
+ __multiorder_tag_test(8, 3);
+ __multiorder_tag_test(15, 3);
+
+ /*
+ * Our order 5 entry covers indices 0-31 in a tree with height=2.
+ * This is broken up as follows:
+ * 0-7: canonical entry
+ * 8-15: sibling 1
+ * 16-23: sibling 2
+ * 24-31: sibling 3
+ */
+ __multiorder_tag_test(0, 5);
+ __multiorder_tag_test(29, 5);
+
+ /* same test, but with indices 32-63 */
+ __multiorder_tag_test(32, 5);
+ __multiorder_tag_test(44, 5);
+
+ /*
+ * Our order 8 entry covers indices 0-255 in a tree with height=3.
+ * This is broken up as follows:
+ * 0-63: canonical entry
+ * 64-127: sibling 1
+ * 128-191: sibling 2
+ * 192-255: sibling 3
+ */
+ __multiorder_tag_test(0, 8);
+ __multiorder_tag_test(190, 8);
+
+ /* same test, but with indices 256-511 */
+ __multiorder_tag_test(256, 8);
+ __multiorder_tag_test(300, 8);
+
+ __multiorder_tag_test(0x12345678UL, 8);
+}
+
+static void multiorder_check(unsigned long index, int order)
+{
+ unsigned long i;
+ unsigned long min = index & ~((1UL << order) - 1);
+ unsigned long max = min + (1UL << order);
+ RADIX_TREE(tree, GFP_KERNEL);
+
+ printf("Multiorder index %ld, order %d\n", index, order);
+
+ assert(item_insert_order(&tree, index, order) == 0);
+
+ for (i = min; i < max; i++) {
+ struct item *item = item_lookup(&tree, i);
+ assert(item != 0);
+ assert(item->index == index);
+ }
+ for (i = 0; i < min; i++)
+ item_check_absent(&tree, i);
+ for (i = max; i < 2*max; i++)
+ item_check_absent(&tree, i);
+ for (i = min; i < max; i++) {
+ static void *entry = (void *)
+ (0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY);
+ assert(radix_tree_insert(&tree, i, entry) == -EEXIST);
+ }
+
+ assert(item_delete(&tree, index) != 0);
+
+ for (i = 0; i < 2*max; i++)
+ item_check_absent(&tree, i);
+}
+
+static void multiorder_shrink(unsigned long index, int order)
+{
+ unsigned long i;
+ unsigned long max = 1 << order;
+ RADIX_TREE(tree, GFP_KERNEL);
+ struct radix_tree_node *node;
+
+ printf("Multiorder shrink index %ld, order %d\n", index, order);
+
+ assert(item_insert_order(&tree, 0, order) == 0);
+
+ node = tree.rnode;
+
+ assert(item_insert(&tree, index) == 0);
+ assert(node != tree.rnode);
+
+ assert(item_delete(&tree, index) != 0);
+ assert(node == tree.rnode);
+
+ for (i = 0; i < max; i++) {
+ struct item *item = item_lookup(&tree, i);
+ assert(item != 0);
+ assert(item->index == 0);
+ }
+ for (i = max; i < 2*max; i++)
+ item_check_absent(&tree, i);
+
+ if (!item_delete(&tree, 0)) {
+ printf("failed to delete index %ld (order %d)\n", index, order); abort();
+ }
+
+ for (i = 0; i < 2*max; i++)
+ item_check_absent(&tree, i);
+}
+
+static void multiorder_insert_bug(void)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+
+ item_insert(&tree, 0);
+ radix_tree_tag_set(&tree, 0, 0);
+ item_insert_order(&tree, 3 << 6, 6);
+
+ item_kill_tree(&tree);
+}
+
+void multiorder_iteration(void)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+ struct radix_tree_iter iter;
+ void **slot;
+ int i, j, err;
+
+ printf("Multiorder iteration test\n");
+
+#define NUM_ENTRIES 11
+ int index[NUM_ENTRIES] = {0, 2, 4, 8, 16, 32, 34, 36, 64, 72, 128};
+ int order[NUM_ENTRIES] = {1, 1, 2, 3, 4, 1, 0, 1, 3, 0, 7};
+
+ for (i = 0; i < NUM_ENTRIES; i++) {
+ err = item_insert_order(&tree, index[i], order[i]);
+ assert(!err);
+ }
+
+ for (j = 0; j < 256; j++) {
+ for (i = 0; i < NUM_ENTRIES; i++)
+ if (j <= (index[i] | ((1 << order[i]) - 1)))
+ break;
+
+ radix_tree_for_each_slot(slot, &tree, &iter, j) {
+ int height = order[i] / RADIX_TREE_MAP_SHIFT;
+ int shift = height * RADIX_TREE_MAP_SHIFT;
+ int mask = (1 << order[i]) - 1;
+
+ assert(iter.index >= (index[i] &~ mask));
+ assert(iter.index <= (index[i] | mask));
+ assert(iter.shift == shift);
+ i++;
+ }
+ }
+
+ item_kill_tree(&tree);
+}
+
+void multiorder_tagged_iteration(void)
+{
+ RADIX_TREE(tree, GFP_KERNEL);
+ struct radix_tree_iter iter;
+ void **slot;
+ unsigned long first = 0;
+ int i, j;
+
+ printf("Multiorder tagged iteration test\n");
+
+#define MT_NUM_ENTRIES 9
+ int index[MT_NUM_ENTRIES] = {0, 2, 4, 16, 32, 40, 64, 72, 128};
+ int order[MT_NUM_ENTRIES] = {1, 0, 2, 4, 3, 1, 3, 0, 7};
+
+#define TAG_ENTRIES 7
+ int tag_index[TAG_ENTRIES] = {0, 4, 16, 40, 64, 72, 128};
+
+ for (i = 0; i < MT_NUM_ENTRIES; i++)
+ assert(!item_insert_order(&tree, index[i], order[i]));
+
+ assert(!radix_tree_tagged(&tree, 1));
+
+ for (i = 0; i < TAG_ENTRIES; i++)
+ assert(radix_tree_tag_set(&tree, tag_index[i], 1));
+
+ for (j = 0; j < 256; j++) {
+ int mask, k;
+
+ for (i = 0; i < TAG_ENTRIES; i++) {
+ for (k = i; index[k] < tag_index[i]; k++)
+ ;
+ if (j <= (index[k] | ((1 << order[k]) - 1)))
+ break;
+ }
+
+ radix_tree_for_each_tagged(slot, &tree, &iter, j, 1) {
+ for (k = i; index[k] < tag_index[i]; k++)
+ ;
+ mask = (1 << order[k]) - 1;
+
+ assert(iter.index >= (tag_index[i] &~ mask));
+ assert(iter.index <= (tag_index[i] | mask));
+ i++;
+ }
+ }
+
+ radix_tree_range_tag_if_tagged(&tree, &first, ~0UL,
+ MT_NUM_ENTRIES, 1, 2);
+
+ for (j = 0; j < 256; j++) {
+ int mask, k;
+
+ for (i = 0; i < TAG_ENTRIES; i++) {
+ for (k = i; index[k] < tag_index[i]; k++)
+ ;
+ if (j <= (index[k] | ((1 << order[k]) - 1)))
+ break;
+ }
+
+ radix_tree_for_each_tagged(slot, &tree, &iter, j, 2) {
+ for (k = i; index[k] < tag_index[i]; k++)
+ ;
+ mask = (1 << order[k]) - 1;
+
+ assert(iter.index >= (tag_index[i] &~ mask));
+ assert(iter.index <= (tag_index[i] | mask));
+ i++;
+ }
+ }
+
+ first = 1;
+ radix_tree_range_tag_if_tagged(&tree, &first, ~0UL,
+ MT_NUM_ENTRIES, 1, 0);
+ i = 0;
+ radix_tree_for_each_tagged(slot, &tree, &iter, 0, 0) {
+ assert(iter.index == tag_index[i]);
+ i++;
+ }
+
+ item_kill_tree(&tree);
+}
+
+void multiorder_checks(void)
+{
+ int i;
+
+ for (i = 0; i < 20; i++) {
+ multiorder_check(200, i);
+ multiorder_check(0, i);
+ multiorder_check((1UL << i) + 1, i);
+ }
+
+ for (i = 0; i < 15; i++)
+ multiorder_shrink((1UL << (i + RADIX_TREE_MAP_SHIFT)), i);
+
+ multiorder_insert_bug();
+ multiorder_tag_tests();
+ multiorder_iteration();
+ multiorder_tagged_iteration();
+}
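
A hypothetical sketch (not part of the patch) of what the multiorder tests above exercise: a multi-order entry covers a naturally aligned 2^order range of indices, and a lookup anywhere inside that range returns the same entry, while inserts into the range collide. It assumes CONFIG_RADIX_TREE_MULTIORDER; multiorder_demo() and cookie are invented names.

#include <linux/radix-tree.h>
#include <linux/gfp.h>
#include <linux/errno.h>
#include <linux/bug.h>

static int multiorder_demo(void)
{
	RADIX_TREE(tree, GFP_KERNEL);		/* on-stack tree */
	void *cookie = (void *)0x10;		/* any non-NULL, non-internal pointer */
	unsigned long i;
	int err;

	/* One order-2 entry at index 8 claims indices 8, 9, 10 and 11. */
	err = __radix_tree_insert(&tree, 8, 2, cookie);
	if (err)
		return err;

	for (i = 8; i < 12; i++)
		WARN_ON(radix_tree_lookup(&tree, i) != cookie);	/* all hit */
	WARN_ON(radix_tree_lookup(&tree, 12) != NULL);		/* outside the range */

	/* Inserting anywhere in the covered range collides with the entry. */
	WARN_ON(radix_tree_insert(&tree, 10, cookie) != -EEXIST);

	radix_tree_delete(&tree, 8);
	return 0;
}
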
diff --git a/tools/testing/radix-tree/regression2.c b/tools/testing/radix-tree/regression2.c
index 5d2fa28..63bf347 100644
--- a/tools/testing/radix-tree/regression2.c
+++ b/tools/testing/radix-tree/regression2.c
@@ -51,13 +51,6 @@
#include "regression.h"
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
-#endif
-
-#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
#define PAGECACHE_TAG_DIRTY 0
#define PAGECACHE_TAG_WRITEBACK 1
#define PAGECACHE_TAG_TOWRITE 2
diff --git a/tools/testing/radix-tree/tag_check.c b/tools/testing/radix-tree/tag_check.c
index 83136be..b7447ce 100644
--- a/tools/testing/radix-tree/tag_check.c
+++ b/tools/testing/radix-tree/tag_check.c
@@ -12,6 +12,7 @@
static void
__simple_checks(struct radix_tree_root *tree, unsigned long index, int tag)
{
+ unsigned long first = 0;
int ret;
item_check_absent(tree, index);
@@ -22,6 +23,10 @@
item_tag_set(tree, index, tag);
ret = item_tag_get(tree, index, tag);
assert(ret != 0);
+ ret = radix_tree_range_tag_if_tagged(tree, &first, ~0UL, 10, tag, !tag);
+ assert(ret == 1);
+ ret = item_tag_get(tree, index, !tag);
+ assert(ret != 0);
ret = item_delete(tree, index);
assert(ret != 0);
item_insert(tree, index);
@@ -304,6 +309,7 @@
struct item *items[BATCH];
RADIX_TREE(tree, GFP_KERNEL);
int ret;
+ unsigned long first = 0;
item_insert(&tree, 0);
item_tag_set(&tree, 0, 0);
@@ -313,6 +319,10 @@
assert(ret == 0);
verify_tag_consistency(&tree, 0);
verify_tag_consistency(&tree, 1);
+ ret = radix_tree_range_tag_if_tagged(&tree, &first, 10, 10, 0, 1);
+ assert(ret == 1);
+ ret = radix_tree_gang_lookup_tag(&tree, (void **)items, 0, BATCH, 1);
+ assert(ret == 1);
item_kill_tree(&tree);
}
diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index 2bebf34..a6e8099 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -24,14 +24,21 @@
return radix_tree_tag_get(root, index, tag);
}
-int __item_insert(struct radix_tree_root *root, struct item *item)
+int __item_insert(struct radix_tree_root *root, struct item *item,
+ unsigned order)
{
- return radix_tree_insert(root, item->index, item);
+ return __radix_tree_insert(root, item->index, order, item);
}
int item_insert(struct radix_tree_root *root, unsigned long index)
{
- return __item_insert(root, item_create(index));
+ return __item_insert(root, item_create(index), 0);
+}
+
+int item_insert_order(struct radix_tree_root *root, unsigned long index,
+ unsigned order)
+{
+ return __item_insert(root, item_create(index), order);
}
int item_delete(struct radix_tree_root *root, unsigned long index)
@@ -136,13 +143,13 @@
}
static int verify_node(struct radix_tree_node *slot, unsigned int tag,
- unsigned int height, int tagged)
+ int tagged)
{
int anyset = 0;
int i;
int j;
- slot = indirect_to_ptr(slot);
+ slot = entry_to_node(slot);
/* Verify consistency at this level */
for (i = 0; i < RADIX_TREE_TAG_LONGS; i++) {
@@ -152,7 +159,8 @@
}
}
if (tagged != anyset) {
- printf("tag: %u, height %u, tagged: %d, anyset: %d\n", tag, height, tagged, anyset);
+ printf("tag: %u, shift %u, tagged: %d, anyset: %d\n",
+ tag, slot->shift, tagged, anyset);
for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) {
printf("tag %d: ", j);
for (i = 0; i < RADIX_TREE_TAG_LONGS; i++)
@@ -164,10 +172,10 @@
assert(tagged == anyset);
/* Go for next level */
- if (height > 1) {
+ if (slot->shift > 0) {
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
if (slot->slots[i])
- if (verify_node(slot->slots[i], tag, height - 1,
+ if (verify_node(slot->slots[i], tag,
!!test_bit(i, slot->tags[tag]))) {
printf("Failure at off %d\n", i);
for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) {
@@ -184,9 +192,10 @@
void verify_tag_consistency(struct radix_tree_root *root, unsigned int tag)
{
- if (!root->height)
+ struct radix_tree_node *node = root->rnode;
+ if (!radix_tree_is_internal_node(node))
return;
- verify_node(root->rnode, tag, root->height, !!root_tag_get(root, tag));
+ verify_node(node, tag, !!root_tag_get(root, tag));
}
void item_kill_tree(struct radix_tree_root *root)
@@ -211,9 +220,19 @@
void tree_verify_min_height(struct radix_tree_root *root, int maxindex)
{
- assert(radix_tree_maxindex(root->height) >= maxindex);
- if (root->height > 1)
- assert(radix_tree_maxindex(root->height-1) < maxindex);
- else if (root->height == 1)
- assert(radix_tree_maxindex(root->height-1) <= maxindex);
+ unsigned shift;
+ struct radix_tree_node *node = root->rnode;
+ if (!radix_tree_is_internal_node(node)) {
+ assert(maxindex == 0);
+ return;
+ }
+
+ node = entry_to_node(node);
+ assert(maxindex <= node_maxindex(node));
+
+ shift = node->shift;
+ if (shift > 0)
+ assert(maxindex > shift_maxindex(shift - RADIX_TREE_MAP_SHIFT));
+ else
+ assert(maxindex > 0);
}
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 4e1d95f..e851313 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -8,8 +8,11 @@
};
struct item *item_create(unsigned long index);
-int __item_insert(struct radix_tree_root *root, struct item *item);
+int __item_insert(struct radix_tree_root *root, struct item *item,
+ unsigned order);
int item_insert(struct radix_tree_root *root, unsigned long index);
+int item_insert_order(struct radix_tree_root *root, unsigned long index,
+ unsigned order);
int item_delete(struct radix_tree_root *root, unsigned long index);
struct item *item_lookup(struct radix_tree_root *root, unsigned long index);
@@ -23,6 +26,7 @@
void item_kill_tree(struct radix_tree_root *root);
void tag_check(void);
+void multiorder_checks(void);
struct item *
item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);
@@ -35,6 +39,7 @@
extern int nr_allocated;
/* Normally private parts of lib/radix-tree.c */
-void *indirect_to_ptr(void *ptr);
+void radix_tree_dump(struct radix_tree_root *root);
int root_tag_get(struct radix_tree_root *root, unsigned int tag);
-unsigned long radix_tree_maxindex(unsigned int height);
+unsigned long node_maxindex(struct radix_tree_node *);
+unsigned long shift_maxindex(unsigned int shift);