summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)Author
2017-06-24mm: larger stack guard gap, between vmasHugh Dickins
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()Laura Abbott
commit 861ce4a3244c21b0af64f880d5bfe5e6e2fb9e4a upstream. '__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc address with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash: [mm/usercopy] 517e1fbeb6: kernel BUG at arch/x86/mm/physaddr.c:78! Set '__vmalloc_start_set' appropriately for that case as well. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dc16ecf7fd1f ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid") Link: http://lkml.kernel.org/r/1494278596-30373-1-git-send-email-labbott@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21mm: Tighten x86 /dev/mem with zeroing readsKees Cook
commit a4866aa812518ed1a37d8ea0c881dc946409de94 upstream. Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is disallowed. However, on x86, the first 1MB was always allowed for BIOS and similar things, regardless of it actually being System RAM. It was possible for heap to end up getting allocated in low 1MB RAM, and then read by things like x86info or dd, which would trip hardened usercopy: usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes) This changes the x86 exception for the low 1MB by reading back zeros for System RAM areas instead of blindly allowing them. More work is needed to extend this to mmap, but currently mmap doesn't go through usercopy, so hardened usercopy won't Oops the kernel. Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com> Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08x86/mm/KASLR: Exclude EFI region from KASLR VA space randomizationBaoquan He
commit a46f60d76004965e5669dbf3fc21ef3bc3632eb4 upstream. Currently KASLR is enabled on three regions: the direct mapping of physical memory, vamlloc and vmemmap. However the EFI region is also mistakenly included for VA space randomization because of misusing EFI_VA_START macro and assuming EFI_VA_START < EFI_VA_END. (This breaks kexec and possibly other things that rely on stable addresses.) The EFI region is reserved for EFI runtime services virtual mapping which should not be included in KASLR ranges. In Documentation/x86/x86_64/mm.txt, we can see: ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space EFI uses the space from -4G to -64G thus EFI_VA_START > EFI_VA_END, Here EFI_VA_START = -4G, and EFI_VA_END = -64G. Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem. Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Thomas Garnier <thgarnie@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1490331592-31860-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=yAndrey Ryabinin
commit be3606ff739d1c1be36389f8737c577ad87e1f57 upstream. The kernel doesn't boot with both PROFILE_ANNOTATED_BRANCHES=y and KASAN=y options selected. With branch profiling enabled we end up calling ftrace_likely_update() before kasan_early_init(). ftrace_likely_update() is built with KASAN instrumentation, so calling it before kasan has been initialized leads to crash. Use DISABLE_BRANCH_PROFILING define to make sure that we don't call ftrace_likely_update() from early code before kasan_early_init(). Fixes: ef7f0d6a6ca8 ("x86_64: add KASan support") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Alexander Potapenko <glider@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: lkp@01.org Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/20170313163337.1704-1-aryabinin@virtuozzo.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-15x86, mm: fix gup_pte_range() vs DAX mappingsDan Williams
commit ef947b2529f918d9606533eb9c32b187ed6a5ede upstream. gup_pte_range() fails to check pte_allows_gup() before translating a DAX pte entry, pte_devmap(), to a page. This allows writes to read-only mappings, and bypasses the DAX cacheline dirty tracking due to missed 'mkwrite' faults. The gup_huge_pmd() path and the gup_huge_pud() path correctly check pte_allows_gup() before checking for _devmap() entries. Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Link: http://lkml.kernel.org/r/148804251312.36605.12665024794196605053.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Xiong Zhou <xzhou@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-14x86/mm/ptdump: Fix soft lockup in page table walkerAndrey Ryabinin
commit 146fbb766934dc003fcbf755b519acef683576bf upstream. CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow. In that case ptdump_walk_pgd_level_core() takes a lot of time to walk across all page tables and doing this without a rescheduling causes soft lockups: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1] ... Call Trace: ptdump_walk_pgd_level_core+0x40c/0x550 ptdump_walk_pgd_level_checkwx+0x17/0x20 mark_rodata_ro+0x13b/0x150 kernel_init+0x2f/0x120 ret_from_fork+0x2c/0x40 I guess that this issue might arise even without KASAN on huge machines with several terabytes of RAM. Stick cond_resched() in pgd loop to fix this. Reported-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Alexander Potapenko <glider@google.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-21x86/traps: Ignore high word of regs->cs in early_fixup_exception()Andy Lutomirski
On the 80486 DX, it seems that some exceptions may leave garbage in the high bits of CS. This causes sporadic failures in which early_fixup_exception() refuses to fix up an exception. As far as I can tell, this has been buggy for a long time, but the problem seems to have been exacerbated by commits: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines") This appears to have broken for as long as we've had early exception handling. [ Note to stable maintainers: This patch is needed all the way back to 3.4, but it will only apply to 4.6 and up, as it depends on commit: 0e861fbb5bda ("x86/head: Move early exception panic code into early_fixup_exception()") If you want to backport to kernels before 4.6, please don't backport the prerequisites (there was a big chain of them that rewrote a lot of the early exception machinery); instead, ask me and I can send you a one-liner that will apply. ] Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 4c5023a3fa2e ("x86-32: Handle exception table entries during early boot") Link: http://lkml.kernel.org/r/cb32c69920e58a1a58e7b5cad975038a69c0ce7d.1479609510.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28Merge tag 'drm-x86-pat-regression-fix' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull drm x86/pat regression fixes from Dave Airlie: "This is a standalone pull request for the fix for a regression introduced in -rc1 by a change to vm_insert_mixed to start using the PAT range tracking to validate page protections. With this fix in place, all the VRAM mappings for GPU drivers ended up at UC instead of WC. There are probably better ways to fix this long term, but nothing I'd considered for -fixes that wouldn't need more settling in time. So I've just created a new arch API that the drivers can reserve all their VRAM aperture ranges as WC" * tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux: drm/drivers: add support for using the arch wc mapping API. x86/io: add interface to reserve io memtype for a resource range. (v1.1)
2016-10-28kconfig.h: remove config_enabled() macroMasahiro Yamada
The use of config_enabled() is ambiguous. For config options, IS_ENABLED(), IS_REACHABLE(), etc. will make intention clearer. Sometimes config_enabled() has been used for non-config options because it is useful to check whether the given symbol is defined or not. I have been tackling on deprecating config_enabled(), and now is the time to finish this work. Some new users have appeared for v4.9-rc1, but it is trivial to replace them: - arch/x86/mm/kaslr.c replace config_enabled() with IS_ENABLED() because CONFIG_X86_ESPFIX64 and CONFIG_EFI are boolean. - include/asm-generic/export.h replace config_enabled() with __is_defined(). Then, config_enabled() can be removed now. Going forward, please use IS_ENABLED(), IS_REACHABLE(), etc. for config options, and __is_defined() for non-config symbols. Link: http://lkml.kernel.org/r/1476616078-32252-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Garnier <thgarnie@google.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-26x86/io: add interface to reserve io memtype for a resource range. (v1.1)Dave Airlie
A recent change to the mm code in: 87744ab3832b mm: fix cache mode tracking in vm_insert_mixed() started enforcing checking the memory type against the registered list for amixed pfn insertion mappings. It happens that the drm drivers for a number of gpus relied on this being broken. Currently the driver only inserted VRAM mappings into the tracking table when they came from the kernel, and userspace mappings never landed in the table. This led to a regression where all the mapping end up as UC instead of WC now. I've considered a number of solutions but since this needs to be fixed in fixes and not next, and some of the solutions were going to introduce overhead that hadn't been there before I didn't consider them viable at this stage. These mainly concerned hooking into the TTM io reserve APIs, but these API have a bunch of fast paths I didn't want to unwind to add this to. The solution I've decided on is to add a new API like the arch_phys_wc APIs (these would have worked but wc_del didn't take a range), and use them from the drivers to add a WC compatible mapping to the table for all VRAM on those GPUs. This means we can then create userspace mapping that won't get degraded to UC. v1.1: use CONFIG_X86_PAT + add some comments in io.h Cc: Toshi Kani <toshi.kani@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: x86@kernel.org Cc: mcgrof@suse.com Cc: Dan Williams <dan.j.williams@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-10-19Merge branch 'gup_flag-cleanups'Linus Torvalds
Merge the gup_flags cleanups from Lorenzo Stoakes: "This patch series adjusts functions in the get_user_pages* family such that desired FOLL_* flags are passed as an argument rather than implied by flags. The purpose of this change is to make the use of FOLL_FORCE explicit so it is easier to grep for and clearer to callers that this flag is being used. The use of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for the VMA whose pages we are reading from/writing to, which can result in surprising behaviour. The patch series came out of the discussion around commit 38e088546522 ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"), which addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE set but having been overridden by FOLL_FORCE. do_numa_page() was run on the assumption the page _must_ be one marked for NUMA node migration as an actual PROT_NONE page would have been dealt with prior to this code path, however FOLL_FORCE introduced a situation where this assumption did not hold. See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal" Additionally, there's a fix for an ancient bug related to FOLL_FORCE and FOLL_WRITE by me. [ This branch was rebased recently to add a few more acked-by's and reviewed-by's ] * gup_flag-cleanups: mm: replace access_process_vm() write parameter with gup_flags mm: replace access_remote_vm() write parameter with gup_flags mm: replace __access_remote_vm() write parameter with gup_flags mm: replace get_user_pages_remote() write/force parameters with gup_flags mm: replace get_user_pages() write/force parameters with gup_flags mm: replace get_vaddr_frames() write/force parameters with gup_flags mm: replace get_user_pages_locked() write/force parameters with gup_flags mm: replace get_user_pages_unlocked() write/force parameters with gup_flags mm: remove write/force parameters from __get_user_pages_unlocked() mm: remove write/force parameters from __get_user_pages_locked() mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19mm: replace get_user_pages() write/force parameters with gup_flagsLorenzo Stoakes
This removes the 'write' and 'force' from get_user_pages() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18mm: replace get_user_pages_unlocked() write/force parameters with gup_flagsLorenzo Stoakes
This removes the 'write' and 'force' use from get_user_pages_unlocked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-12Merge branch 'work.uaccess2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull uaccess.h prepwork from Al Viro: "Preparations to tree-wide switch to use of linux/uaccess.h (which, obviously, will allow to start unifying stuff for real). The last step there, ie PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ `git grep -l "$PATT"|grep -v ^include/linux/uaccess.h` is not taken here - I would prefer to do it once just before or just after -rc1. However, everything should be ready for it" * 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove a stray reference to asm/uaccess.h in docs sparc64: separate extable_64.h, switch elf_64.h to it score: separate extable.h, switch module.h to it mips: separate extable.h, switch module.h to it x86: separate extable.h, switch sections.h to it remove stray include of asm/uaccess.h from cacheflush.h mn10300: remove a bogus processor.h->uaccess.h include xtensa: split uaccess.h into C and asm sides bonding: quit messing with IOCTL kill __kernel_ds_p off mn10300: finish verify_area() off frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h exceptions: detritus removal
2016-10-10Merge branch 'mm-pkeys-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull protection keys syscall interface from Thomas Gleixner: "This is the final step of Protection Keys support which adds the syscalls so user space can actually allocate keys and protect memory areas with them. Details and usage examples can be found in the documentation. The mm side of this has been acked by Mel" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pkeys: Update documentation x86/mm/pkeys: Do not skip PKRU register if debug registers are not used x86/pkeys: Fix pkeys build breakage for some non-x86 arches x86/pkeys: Add self-tests x86/pkeys: Allow configuration of init_pkru x86/pkeys: Default to a restrictive init PKRU pkeys: Add details of system call use to Documentation/ generic syscalls: Wire up memory protection keys syscalls x86: Wire up protection keys system calls x86/pkeys: Allocation/free syscalls x86/pkeys: Make mprotect_key() mask off additional vm_flags mm: Implement new pkey_mprotect() system call x86/pkeys: Add fault handling for PF_PK page fault bit
2016-10-04Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Header file and a wrapper functions cleanup" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Migrate exception table users off module.h and onto extable.h x86: Clean up various simple wrapper functions
2016-10-03Merge branch 'x86-boot-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: "The changes in this cycle were: - Save e820 table RAM footprint on larger kernel configurations. (Denys Vlasenko) - pmem related fixes (Dan Williams) - theoretical e820 boundary condition fix (Wei Yang)" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation x86/e820: Use much less memory for e820/e820_saved, save up to 120k x86/e820: Prepare e280 code for switch to dynamic storage x86/e820: Mark some static functions __init x86/e820: Fix very large 'size' handling boundary condition
2016-10-03Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull low-level x86 updates from Ingo Molnar: "In this cycle this topic tree has become one of those 'super topics' that accumulated a lot of changes: - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on x86 - preceded by an array of changes. v4.8 saw preparatory changes in this area already - this is the rest of the work. Includes the thread stack caching performance optimization. (Andy Lutomirski) - switch_to() cleanups and all around enhancements. (Brian Gerst) - A large number of dumpstack infrastructure enhancements and an unwinder abstraction. The secret long term plan is safe(r) live patching plus maybe another attempt at debuginfo based unwinding - but all these current bits are standalone enhancements in a frame pointer based debug environment as well. (Josh Poimboeuf) - More __ro_after_init and const annotations. (Kees Cook) - Enable KASLR for the vmemmap memory region. (Thomas Garnier)" [ The virtually mapped stack changes are pretty fundamental, and not x86-specific per se, even if they are only used on x86 right now. ] * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits) x86/asm: Get rid of __read_cr4_safe() thread_info: Use unsigned long for flags x86/alternatives: Add stack frame dependency to alternative_call_2() x86/dumpstack: Fix show_stack() task pointer regression x86/dumpstack: Remove dump_trace() and related callbacks x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder oprofile/x86: Convert x86_backtrace() to use the new unwinder x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder perf/x86: Convert perf_callchain_kernel() to use the new unwinder x86/unwind: Add new unwind interface and implementations x86/dumpstack: Remove NULL task pointer convention fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK lib/syscall: Pin the task stack in collect_syscall() x86/process: Pin the target stack in get_wchan() x86/dumpstack: Pin the target stack when dumping it kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function sched/core: Add try_get_task_stack() and put_task_stack() x86/entry/64: Fix a minor comment rebase error iommu/amd: Don't put completion-wait semaphore on stack ...
2016-10-03Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Ingo Molnar: "The main changes are: - Persistent CPU/node numbering across CPU hotplug/unplug events. This is a pretty involved series of changes that first fetches all the information during bootup and then uses it for the various hotplug/unplug methods. (Gu Zheng, Dou Liyang) - IO-APIC hot-add/remove fixes and enhancements. (Rui Wang) - ... various fixes, cleanups and enhancements" * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) x86/apic: Fix silent & fatal merge conflict in __generic_processor_info() acpi: Fix broken error check in map_processor() acpi: Validate processor id when mapping the processor acpi: Provide mechanism to validate processors in the ACPI tables x86/acpi: Set persistent cpuid <-> nodeid mapping when booting x86/acpi: Enable MADT APIs to return disabled apicids x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping x86/acpi: Enable acpi to register all possible cpus at boot time x86/numa: Online memory-less nodes at boot time x86/apic: Get rid of apic_version[] array x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq() x86/ioapic: Ignore root bridges without a companion ACPI device x86/apic: Update comment about disabling processor focus x86/smpboot: Check APIC ID before setting up default routing x86/ioapic: Fix IOAPIC failing to request resource x86/ioapic: Fix lost IOAPIC resource after hot-removal and hotadd x86/ioapic: Fix setup_res() failing to get resource x86/ioapic: Support hot-removal of IOAPICs present during boot x86/ioapic: Change prototype of acpi_ioapic_add() x86/apic, ACPI: Fix incorrect assignment when handling apic/x2apic entries ...
2016-09-30Merge branch 'x86/urgent' into x86/asmThomas Gleixner
Get the cr4 fixes so we can apply the final cleanup
2016-09-28exceptions: detritus removalAl Viro
externs and defines for stuff that is never used Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-26Merge branch 'x86/urgent' into x86/apicThomas Gleixner
Bring in the upstream modifications so we can fixup the silent merge conflict which is introduced by this merge. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/numa: Online memory-less nodes at boot timeTang Chen
For now, x86 does not support memory-less node. A node without memory will not be onlined, and the cpus on it will be mapped to the other online nodes with memory in init_cpu_to_node(). The reason of doing this is to ensure each cpu has mapped to a node with memory, so that it will be able to allocate local memory for that cpu. But we don't have to do it in this way. In this series of patches, we are going to construct cpu <-> node mapping for all possible cpus at boot time, which is a persistent mapping. It means that the cpu will be mapped to the node which it belongs to, and will never be changed. If a node has only cpus but no memory, the cpus on it will be mapped to a memory-less node. And the memory-less node should be onlined. Allocate pgdats for all memory-less nodes and online them at boot time. Then build zonelists for these nodes. As a result, when cpus on these memory-less nodes try to allocate memory from local node, it will automatically fall back to the proper zones in the zonelists. Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-2-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/e820: Use much less memory for e820/e820_saved, save up to 120kDenys Vlasenko
The maximum size of e820 map array for EFI systems is defined as E820_X_MAX (E820MAX + 3 * MAX_NUMNODES). In x86_64 defconfig, this ends up with E820_X_MAX = 320, e820 and e820_saved are 6404 bytes each. With larger configs, for example Fedora kernels, E820_X_MAX = 3200, e820 and e820_saved are 64004 bytes each. Most of this space is wasted. Typical machines have some 20-30 e820 areas at most. After previous patch, e820 and e820_saved are pointers to e280 maps. Change them to initially point to maps which are __initdata. At the very end of kernel init, just before __init[data] sections are freed in free_initmem(), allocate smaller blocks, copy maps there, and change pointers. The late switch makes sure that all functions which can be used to change e820 maps are no longer accessible (they are all __init functions). Run-tested. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160918182125.21000-1-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-21x86/e820: Prepare e280 code for switch to dynamic storageDenys Vlasenko
This patch turns e820 and e820_saved into pointers to e820 tables, of the same size as before. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160917213927.1787-2-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20x86/mm/pat: Prevent hang during boot when mapping pagesMatt Fleming
There's a mixture of signed 32-bit and unsigned 32-bit and 64-bit data types used for keeping track of how many pages have been mapped. This leads to hangs during boot when mapping large numbers of pages (multiple terabytes, as reported by Waiman) because those values are interpreted as being negative. commit 742563777e8d ("x86/mm/pat: Avoid truncation when converting cpa->numpages to address") fixed one of those bugs, but there is another lurking in __change_page_attr_set_clr(). Additionally, the return value type for the populate_*() functions can return negative values when a large number of pages have been mapped, triggering the error paths even though no error occurred. Consistently use 64-bit types on 64-bit platforms when counting pages. Even in the signed case this gives us room for regions 8PiB (pebibytes) in size whilst still allowing the usual negative value error checking idiom. Reported-by: Waiman Long <waiman.long@hpe.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> CC: Theodore Ts'o <tytso@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Douglas Hatch <doug.hatch@hpe.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-19x86: Migrate exception table users off module.h and onto extable.hPaul Gortmaker
These files were only including module.h for exception table related functions. We've now separated that content out into its own file "extable.h" so now move over to that and avoid all the extra header content in module.h that we don't really need to compile these files. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/20160919210418.30243-1-paul.gortmaker@windriver.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-15Merge branch 'linus' into x86/asm, to pick up recent fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-13x86: Clean up various simple wrapper functionsMasahiro Yamada
Remove unneeded variables and assignments. While we are here, let's fix the following as well: - Remove unnecessary parentheses - Remove unnecessary unsigned-suffix 'U' from constant values - Reword the comment in set_apic_id() (suggested by Thomas Gleixner) Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Borislav Petkov <bp@suse.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Travis <travis@sgi.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steffen Persvold <sp@numascale.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Link: http://lkml.kernel.org/r/1473573502-27954-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-10mm: fix cache mode of dax pmd mappingsDan Williams
track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as uncacheable rendering them impractical for application usage. DAX-pte mappings are cached and the goal of establishing DAX-pmd mappings is to attain more performance, not dramatically less (3 orders of magnitude). track_pfn_insert() relies on a previous call to reserve_memtype() to establish the expected page_cache_mode for the range. While memremap() arranges for reserve_memtype() to be called, devm_memremap_pages() does not. So, teach track_pfn_insert() and untrack_pfn() how to handle tracking without a vma, and arrange for devm_memremap_pages() to establish the write-back-cache reservation in the memtype tree. Cc: <stable@vger.kernel.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: Kai Zhang <kai.ka.zhang@oracle.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09x86/pkeys: Allow configuration of init_pkruDave Hansen
As discussed in the previous patch, there is a reliability benefit to allowing an init value for the Protection Keys Rights User register (PKRU) which differs from what the XSAVE hardware provides. But, having PKRU be 0 (its init value) provides some nonzero amount of optimization potential to the hardware. It can, for instance, skip writes to the XSAVE buffer when it knows that PKRU is in its init state. The cost of losing this optimization is approximately 100 cycles per context switch for a workload which lightly using XSAVE state (something not using AVX much). The overhead comes from a combinaation of actually manipulating PKRU and the overhead of pullin in an extra cacheline. This overhead is not huge, but it's also not something that I think we should unconditionally inflict on everyone. So, make it configurable both at boot-time and from debugfs. Changes to the debugfs value affect all processes created after the write to debugfs. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: mgorman@techsingularity.net Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163023.407672D2@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09x86/pkeys: Default to a restrictive init PKRUDave Hansen
PKRU is the register that lets you disallow writes or all access to a given protection key. The XSAVE hardware defines an "init state" of 0 for PKRU: its most permissive state, allowing access/writes to everything. Since we start off all new processes with the init state, we start all processes off with the most permissive possible PKRU. This is unfortunate. If a thread is clone()'d [1] before a program has time to set PKRU to a restrictive value, that thread will be able to write to all data, no matter what pkey is set on it. This weakens any integrity guarantees that we want pkeys to provide. To fix this, we define a very restrictive PKRU to override the XSAVE-provided value when we create a new FPU context. We choose a value that only allows access to pkey 0, which is as restrictive as we can practically make it. This does not cause any practical problems with applications using protection keys because we require them to specify initial permissions for each key when it is allocated, which override the restrictive default. In the end, this ensures that threads which do not know how to manage their own pkey rights can not do damage to data which is pkey-protected. I would have thought this was a pretty contrived scenario, except that I heard a bug report from an MPX user who was creating threads in some very early code before main(). It may be crazy, but folks evidently _do_ it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: mgorman@techsingularity.net Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09x86/pkeys: Allocation/free syscallsDave Hansen
This patch adds two new system calls: int pkey_alloc(unsigned long flags, unsigned long init_access_rights) int pkey_free(int pkey); These implement an "allocator" for the protection keys themselves, which can be thought of as analogous to the allocator that the kernel has for file descriptors. The kernel tracks which numbers are in use, and only allows operations on keys that are valid. A key which was not obtained by pkey_alloc() may not, for instance, be passed to pkey_mprotect(). These system calls are also very important given the kernel's use of pkeys to implement execute-only support. These help ensure that userspace can never assume that it has control of a key unless it first asks the kernel. The kernel does not promise to preserve PKRU (right register) contents except for allocated pkeys. The 'init_access_rights' argument to pkey_alloc() specifies the rights that will be established for the returned pkey. For instance: pkey = pkey_alloc(flags, PKEY_DENY_WRITE); will allocate 'pkey', but also sets the bits in PKRU[1] such that writing to 'pkey' is already denied. The kernel does not prevent pkey_free() from successfully freeing in-use pkeys (those still assigned to a memory range by pkey_mprotect()). It would be expensive to implement the checks for this, so we instead say, "Just don't do it" since sane software will never do it anyway. Any piece of userspace calling pkey_alloc() needs to be prepared for it to fail. Why? pkey_alloc() returns the same error code (ENOSPC) when there are no pkeys and when pkeys are unsupported. They can be unsupported for a whole host of reasons, so apps must be prepared for this. Also, libraries or LD_PRELOADs might steal keys before an application gets access to them. This allocation mechanism could be implemented in userspace. Even if we did it in userspace, we would still need additional user/kernel interfaces to tell userspace which keys are being used by the kernel internally (such as for execute-only mappings). Having the kernel provide this facility completely removes the need for these additional interfaces, or having an implementation of this in userspace at all. Note that we have to make changes to all of the architectures that do not use mman-common.h because we use the new PKEY_DENY_ACCESS/WRITE macros in arch-independent code. 1. PKRU is the Protection Key Rights User register. It is a usermode-accessible register that controls whether writes and/or access to each individual pkey is allowed or denied. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09x86/pkeys: Add fault handling for PF_PK page fault bitDave Hansen
PF_PK means that a memory access violated the protection key access restrictions. It is unconditionally an access_error() because the permissions set on the VMA don't matter (the PKRU value overrides it), and we never "resolve" PK faults (like how a COW can "resolve write fault). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163010.DD1FE1ED@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-08x86/mm: Improve stack-overflow #PF handlingAndy Lutomirski
If we get a page fault indicating kernel stack overflow, invoke handle_stack_overflow(). To prevent us from overflowing the stack again while handling the overflow (because we are likely to have very little stack space left), call handle_stack_overflow() on the double-fault stack. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/6d6cf96b3fb9b4c9aa303817e1dc4de0c7c36487.1472603235.git.luto@kernel.org [ Minor edit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08Merge branch 'x86/mm' into x86/asm, to unify the two branches for simplicityIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-27treewide: replace config_enabled() with IS_ENABLED() (2nd round)Masahiro Yamada
Commit 97f2645f358b ("tree-wide: replace config_enabled() with IS_ENABLED()") mostly killed config_enabled(), but some new users have appeared for v4.8-rc1. They are all used for a boolean option, so can be replaced with IS_ENABLED() safely. Link: http://lkml.kernel.org/r/1471970749-24867-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-24x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)Andy Lutomirski
This allows x86_64 kernels to enable vmapped stacks by setting HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y high level Kconfig option. There are a couple of interesting bits: First, x86 lazily faults in top-level paging entries for the vmalloc area. This won't work if we get a page fault while trying to access the stack: the CPU will promote it to a double-fault and we'll die. To avoid this problem, probe the new stack when switching stacks and forcibly populate the pgd entry for the stack when switching mms. Second, once we have guard pages around the stack, we'll want to detect and handle stack overflow. I didn't enable it on x86_32. We'd need to rework the double-fault code a bit and I'm concerned about running out of vmalloc virtual addresses under some workloads. This patch, by itself, will behave somewhat erratically when the stack overflows while RSP is still more than a few tens of bytes above the bottom of the stack. Specifically, we'll get #PF and make it to no_context and them oops without reliably triggering a double-fault, and no_context doesn't know about stack overflows. The next patch will improve that case. Thank you to Nadav and Brian for helping me pay enough attention to the SDM to hopefully get this right. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/c88f3e2920b18e6cc621d772a04a62c06869037e.1470907718.git.luto@kernel.org [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-15x86/mm/numa: Open code function early_get_boot_cpu_id()Baoquan He
Previously early_acpi_boot_init() was called in early_get_boot_cpu_id() to get the value for boot_cpu_physical_apicid. Now early_acpi_boot_init() has been taken out and moved to setup_arch(), the name of early_get_boot_cpu_id() doesn't match its implementation anymore, and only the getting boot-time SMP configuration code was left. So in this patch we open code it. Also move the smp_found_config check into default_get_smp_config to simplify code, because both early_get_smp_config() and get_smp_config() call x86_init.mpparse.get_smp_config(). Also remove the redundent CONFIG_X86_MPPARSE #ifdef check when we call early_get_smp_config(). Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-acpi@vger.kernel.org Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1470985033-22493-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-12Merge tag 'pm-4.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "Two hibernation fixes allowing it to work with the recently added randomization of the kernel identity mapping base on x86-64 and one cpufreq driver regression fix. Specifics: - Fix the x86 identity mapping creation helpers to avoid the assumption that the base address of the mapping will always be aligned at the PGD level, as it may be aligned at the PUD level if address space randomization is enabled (Rafael Wysocki). - Fix the hibernation core to avoid executing tracing functions before restoring the processor state completely during resume (Thomas Garnier). - Fix a recently introduced regression in the powernv cpufreq driver that causes it to crash due to an out-of-bounds array access (Akshay Adiga)" * tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / hibernate: Restore processor state before using per-CPU variables x86/power/64: Always create temporary identity mapping correctly cpufreq: powernv: Fix crash in gpstate_timer_handler()
2016-08-12Merge branches 'pm-sleep' and 'pm-cpufreq'Rafael J. Wysocki
* pm-sleep: PM / hibernate: Restore processor state before using per-CPU variables x86/power/64: Always create temporary identity mapping correctly * pm-cpufreq: cpufreq: powernv: Fix crash in gpstate_timer_handler()
2016-08-10x86/mm/64: Enable KASLR for vmemmap memory regionThomas Garnier
Add vmemmap in the list of randomized memory regions. The vmemmap region holds a representation of the physical memory (through a struct page array). An attacker could use this region to disclose the kernel memory layout (walking the page linked list). Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-hardening@lists.openwall.com Link: http://lkml.kernel.org/r/1469635196-122447-1-git-send-email-thgarnie@google.com [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/mm/KASLR: Increase BRK pages for KASLR memory randomizationThomas Garnier
Default implementation expects 6 pages maximum are needed for low page allocations. If KASLR memory randomization is enabled, the worse case of e820 layout would require 12 pages (no large pages). It is due to the PUD level randomization and the variable e820 memory layout. This bug was found while doing extensive testing of KASLR memory randomization on different type of hardware. Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: kernel-hardening@lists.openwall.com Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions") Link: http://lkml.kernel.org/r/1470762665-88032-2-git-send-email-thgarnie@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomizationThomas Garnier
Initialize KASLR memory randomization after max_pfn is initialized. Also ensure the size is rounded up. It could create problems on machines with more than 1Tb of memory on certain random addresses. Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: kernel-hardening@lists.openwall.com Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions") Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-08x86/power/64: Always create temporary identity mapping correctlyRafael J. Wysocki
The low-level resume-from-hibernation code on x86-64 uses kernel_ident_mapping_init() to create the temoprary identity mapping, but that function assumes that the offset between kernel virtual addresses and physical addresses is aligned on the PGD level. However, with a randomized identity mapping base, it may be aligned on the PUD level and if that happens, the temporary identity mapping created by set_up_temporary_mappings() will not reflect the actual kernel identity mapping and the image restoration will fail as a result (leading to a kernel panic most of the time). To fix this problem, rework kernel_ident_mapping_init() to support unaligned offsets between KVA and PA up to the PMD level and make set_up_temporary_mappings() use it as approprtiate. Reported-and-tested-by: Thomas Garnier <thgarnie@google.com> Reported-by: Borislav Petkov <bp@suse.de> Suggested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Yinghai Lu <yinghai@kernel.org>
2016-08-02treewide: replace obsolete _refok by __refFabian Frederick
There was only one use of __initdata_refok and __exit_refok __init_refok was used 46 times against 82 for __ref. Those definitions are obsolete since commit 312b1485fb50 ("Introduce new section reference annotations tags: __ref, __refdata, __refconst") This patch removes the following compatibility definitions and replaces them treewide. /* compatibility defines */ #define __init_refok __ref #define __initdata_refok __refdata #define __exit_refok __ref I can also provide separate patches if necessary. (One patch per tree and check in 1 month or 2 to remove old definitions) [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-01Merge branch 'x86-headers-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 header cleanups from Ingo Molnar: "This tree is a cleanup of the x86 tree reducing spurious uses of module.h - which should improve build performance a bit" * 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads x86/apic: Remove duplicated include from probe_64.c x86/ce4100: Remove duplicated include from ce4100.c x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t x86/platform: Delete extraneous MODULE_* tags fromm ts5500 x86: Audit and remove any remaining unnecessary uses of module.h x86/kvm: Audit and remove any unnecessary uses of module.h x86/xen: Audit and remove any unnecessary uses of module.h x86/platform: Audit and remove any unnecessary uses of module.h x86/lib: Audit and remove any unnecessary uses of module.h x86/kernel: Audit and remove any unnecessary uses of module.h x86/mm: Audit and remove any unnecessary uses of module.h x86: Don't use module.h just for AUTHOR / LICENSE tags
2016-07-30Merge branch 'x86-microcode-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode updates from Thomas Gleixner: - more work to make the microcode loader robust - a fix for the micro code load precedence - fixes for initrd loading with randomized memory - less printk noise on SMP machines * 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm, x86/microcode: Add __PAGE_OFFSET_BASE define on 32-bit x86/microcode/intel: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y x86/microcode: Remove unused symbol exports x86/microcode/intel: Do not issue microcode updates messages on each CPU Documentation/microcode: Document some aspects for more clarity x86/microcode/AMD: Make amd_ucode_patch[] static x86/microcode/intel: Unexport save_mc_for_early() x86/microcode/intel: Rename load_microcode_early() to find_microcode_patch() x86/microcode: Propagate save_microcode_in_initrd() retval x86/microcode: Get rid of find_cpio_data()'s dummy offset arg lib/cpio: Make find_cpio_data()'s offset arg optional x86/microcode: Fix suspend to RAM with builtin microcode x86/microcode: Fix loading precedence
2016-07-27Merge branch 'linus' into x86/microcode, to pick up merge window changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>