summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2016-08-12arm64: hibernate: handle allocation failuresMark Rutland
In create_safe_exec_page(), we create a copy of the hibernate exit text, along with some page tables to map this via TTBR0. We then install the new tables in TTBR0. In swsusp_arch_resume() we call create_safe_exec_page() before trying a number of operations which may fail (e.g. copying the linear map page tables). If these fail, we bail out of swsusp_arch_resume() and return an error code, but leave TTBR0 as-is. Subsequently, the core hibernate code will call free_basic_memory_bitmaps(), which will free all of the memory allocations we made, including the page tables installed in TTBR0. Thus, we may have TTBR0 pointing at dangling freed memory for some period of time. If the hibernate attempt was triggered by a user requesting a hibernate test via the reboot syscall, we may return to userspace with the clobbered TTBR0 value. Avoid these issues by reorganising swsusp_arch_resume() such that we have no failure paths after create_safe_exec_page(). We also add a check that the zero page allocation succeeded, matching what we have for other allocations. Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: James Morse <james.morse@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # 4.7+ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-08-12arm64: hibernate: avoid potential TLB conflictMark Rutland
In create_safe_exec_page we install a set of global mappings in TTBR0, then subsequently invalidate TLBs. While TTBR0 points at the zero page, and the TLBs should be free of stale global entries, we may have stale ASID-tagged entries (e.g. from the EFI runtime services mappings) for the same VAs. Per the ARM ARM these ASID-tagged entries may conflict with newly-allocated global entries, and we must follow a Break-Before-Make approach to avoid issues resulting from this. This patch reworks create_safe_exec_page to invalidate TLBs while the zero page is still in place, ensuring that there are no potential conflicts when the new TTBR0 value is installed. As a single CPU is online while this code executes, we do not need to perform broadcast TLB maintenance, and can call local_flush_tlb_all(), which also subsumes some barriers. The remaining assembly is converted to use write_sysreg() and isb(). Other than this, we safely manipulate TTBRs in the hibernate dance. The code we install as part of the new TTBR0 mapping (the hibernated kernel's swsusp_arch_suspend_exit) installs a zero page into TTBR1, invalidates TLBs, then installs its preferred value. Upon being restored to the middle of swsusp_arch_suspend, the new image will call __cpu_suspend_exit, which will call cpu_uninstall_idmap, installing the zero page in TTBR0 and invalidating all TLB entries. Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # 4.7+ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-08-12arm64: Handle el1 synchronous instruction aborts cleanlyLaura Abbott
Executing from a non-executable area gives an ugly message: lkdtm: Performing direct entry EXEC_RODATA lkdtm: attempting ok execution at ffff0000084c0e08 lkdtm: attempting bad execution at ffff000008880700 Bad mode in Synchronous Abort handler detected on CPU2, code 0x8400000e -- IABT (current EL) CPU: 2 PID: 998 Comm: sh Not tainted 4.7.0-rc2+ #13 Hardware name: linux,dummy-virt (DT) task: ffff800077e35780 ti: ffff800077970000 task.ti: ffff800077970000 PC is at lkdtm_rodata_do_nothing+0x0/0x8 LR is at execute_location+0x74/0x88 The 'IABT (current EL)' indicates the error but it's a bit cryptic without knowledge of the ARM ARM. There is also no indication of the specific address which triggered the fault. The increase in kernel page permissions makes hitting this case more likely as well. Handling the case in the vectors gives a much more familiar looking error message: lkdtm: Performing direct entry EXEC_RODATA lkdtm: attempting ok execution at ffff0000084c0840 lkdtm: attempting bad execution at ffff000008880680 Unable to handle kernel paging request at virtual address ffff000008880680 pgd = ffff8000089b2000 [ffff000008880680] *pgd=00000000489b4003, *pud=0000000048904003, *pmd=0000000000000000 Internal error: Oops: 8400000e [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 997 Comm: sh Not tainted 4.7.0-rc1+ #24 Hardware name: linux,dummy-virt (DT) task: ffff800077f9f080 ti: ffff800008a1c000 task.ti: ffff800008a1c000 PC is at lkdtm_rodata_do_nothing+0x0/0x8 LR is at execute_location+0x74/0x88 Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-08-12Merge tag 'kvm-s390-master-4.8-1' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux KVM: s390: Fixes for 4.8 (via kvm/master) Here are two fixes found by fuzzing of the ioctl interface. Both cases can trigger a WARN_ON_ONCE from user space.
2016-08-12MIPS: KVM: Propagate kseg0/mapped tlb fault errorsJames Hogan
Propagate errors from kvm_mips_handle_kseg0_tlb_fault() and kvm_mips_handle_mapped_seg_tlb_fault(), usually triggering an internal error since they normally indicate the guest accessed bad physical memory or the commpage in an unexpected way. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Fixes: e685c689f3a8 ("KVM/MIPS32: Privileged instruction/target branch emulation.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12MIPS: KVM: Fix gfn range check in kseg0 tlb faultsJames Hogan
Two consecutive gfns are loaded into host TLB, so ensure the range check isn't off by one if guest_pmap_npages is odd. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12MIPS: KVM: Add missing gfn range checkJames Hogan
kvm_mips_handle_mapped_seg_tlb_fault() calculates the guest frame number based on the guest TLB EntryLo values, however it is not range checked to ensure it lies within the guest_pmap. If the physical memory the guest refers to is out of range then dump the guest TLB and emit an internal error. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12MIPS: KVM: Fix mapped fault broken commpage handlingJames Hogan
kvm_mips_handle_mapped_seg_tlb_fault() appears to map the guest page at virtual address 0 to PFN 0 if the guest has created its own mapping there. The intention is unclear, but it may have been an attempt to protect the zero page from being mapped to anything but the comm page in code paths you wouldn't expect from genuine commpage accesses (guest kernel mode cache instructions on that address, hitting trapping instructions when executing from that address with a coincidental TLB eviction during the KVM handling, and guest user mode accesses to that address). Fix this to check for mappings exactly at KVM_GUEST_COMMPAGE_ADDR (it may not be at address 0 since commit 42aa12e74e91 ("MIPS: KVM: Move commpage so 0x0 is unmapped")), and set the corresponding EntryLo to be interpreted as 0 (invalid). Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12KVM: Protect device ops->create and list_add with kvm->lockChristoffer Dall
KVM devices were manipulating list data structures without any form of synchronization, and some implementations of the create operations also suffered from a lack of synchronization. Now when we've split the xics create operation into create and init, we can hold the kvm->lock mutex while calling the create operation and when manipulating the devices list. The error path in the generic code gets slightly ugly because we have to take the mutex again and delete the device from the list, but holding the mutex during anon_inode_getfd or releasing/locking the mutex in the common non-error path seemed wrong. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12KVM: PPC: Move xics_debugfs_init out of createChristoffer Dall
As we are about to hold the kvm->lock during the create operation on KVM devices, we should move the call to xics_debugfs_init into its own function, since holding a mutex over extended amounts of time might not be a good idea. Introduce an init operation on the kvm_device_ops struct which cannot fail and call this, if configured, after the device has been created. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12KVM: s390: reset KVM_REQ_MMU_RELOAD if mapping the prefix failedJulius Niedworok
When triggering KVM_RUN without a user memory region being mapped (KVM_SET_USER_MEMORY_REGION) a validity intercept occurs. This could happen, if the user memory region was not mapped initially or if it was unmapped after the vcpu is initialized. The function kvm_s390_handle_requests checks for the KVM_REQ_MMU_RELOAD bit. The check function always clears this bit. If gmap_mprotect_notify returns an error code, the mapping failed, but the KVM_REQ_MMU_RELOAD was not set anymore. So the next time kvm_s390_handle_requests is called, the execution would fall trough the check for KVM_REQ_MMU_RELOAD. The bit needs to be resetted, if gmap_mprotect_notify returns an error code. Resetting the bit with kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu) fixes the bug. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Julius Niedworok <jniedwor@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-08-12KVM: s390: set the prefix initially properlyJulius Niedworok
When KVM_RUN is triggered on a VCPU without an initial reset, a validity intercept occurs. Setting the prefix will set the KVM_REQ_MMU_RELOAD bit initially, thus preventing the bug. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Julius Niedworok <jniedwor@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-08-12perf/x86/intel/uncore: Add enable_box for client MSR uncoreKan Liang
There are bug reports about miscounting uncore counters on some client machines like Sandybridge, Broadwell and Skylake. It is very likely to be observed on idle systems. This issue is caused by a hardware issue. PERF_GLOBAL_CTL could be cleared after Package C7, and nothing will be count. The related errata (HSD 158) could be found in: www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf This patch tries to work around this issue by re-enabling PERF_GLOBAL_CTL in ->enable_box(). The workaround does not cover all cases. It helps for new events after returning from C7. But it cannot prevent C7, it will still miscount if a counter is already active. There is no drawback in leaving it enabled, so it does not need disable_box() here. Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1470925874-59943-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-12perf/x86/intel/uncore: Fix uncore num_countersKan Liang
Some uncore boxes' num_counters value for Haswell server and Broadwell server are not correct (too large, off by one). This issue was found by comparing the code with the document. Although there is no bug report from users yet, accessing non-existent counters is dangerous and the behavior is undefined: it may cause miscounting or even crashes. This patch makes them consistent with the uncore document. Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1470925820-59847-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-12uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructionsDenys Vlasenko
Since instruction decoder now supports EVEX-encoded instructions, two fixes are needed to correctly handle them in uprobes. Extended bits for MODRM.rm field need to be sanitized just like we do it for VEX3, to avoid encoding wrong register for register-relative access. EVEX has _two_ extended bits: b and x. Theoretically, EVEX.x should be ignored by the CPU (since GPRs go only up to 15, not 31), but let's be paranoid here: proper encoding for register-relative access should have EVEX.x = 1. Secondly, we should fetch vex.vvvv for EVEX too. This is now super easy because instruction decoder populates vex_prefix.bytes[2] for all flavors of (e)vex encodings, even for VEX2. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Cc: <stable@vger.kernel.org> # v4.1+ Fixes: 8a764a875fe3 ("x86/asm/decoder: Create artificial 3rd byte for 2-byte VEX") Link: http://lkml.kernel.org/r/20160811154521.20469-1-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Arnd Bergmann: "A couple of bug fixes have come in for v4.8 so far. Since the first few were originally meant to go into -rc1 (but didn't get sent in time for travel reasons), the branch is unfortunately based on top of a commit in the middle of the merge window rather than -rc1. Content-wise we have: - a fix for the last remaining broken build in kernelci, getting mach-shmobile to build again with SMP disabled - a fix for a realview regression that broke real hardware but not the qemu model that everyone uses in practice (needed for v4.7 as well) - a merge conflict fix for Tegra that also broke v4.7 - two Kconfig fixes for arm64 build regressions - a couple of arm32 build warning fixes (all harmless) - fix the RTC on Exynos7 Espresso (which apparently never worked right)" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: Merge tag 'pxa-fixes-v4.8' of https://github.com/rjarzmik/linux into randconfig-4.8 arm64: Kconfig: select HISILICON_IRQ_MBIGEN only if PCI is selected arm64: Kconfig: select ALPINE_MSI only if PCI is selected ARM: dts: realview: Fix PBX-A9 cache description ARM: tegra: fix erroneous address in dts ARM: dts: add syscon compatible string for AP syscon ARM: dts: add syscon compatible string for CP syscon ARM: oxnas: select reset controller framework ARM: hide mach-*/ include for ARM_SINGLE_ARMV7M ARM: don't include removed directories Revert "ARM: aspeed: adapt defconfigs for new CONFIG_PRINTK_TIME" ARM: shmobile: don't call platform_can_secondary_boot on UP MAINTAINER: alpine: add a mailing list ARM: do away with final ARCH_REQUIRE_GPIOLIB arm64: dts: Fix RTC by providing rtc_src clock
2016-08-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio/vhost fixes and cleanups from Michael Tsirkin: "Misc fixes and cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio/s390: deprecate old transport virtio/s390: keep early_put_chars virtio_blk: Fix a slient kernel panic virtio-vsock: fix include guard typo vhost/vsock: fix vhost virtio_vsock_pkt use-after-free 9p/trans_virtio: use kvfree() for iov_iter_get_pages_alloc() virtio: fix error handling for debug builds virtio: fix memory leak in virtqueue_add()
2016-08-11arm64: Remove stack duplicating code from jprobesDavid A. Long
Because the arm64 calling standard allows stacked function arguments to be anywhere in the stack frame, do not attempt to duplicate the stack frame for jprobes handler functions. Documentation changes to describe this issue have been broken out into a separate patch in order to simultaneously address them in other architecture(s). Signed-off-by: David A. Long <dave.long@linaro.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-08-11x86/apic/x2apic, smp/hotplug: Don't use before alloc in x2apic_cluster_probe()Sebastian Andrzej Siewior
I made a mistake while converting the driver to the hotplug state machine and as a result x2apic_cluster_probe() was accessing cpus_in_cluster before allocating it. This patch fixes it by setting the cpumask after the allocation the memory succeeded. While at it, I marked two functions static which are only used within this file. Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6b2c28471de5 ("x86/x2apic: Convert to CPU hotplug state machine") Link: http://lkml.kernel.org/r/1470924515-9444-1-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/platform/uv: Skip UV runtime services mapping in the ↵Alex Thorlton
efi_runtime_disabled case This problem has actually been in the UV code for a while, but we didn't catch it until recently, because we had been relying on EFI_OLD_MEMMAP to allow our systems to boot for a period of time. We noticed the issue when trying to kexec a recent community kernel, where we hit this NULL pointer dereference in efi_sync_low_kernel_mappings(): [ 0.337515] BUG: unable to handle kernel NULL pointer dereference at 0000000000000880 [ 0.346276] IP: [<ffffffff8105df8d>] efi_sync_low_kernel_mappings+0x5d/0x1b0 The problem doesn't show up with EFI_OLD_MEMMAP because we skip the chunk of setup_efi_state() that sets the efi_loader_signature for the kexec'd kernel. When the kexec'd kernel boots, it won't set EFI_BOOT in setup_arch, so we completely avoid the bug. We always kexec with noefi on the command line, so this shouldn't be an issue, but since we're not actually checking for efi_runtime_disabled in uv_bios_init(), we end up trying to do EFI runtime callbacks when we shouldn't be. This patch just adds a check for efi_runtime_disabled in uv_bios_init() so that we don't map in uv_systab when runtime_disabled == true. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: <stable@vger.kernel.org> # v4.7 Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Travis <travis@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1470912120-22831-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/efi: Allocate a trampoline if needed in efi_free_boot_services()Andy Lutomirski
On my Dell XPS 13 9350 with firmware 1.4.4 and SGX on, if I boot Fedora 24's grub2-efi off a hard disk, my first 1MB of RAM looks like: efi: mem00: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000000fff] (0MB) efi: mem01: [Boot Data | | | | | | | | |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000027fff] (0MB) efi: mem02: [Loader Data | | | | | | | | |WB|WT|WC|UC] range=[0x0000000000028000-0x0000000000029fff] (0MB) efi: mem03: [Reserved | | | | | | | | |WB|WT|WC|UC] range=[0x000000000002a000-0x000000000002bfff] (0MB) efi: mem04: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000000002c000-0x000000000002cfff] (0MB) efi: mem05: [Loader Data | | | | | | | | |WB|WT|WC|UC] range=[0x000000000002d000-0x000000000002dfff] (0MB) efi: mem06: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x000000000002e000-0x0000000000057fff] (0MB) efi: mem07: [Reserved | | | | | | | | |WB|WT|WC|UC] range=[0x0000000000058000-0x0000000000058fff] (0MB) efi: mem08: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x0000000000059000-0x000000000009ffff] (0MB) My EBDA is at 0x2c000, which blocks off everything from 0x2c000 and up, and my trampoline is 0x6000 bytes (6 pages), so it doesn't fit in the loader data range at 0x28000. Without this patch, it panics due to a failure to allocate the trampoline. With this patch, it works: [ +0.001744] Base memory trampoline at [ffff880000001000] 1000 size 24576 Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <mario_limonciello@dell.com> Cc: Matt Fleming <mfleming@suse.de> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/998c77b3bf709f3dfed85cb30701ed1a5d8a438b.1470821230.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/boot: Rework reserve_real_mode() to allow multiple triesAndy Lutomirski
If reserve_real_mode() fails, panicing immediately means we're doomed. Make it safe to try more than once to allocate the trampoline: - Degrade a failure from panic() to pr_info(). (If we make it to setup_real_mode() without reserving the trampoline, we'll panic them.) - Factor out helpers so that platform code can supply a specific address to try. - Warn if reserve_real_mode() is called after we're done with the memblock allocator. If that were to happen, we would behave unpredictably. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <mario_limonciello@dell.com> Cc: Matt Fleming <mfleming@suse.de> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/876e383038f3e9971aa72fd20a4f5da05f9d193d.1470821230.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/boot: Defer setup_real_mode() to early_initcall timeAndy Lutomirski
There's no need to run setup_real_mode() as early as we run it. Defer it to the same early_initcall that sets up the page permissions for the real mode code. This should be a code size reduction. More importantly, it give us a longer window in which we can allocate the real mode trampoline. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <mario_limonciello@dell.com> Cc: Matt Fleming <mfleming@suse.de> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directlyAndy Lutomirski
The initialization process for trampoline_cr4_features and mmu_cr4_features was confusing. The intent is for mmu_cr4_features and *trampoline_cr4_features to stay in sync, but trampoline_cr4_features is NULL until setup_real_mode() runs. The old code synchronized *trampoline_cr4_features *twice*, once in setup_real_mode() and once in setup_arch(). It also initialized mmu_cr4_features in setup_real_mode(), which causes the actual value of mmu_cr4_features to potentially depend on when setup_real_mode() is called. With this patch, mmu_cr4_features is initialized directly in setup_arch(), and *trampoline_cr4_features is synchronized to mmu_cr4_features when the trampoline is set up. After this patch, it should be safe to defer setup_real_mode(). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <mario_limonciello@dell.com> Cc: Matt Fleming <mfleming@suse.de> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/boot: Run reserve_bios_regions() after we initialize the memory mapAndy Lutomirski
reserve_bios_regions() is a quirk that reserves memory that we might otherwise think is available. There's no need to run it so early, and running it before we have the memory map initialized with its non-quirky inputs makes it hard to make reserve_bios_regions() more intelligent. Move it right after we populate the memblock state. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <mario_limonciello@dell.com> Cc: Matt Fleming <mfleming@suse.de> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/irq: Do not substract irq_tlb_count from irq_call_countAaron Lu
Since commit: 52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR") the TLB remote shootdown is done through call function vector. That commit didn't take care of irq_tlb_count, which a later commit: fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts") ... tried to fix. The fix assumes every increase of irq_tlb_count has a corresponding increase of irq_call_count. So the irq_call_count is always bigger than irq_tlb_count and we could substract irq_tlb_count from irq_call_count. Unfortunately this is not true for the smp_call_function_single() case. The IPI is only sent if the target CPU's call_single_queue is empty when adding a csd into it in generic_exec_single. That means if two threads are both adding flush tlb csds to the same CPU's call_single_queue, only one IPI is sent. In other words, the irq_call_count is incremented by 1 but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be bigger than irq_call_count and the substract will produce a very large irq_call_count value due to overflow. Considering that: 1) it's not worth to send more IPIs for the sake of accurate counting of irq_call_count in generic_exec_single(); 2) it's not easy to tell if the call function interrupt is for TLB shootdown in __smp_call_function_single_interrupt(). Not to exclude TLB shootdown from call function count seems to be the simplest fix and this patch just does that. This bug was found by LKP's cyclic performance regression tracking recently with the vm-scalability test suite. I have bisected to commit: 3dec0ba0be6a ("mm/rmap: share the i_mmap_rwsem") This commit didn't do anything wrong but revealed the irq_call_count problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file concurrent with multiple threads. When remap_one is try_to_unmap_one(), then multiple threads could queue flush TLB to the same CPU but only one IPI will be sent. Since the commit was added in Linux v3.19, the counting problem only shows up from v3.19 onwards. Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/mm: Fix swap entry comment and macroDave Hansen
A recent patch changed the format of a swap PTE. The comment explaining the format of the swap PTE is wrong about the bits used for the swap type field. Amusingly, the ASCII art and the patch description are correct, but the comment itself is wrong. As I was looking at this, I also noticed that the SWP_OFFSET_FIRST_BIT has an off-by-one error. This does not really hurt anything. It just wasted a bit of space in the PTE, giving us 2^59 bytes of addressable space in our swapfiles instead of 2^60. But, it doesn't match with the comments, and it wastes a bit of space, so fix it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Fixes: 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to work around erratum") Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11x86/mm/kaslr: Fix -Wformat-security warningNicolas Iooss
debug_putstr() is used to output strings without using printf-like formatting but debug_putstr(v) is defined as early_printk(v) in arch/x86/lib/kaslr.c. This makes clang reports the following warning when building with -Wformat-security: arch/x86/lib/kaslr.c:57:15: warning: format string is not a string literal (potentially insecure) [-Wformat-security] debug_putstr(purpose); ^~~~~~~ Fix this by using "%s" in early_printk(). Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160806102039.27221-1-nicolas.iooss_linux@m4x.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11Merge tag 'pxa-fixes-v4.8' of https://github.com/rjarzmik/linux into ↵Arnd Bergmann
randconfig-4.8 This is the pxa changes for v4.8 cycle. This is a tiny fix couple to enable changes in includes in gpio API without breaking pxa boards. * tag 'pxa-fixes-v4.8' of https://github.com/rjarzmik/linux: ARM: pxa: add module.h for corgi symbol_get/symbol_put usage ARM: pxa: add module.h for spitz symbol_get/symbol_put usage
2016-08-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "8 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock rmap: fix compound check logic in page_remove_file_rmap mm, rmap: fix false positive VM_BUG() in page_add_file_rmap() mm/page_alloc.c: recalculate some of node threshold when on/offline memory mm/page_alloc.c: fix wrong initialization when sysctl_min_unmapped_ratio changes thp: move shmem_huge_enabled() outside of SYSFS ifdef revert "ARM: keystone: dts: add psci command definition" rapidio: dereferencing an error pointer
2016-08-10revert "ARM: keystone: dts: add psci command definition"Andrew Morton
Revert commit 51d5d12b8f3d ("ARM: keystone: dts: add psci command definition"), which was inadvertently added twice. Cc: Russell King - ARM Linux <linux@armlinux.org.uk> Cc: Vitaly Andrianov <vitalya@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10arm64: Kconfig: select HISILICON_IRQ_MBIGEN only if PCI is selectedSudeep Holla
Even when PCI is disabled, ARCH_HISI selects HISILICON_IRQ_MBIGEN triggerring the following config warning: warning: (ARM64 && HISILICON_IRQ_MBIGEN) selects ARM_GIC_V3_ITS which has unmet direct dependencies (PCI && PCI_MSI) This patch makes selection of HISILICON_IRQ_MBIGEN conditional on PCI. Cc: Ma Jun <majun258@huawei.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10arm64: Kconfig: select ALPINE_MSI only if PCI is selectedSudeep Holla
Even when PCI is disabled, ARCH_ALPINE selects ALPINE_MSI triggerring the following config warning: warning: (ARCH_ALPINE) selects ALPINE_MSI which has unmet direct dependencies (PCI) This patch makes selection of ALPINE_MSI conditional on PCI. Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: dts: realview: Fix PBX-A9 cache descriptionRobin Murphy
Clearly QEMU is very permissive in how its PL310 model may be set up, but the real hardware turns out to be far more particular about things actually being correct. Fix up the DT description so that the real thing actually boots: - The arm,data-latency and arm,tag-latency properties need 3 cells to be valid, otherwise we end up retaining the default 8-cycle latencies which leads pretty quickly to lockup. - The arm,dirty-latency property is only relevant to L210/L220, so get rid of it. - The cache geometry override also leads to lockup and/or general misbehaviour. Irritatingly, the manual doesn't state the actual PL310 configuration, but based on the boardfile code and poking registers from the Boot Monitor, it would seem to be 8 sets of 16KB ways. With that, we can successfully boot to enjoy the fun of mismatched FPUs... Cc: stable@vger.kernel.org Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: tegra: fix erroneous address in dtsRalf Ramsauer
c90bb7b enabled the high speed UARTs of the Jetson TK1. Due to a merge quirk, wrong addresses were introduced. Fix it and use the correct addresses. Thierry let me know, that there is another patch (b5896f67ab3c in linux-next) in preparation which removes all the '0,' prefixes of unit addresses on Tegra124 and is planned to go upstream in 4.8, so this patch will get reverted then. But for the moment, this patch is necessary to fix current misbehaviour. Fixes: c90bb7b9b9 ("ARM: tegra: Add high speed UARTs to Jetson TK1 device tree") Signed-off-by: Ralf Ramsauer <ralf@ramses-pyramidenbau.de> Acked-by: Thierry Reding <thierry.reding@gmail.com> Cc: stable@vger.kernel.org # v4.7 Cc: linux-tegra@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: dts: add syscon compatible string for AP sysconLinus Walleij
This syscon needs to be looked up by clocks, flash protection and other consumers. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: dts: add syscon compatible string for CP sysconLinus Walleij
This syscon needs to be looked up by flash protection, CLCD display output settings and other consumers. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: oxnas: select reset controller frameworkArnd Bergmann
For unknown reasons, we have to enable three symbols for a platform to use a reset controller driver, otherwise we get a Kconfig warning: warning: (MACH_OX810SE) selects RESET_OXNAS which has unmet direct dependencies (RESET_CONTROLLER) This selects the other two symbols for oxnas. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Neil Armstrong <narmstrong@baylibre.com>
2016-08-10ARM: hide mach-*/ include for ARM_SINGLE_ARMV7MArnd Bergmann
The machine specific header files are exported for traditional platforms, but not for the ones that use ARCH_MULTIPLATFORM, as they could conflict with one another. In case of ARM_SINGLE_ARMV7M, we end up also exporting them, but that appears to be a mistake, and we should treat it the same way as ARCH_MULTIPLATFORM here. 'make W=1' warns about this because it passes -Wmissing-includes to gcc and the directories are not actually present. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10ARM: don't include removed directoriesArnd Bergmann
Three platforms used to have header files in include/mach that are now all gone, but the removed directories are still being included, which leads to -Wmissing-include-dirs warnings. This removes the extra -I flags. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2016-08-10arm: oabi compat: add missing access checksDave Weinstein
Add access checks to sys_oabi_epoll_wait() and sys_oabi_semtimedop(). This fixes CVE-2016-3857, a local privilege escalation under CONFIG_OABI_COMPAT. Cc: stable@vger.kernel.org Reported-by: Chiachih Wu <wuchiachih@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Dave Weinstein <olorin@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10Merge tag 'metag-for-v4.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag Pull metag architecture fix from James Hogan: "A single fix for a boot crash since a commit in the merge window. Metag was unusual in calling show_mem() early, before setup_per_cpu_pageset(), which is no longer safe. It doesn't add much value to the log, so the fix just drops the call" * tag 'metag-for-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: metag: Drop show_mem() from mem_init()
2016-08-10x86/mm/pkeys: Fix compact mode by removing protection keys' XSAVE buffer ↵Dave Hansen
manipulation The Memory Protection Keys "rights register" (PKRU) is XSAVE-managed, and is saved/restored along with the FPU state. When kernel code accesses FPU regsisters, it does a delicate dance with preempt. Otherwise, the context switching code can get confused as to whether the most up-to-date state is in the registers themselves or in the XSAVE buffer. But, PKRU is not a normal FPU register. Using it does not generate the normal device-not-available (#NM) exceptions which means we can not manage it lazily, and the kernel completley disallows using lazy mode when it is enabled. The dance with preempt *only* occurs when managing the FPU lazily. Since we never manage PKRU lazily, we do not have to do the dance with preempt; we can access it directly. Doing it this way saves a ton of complicated code (and is faster too). Further, the XSAVES reenabling failed to patch a bit of code in fpu__xfeature_set_state() the checked for compacted buffers. That check caused fpu__xfeature_set_state() to silently refuse to work when the kernel is using compacted XSAVE buffers. This broke execute-only and future pkey_mprotect() support when using compact XSAVE buffers. But, removing fpu__xfeature_set_state() gets rid of this issue, in addition to the nice cleanup and speedup. This fixes the same thing as a fix that Sai posted: https://lkml.org/lkml/2016/7/25/637 The fix that he posted is a much more obviously correct, but I think we should just do this instead. Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-Cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/build: Reduce the W=1 warnings noise when compiling x86 syscall tablesValdis Kletnieks
Building an X86_64 kernel with W=1 throws a total of 9,948 lines of warnings of this form for both 32-bit and 64-bit syscall tables. Given that the entire rest of the build for my config only generates 8,375 lines of output, this is a big reduction in the warnings generated. The warnings follow this pattern: ./arch/x86/include/generated/asm/syscalls_32.h:885:21: warning: initialized field overwritten [-Woverride-init] __SYSCALL_I386(379, compat_sys_pwritev2, ) ^ arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386' #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, ^~~ ./arch/x86/include/generated/asm/syscalls_32.h:885:21: note: (near initialization for 'ia32_sys_call_table[379]') __SYSCALL_I386(379, compat_sys_pwritev2, ) ^ arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386' #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, Since we intentionally build the syscall tables this way, ignore that one warning in the two files. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/7464.1470021890@turing-police.cc.vt.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix kernel panic running RHEL kdump kernel on UV systemsMike Travis
The latest UV kernel support panics when RHEL7 kexec's the kdump kernel to make a dumpfile. This patch fixes the problem by turning off all UV support if NUMA is off. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.577755634@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix problem with UV4 BIOS providing incorrect PXM valuesMike Travis
There are some circumstances where the UV4 BIOS cannot provide the correct Proximity Node values to associate with specific Sockets and Physical Nodes. The decision was made to remove these values from BIOS and for the kernel to get these values from the standard ACPI tables. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix bug with iounmap() of the UV4 EFI System Table causing ↵Mike Travis
a crash Save the uv_systab::size field before doing the iounmap() of the struct pointer, to avoid a NULL dereference crash. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.250424783@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix problem with UV4 Socket IDs not being contiguousMike Travis
The UV4 Socket IDs are not guaranteed to equate to Node values which can cause the GAM (Global Addressable Memory) table lookups to fail. Fix this by using an independent index into the GAM table instead of the Socket ID to reference the base address. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/entry: Clarify the RF saving/restoring situation with SYSCALL/SYSRETBorislav Petkov
Clarify why exactly RF cannot be restored properly by SYSRET to avoid confusion. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160803171429.GA2590@nazgul.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/mm: Disable preemption during CR3 read+writeSebastian Andrzej Siewior
There's a subtle preemption race on UP kernels: Usually current->mm (and therefore mm->pgd) stays the same during the lifetime of a task so it does not matter if a task gets preempted during the read and write of the CR3. But then, there is this scenario on x86-UP: TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by: -> mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb() At this point current->mm is NULL but current->active_mm still points to the "old" mm. Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its own mm so CR3 has changed. Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's mm and so CR3 remains unchanged. Once taskA gets active it continues where it was interrupted and that means it writes its old CR3 value back. Everything is fine because userland won't need its memory anymore. Now the fun part: Let's preempt taskA one more time and get back to taskB. This time switch_mm() won't do a thing because oldmm (->active_mm) is the same as mm (as per context_switch()). So we remain with a bad CR3 / PGD and return to userland. The next thing that happens is handle_mm_fault() with an address for the execution of its code in userland. handle_mm_fault() realizes that it has a PTE with proper rights so it returns doing nothing. But the CPU looks at the wrong PGD and insists that something is wrong and faults again. And again. And one more time… This pagefault circle continues until the scheduler gets tired of it and puts another task on the CPU. It gets little difficult if the task is a RT task with a high priority. The system will either freeze or it gets fixed by the software watchdog thread which usually runs at RT-max prio. But waiting for the watchdog will increase the latency of the RT task which is no good. Fix this by disabling preemption across the critical code section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>