summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2017-10-18KVM: nVMX: update last_nonleaf_level when initializing nested EPTLadi Prosek
commit fd19d3b45164466a4adce7cbff448ba9189e1427 upstream. The function updates context->root_level but didn't call update_last_nonleaf_level so the previous and potentially wrong value was used for page walks. For example, a zero value of last_nonleaf_level would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's walk_addr_generic function (CVE-2017-12188). Fixes: 155a97a3d7c78b46cef6f1a973c831bc5a4f82bb Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exitHaozhong Zhang
commit 8eb3f87d903168bdbd1222776a6b1e281f50513e upstream. When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the guest CR4. Before this CR4 loading, the guest CR4 refers to L2 CR4. Because these two CR4's are in different levels of guest, we should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which is used to handle guest writes to its CR4, checks the guest change to CR4 and may fail if the change is invalid. The failure may cause trouble. Consider we start a L1 guest with non-zero L1 PCID in use, (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0) and a L2 guest with L2 PCID disabled, (i.e. L2 CR4.PCIDE == 0) and following events may happen: 1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4 into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e. vcpu->arch.cr4) is left to the value of L2 CR4. 2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit, kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID, because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1 CR3.PCID != 0, L0 KVM will inject GP to L1 guest. Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1") Cc: qemu-stable@nongnu.org Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18KVM: MMU: always terminate page walks at level 1Ladi Prosek
commit 829ee279aed43faa5cb1e4d65c0cad52f2426c53 upstream. is_last_gpte() is not equivalent to the pseudo-code given in commit 6bb69c9b69c31 ("KVM: MMU: simplify last_pte_bitmap") because an incorrect value of last_nonleaf_level may override the result even if level == 1. It is critical for is_last_gpte() to return true on level == 1 to terminate page walks. Otherwise memory corruption may occur as level is used as an index to various data structures throughout the page walking code. Even though the actual bug would be wherever the MMU is initialized (as in the previous patch), be defensive and ensure here that is_last_gpte() returns the correct value. This patch is also enough to fix CVE-2017-12188. Fixes: 6bb69c9b69c315200ddc2bc79aee14c0184cf5b2 Cc: Andy Honig <ahonig@google.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> [Panic if walk_addr_generic gets an incorrect level; this is a serious bug and it's not worth a WARN_ON where the recovery path might hide further exploitable issues; suggested by Andrew Honig. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12KVM: x86: fix singlestepping over syscallPaolo Bonzini
commit c8401dda2f0a00cd25c0af6a95ed50e478d25de4 upstream. TF is handled a bit differently for syscall and sysret, compared to the other instructions: TF is checked after the instruction completes, so that the OS can disable #DB at a syscall by adding TF to FMASK. When the sysret is executed the #DB is taken "as if" the syscall insn just completed. KVM emulates syscall so that it can trap 32-bit syscall on Intel processors. Fix the behavior, otherwise you could get #DB on a user stack which is not nice. This does not affect Linux guests, as they use an IST or task gate for #DB. This fixes CVE-2017-7518. Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [bwh: Backported to 4.9: - kvm_vcpu_check_singlestep() sets some flags differently - Drop changes to kvm_skip_emulated_instruction()] Cc: Ben Hutchings <benh@debian.org> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: use cmpxchg64Paolo Bonzini
commit c0a1666bcb2a33e84187a15eabdcd54056be9a97 upstream. This fixes a compilation failure on 32-bit systems. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interruptHaozhong Zhang
commit 5753743fa5108b8f98bd61e40dc63f641b26c768 upstream. WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc)) in kvm_vcpu_trigger_posted_interrupt() intends to detect the violation of invariant that VT-d PI notification event is not suppressed when vcpu is in the guest mode. Because the two checks for the target vcpu mode and the target suppress field cannot be performed atomically, the target vcpu mode may change in between. If that does happen, WARN_ON_ONCE() here may raise false alarms. As the previous patch fixed the real invariant breaker, remove this WARN_ON_ONCE() to avoid false alarms, and document the allowed cases instead. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: do not change SN bit in vmx_update_pi_irte()Haozhong Zhang
commit dc91f2eb1a4021eb6705c15e474942f84ab9b211 upstream. In kvm_vcpu_trigger_posted_interrupt() and pi_pre_block(), KVM assumes that PI notification events should not be suppressed when the target vCPU is not blocked. vmx_update_pi_irte() sets the SN field before changing an interrupt from posting to remapping, but it does not check the vCPU mode. Therefore, the change of SN field may break above the assumption. Besides, I don't see reasons to suppress notification events here, so remove the changes of SN field to avoid race condition. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05kvm: nVMX: Don't allow L2 to access the hardware CR8Jim Mattson
commit 51aa68e7d57e3217192d88ce90fd5b8ef29ec94f upstream. If L1 does not specify the "use TPR shadow" VM-execution control in vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store exiting" VM-execution controls in vmcs02. Failure to do so will give the L2 VM unrestricted read/write access to the hardware CR8. This fixes CVE-2017-12154. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: Do not BUG() on out-of-bounds guest IRQJan H. Schönherr
commit 3a8b0677fc6180a467e26cc32ce6b0c09a32f9bb upstream. The value of the guest_irq argument to vmx_update_pi_irte() is ultimately coming from a KVM_IRQFD API call. Do not BUG() in vmx_update_pi_irte() if the value is out-of bounds. (Especially, since KVM as a whole seems to hang after that.) Instead, print a message only once if we find that we don't have a route for a certain IRQ (which can be out-of-bounds or within the array). This fixes CVE-2017-1000252. Fixes: efc644048ecde54 ("KVM: x86: Update IRTE for posted-interrupts") Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: simplify and fix vmx_vcpu_pi_loadPaolo Bonzini
commit 31afb2ea2b10a7d17ce3db4cdb0a12b63b2fe08a upstream. The simplify part: do not touch pi_desc.nv, we can set it when the VCPU is first created. Likewise, pi_desc.sn is only handled by vmx_vcpu_pi_load, do not touch it in __pi_post_block. The fix part: do not check kvm_arch_has_assigned_device, instead check the SN bit to figure out whether vmx_vcpu_pi_put ran before. This matches what the previous patch did in pi_post_block. Cc: Huangweidong <weidong.huang@huawei.com> Cc: Gonglei <arei.gonglei@huawei.com> Cc: wangxin <wangxinxin.wang@huawei.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Longpeng (Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: avoid double list add with VT-d posted interruptsPaolo Bonzini
commit 8b306e2f3c41939ea528e6174c88cfbfff893ce1 upstream. In some cases, for example involving hot-unplug of assigned devices, pi_post_block can forget to remove the vCPU from the blocked_vcpu_list. When this happens, the next call to pi_pre_block corrupts the list. Fix this in two ways. First, check vcpu->pre_pcpu in pi_pre_block and WARN instead of adding the element twice in the list. Second, always do the list removal in pi_post_block if vcpu->pre_pcpu is set (not -1). The new code keeps interrupts disabled for the whole duration of pi_pre_block/pi_post_block. This is not strictly necessary, but easier to follow. For the same reason, PI.ON is checked only after the cmpxchg, and to handle it we just call the post-block code. This removes duplication of the list removal code. Cc: Huangweidong <weidong.huang@huawei.com> Cc: Gonglei <arei.gonglei@huawei.com> Cc: wangxin <wangxinxin.wang@huawei.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Longpeng (Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: extract __pi_post_blockPaolo Bonzini
commit cd39e1176d320157831ce030b4c869bd2d5eb142 upstream. Simple code movement patch, preparing for the next one. Cc: Huangweidong <weidong.huang@huawei.com> Cc: Gonglei <arei.gonglei@huawei.com> Cc: wangxin <wangxinxin.wang@huawei.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Longpeng (Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30KVM: x86: block guest protection keys unless the host has them enabledPaolo Bonzini
commit c469268cd523245cc58255f6696e0c295485cb0b upstream. If the host has protection keys disabled, we cannot read and write the guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0. Block the PKU cpuid bit in that case. This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1. Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21kvm: vmx: allow host to access guest MSR_IA32_BNDCFGSHaozhong Zhang
commit 691bd4340bef49cf7e5855d06cf24444b5bf2d85 upstream. It's easier for host applications, such as QEMU, if they can always access guest MSR_IA32_BNDCFGS in VMCS, even though MPX is disabled in guest cpuid. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21kvm: vmx: Check value written to IA32_BNDCFGSJim Mattson
commit 4531662d1abf6c1f0e5c2b86ddb60e61509786c8 upstream. Bits 11:2 must be zero and the linear addess in bits 63:12 must be canonical. Otherwise, WRMSR(BNDCFGS) should raise #GP. Fixes: 0dd376e709975779 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save") Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21kvm: x86: Guest BNDCFGS requires guest MPX supportJim Mattson
commit 4439af9f911ae0243ffe4e2dfc12bace49605d8b upstream. The BNDCFGS MSR should only be exposed to the guest if the guest supports MPX. (cf. the TSC_AUX MSR and RDTSCP.) Fixes: 0dd376e709975779 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save") Change-Id: I3ad7c01bda616715137ceac878f3fa7e66b6b387 Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21kvm: vmx: Do not disable intercepts for BNDCFGSJim Mattson
commit a8b6fda38f80e75afa3b125c9e7f2550b579454b upstream. The MSR permission bitmaps are shared by all VMs. However, some VMs may not be configured to support MPX, even when the host does. If the host supports VMX and the guest does not, we should intercept accesses to the BNDCFGS MSR, so that we can synthesize a #GP fault. Furthermore, if the host does not support MPX and the "ignore_msrs" kvm kernel parameter is set, then we should intercept accesses to the BNDCFGS MSR, so that we can skip over the rdmsr/wrmsr without raising a #GP fault. Fixes: da8999d31818fdc8 ("KVM: x86: Intel MPX vmx and msr handle") Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05KVM: nVMX: Fix exception injectionWanpeng Li
commit d4912215d1031e4fb3d1038d2e1857218dba0d0a upstream. WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel] CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G OE 4.12.0-rc3+ #23 RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel] Call Trace: ? kvm_check_async_pf_completion+0xef/0x120 [kvm] ? rcu_read_lock_sched_held+0x79/0x80 vmx_queue_exception+0x104/0x160 [kvm_intel] ? vmx_queue_exception+0x104/0x160 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm] ? kvm_arch_vcpu_load+0x47/0x240 [kvm] ? kvm_arch_vcpu_load+0x62/0x240 [kvm] kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? __fget+0xf3/0x210 do_vfs_ioctl+0xa4/0x700 ? __fget+0x114/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 This is triggered occasionally by running both win7 and win2016 in L2, in addition, EPT is disabled on both L1 and L2. It can't be reproduced easily. Commit 0b6ac343fc (KVM: nVMX: Correct handling of exception injection) mentioned that "KVM wants to inject page-faults which it got to the guest. This function assumes it is called with the exit reason in vmcs02 being a #PF exception". Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to L2) allows to check all exceptions for intercept during delivery to L2. However, there is no guarantee the exit reason is exception currently, when there is an external interrupt occurred on host, maybe a time interrupt for host which should not be injected to guest, and somewhere queues an exception, then the function nested_vmx_check_exception() will be called and the vmexit emulation codes will try to emulate the "Acknowledge interrupt on exit" behavior, the warning is triggered. Reusing the exit reason from the L2->L0 vmexit is wrong in this case, the reason must always be EXCEPTION_NMI when injecting an exception into L1 as a nested vmexit. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Fixes: e011c663b9c7 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05KVM: x86: zero base3 of unusable segmentsRadim Krčmář
commit f0367ee1d64d27fa08be2407df5c125442e885e3 upstream. Static checker noticed that base3 could be used uninitialized if the segment was not present (useable). Random stack values probably would not pass VMCS entry checks. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 1aa366163b8b ("KVM: x86 emulator: consolidate segment accessors") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()Radim Krčmář
commit 34b0dadbdf698f9b277a31b2747b625b9a75ea1f upstream. Static analysis noticed that pmu->nr_arch_gp_counters can be 32 (INTEL_PMC_MAX_GENERIC) and therefore cannot be used to shift 'int'. I didn't add BUILD_BUG_ON for it as we have a better checker. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 25462f7f5295 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05KVM: x86: fix emulation of RSM and IRET instructionsLadi Prosek
commit 6ed071f051e12cf7baa1b69d3becb8f232fdfb7b upstream. On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm on hflags is reverted later on in x86_emulate_instruction where hflags are overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu. Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after an instruction is emulated, this commit deletes emul_flags altogether and makes the emulator access vcpu->arch.hflags using two new accessors. This way all changes, on the emulator side as well as in functions called from the emulator and accessing vcpu state with emul_to_vcpu, are preserved. More details on the bug and its manifestation with Windows and OVMF: It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD. I believe that the SMM part explains why we started seeing this only with OVMF. KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because later on in x86_emulate_instruction we overwrite arch.hflags with ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call. The AMD-specific hflag of interest here is HF_NMI_MASK. When rebooting the system, Windows sends an NMI IPI to all but the current cpu to shut them down. Only after all of them are parked in HLT will the initiating cpu finish the restart. If NMI is masked, other cpus never get the memo and the initiating cpu spins forever, waiting for hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe. Fixes: a584539b24b8 ("KVM: x86: pass the whole hflags field to emulator and back") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05KVM: x86: fix fixing of hypercallsDmitry Vyukov
[ Upstream commit ce2e852ecc9a42e4b8dabb46025cfef63209234a ] emulator_fix_hypercall() replaces hypercall with vmcall instruction, but it does not handle GP exception properly when writes the new instruction. It can return X86EMUL_PROPAGATE_FAULT without setting exception information. This leads to incorrect emulation and triggers WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn() as discovered by syzkaller fuzzer: WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558 Call Trace: warn_slowpath_null+0x2c/0x40 kernel/panic.c:582 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline] handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline] vcpu_run arch/x86/kvm/x86.c:6947 [inline] Set exception information when write in emulator_fix_hypercall() fails. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: kvm@vger.kernel.org Cc: syzkaller@googlegroups.com Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14KVM: async_pf: avoid async pf injection when in guest modeWanpeng Li
commit 9bc1f09f6fa76fdf31eb7d6a4a4df43574725f93 upstream. INFO: task gnome-terminal-:1734 blocked for more than 120 seconds. Not tainted 4.12.0-rc4+ #8 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gnome-terminal- D 0 1734 1015 0x00000000 Call Trace: __schedule+0x3cd/0xb30 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? __vfs_read+0x37/0x150 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 This is triggered by running both win7 and win2016 on L1 KVM simultaneously, and then gives stress to memory on L1, I can observed this hang on L1 when at least ~70% swap area is occupied on L0. This is due to async pf was injected to L2 which should be injected to L1, L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host actually), and L1 guest starts accumulating tasks stuck in D state in kvm_async_pf_task_wait() since missing PAGE_READY async_pfs. This patch fixes the hang by doing async pf when executing L1 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulationWanpeng Li
commit a3641631d14571242eec0d30c9faa786cbf52d44 upstream. If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it potentially can be exploited the vulnerability. this will out-of-bounds read and write. Luckily, the effect is small: /* when no next entry is found, the current entry[i] is reselected */ for (j = i + 1; ; j = (j + 1) % nent) { struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j]; if (ej->function == e->function) { It reads ej->maxphyaddr, which is user controlled. However... ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; After cpuid_entries there is int maxphyaddr; struct x86_emulate_ctxt emulate_ctxt; /* 16-byte aligned */ So we have: - cpuid_entries at offset 1B50 (6992) - maxphyaddr at offset 27D0 (6992 + 3200 = 10192) - padding at 27D4...27DF - emulate_ctxt at 27E0 And it writes in the padding. Pfew, writing the ops field of emulate_ctxt would have been much worse. This patch fixes it by modding the index to avoid the out-of-bounds access. Worst case, i == j and ej->function == e->function, the loop can bail out. Reported-by: Moguofang <moguofang@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Guofang Mo <moguofang@huawei.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulationWanpeng Li
commit cbfc6c9184ce71b52df4b1d82af5afc81a709178 upstream. Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation. - "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only, a read should be ignored) in guest can get a random number. - "rep insb" instruction to access PIT register port 0x43 can control memcpy() in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes, which will disclose the unimportant kernel memory in host but no crash. The similar test program below can reproduce the read out-of-bounds vulnerability: void hexdump(void *mem, unsigned int len) { unsigned int i, j; for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++) { /* print offset */ if(i % HEXDUMP_COLS == 0) { printf("0x%06x: ", i); } /* print hex data */ if(i < len) { printf("%02x ", 0xFF & ((char*)mem)[i]); } else /* end of block, just aligning for ASCII dump */ { printf(" "); } /* print ASCII dump */ if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1)) { for(j = i - (HEXDUMP_COLS - 1); j <= i; j++) { if(j >= len) /* end of block, not really printing */ { putchar(' '); } else if(isprint(((char*)mem)[j])) /* printable char */ { putchar(0xFF & ((char*)mem)[j]); } else /* other char */ { putchar('.'); } } putchar('\n'); } } } int main(void) { int i; if (iopl(3)) { err(1, "set iopl unsuccessfully\n"); return -1; } static char buf[0x40]; /* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x40, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x41, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x42, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x44, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x45, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x40 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x40, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x43 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); return 0; } The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation w/o clear after using which results in some random datas are left over in the buffer. Guest reads port 0x43 will be ignored since it is write only, however, the function kernel_pio() can't distigush this ignore from successfully reads data from device's ioport. There is no new data fill the buffer from port 0x43, however, emulator_pio_in_emulated() will copy the stale data in the buffer to the guest unconditionally. This patch fixes it by clearing the buffer before in instruction emulation to avoid to grant guest the stale data in the buffer. In addition, string I/O is not supported for in kernel device. So there is no iteration to read ioport %RCX times for string I/O. The function kernel_pio() just reads one round, and then copy the io size * %RCX to the guest unconditionally, actually it copies the one round ioport data w/ other random datas which are left over in the vcpu->arch.pio_data buffer to the guest. This patch fixes it by introducing the string I/O support for in kernel device in order to grant the right ioport datas to the guest. Before the patch: 0x000000: fe 38 93 93 ff ff ab ab .8...... 0x000008: ab ab ab ab ab ab ab ab ........ 0x000010: ab ab ab ab ab ab ab ab ........ 0x000018: ab ab ab ab ab ab ab ab ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: f6 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 4d 51 30 30 ....MQ00 0x000018: 30 30 20 33 20 20 20 20 00 3 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: f6 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 4d 51 30 30 ....MQ00 0x000018: 30 30 20 33 20 20 20 20 00 3 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ After the patch: 0x000000: 1e 02 f8 00 ff ff ab ab ........ 0x000008: ab ab ab ab ab ab ab ab ........ 0x000010: ab ab ab ab ab ab ab ab ........ 0x000018: ab ab ab ab ab ab ab ab ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: d2 e2 d2 df d2 db d2 d7 ........ 0x000008: d2 d3 d2 cf d2 cb d2 c7 ........ 0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........ 0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ 0x000000: 00 00 00 00 00 00 00 00 ........ 0x000008: 00 00 00 00 00 00 00 00 ........ 0x000010: 00 00 00 00 00 00 00 00 ........ 0x000018: 00 00 00 00 00 00 00 00 ........ 0x000020: ab ab ab ab ab ab ab ab ........ 0x000028: ab ab ab ab ab ab ab ab ........ 0x000030: ab ab ab ab ab ab ab ab ........ 0x000038: ab ab ab ab ab ab ab ab ........ Reported-by: Moguofang <moguofang@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Moguofang <moguofang@huawei.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25KVM: x86: Fix potential preemption when get the current kvmclock timestampWanpeng Li
commit e2c2206a18993bc9f62393d49c7b2066c3845b25 upstream. BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/2809 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 2 PID: 2809 Comm: qemu-system-x86 Not tainted 4.11.0+ #13 Call Trace: dump_stack+0x99/0xce check_preemption_disabled+0xf5/0x100 __this_cpu_preempt_check+0x13/0x20 get_kvmclock_ns+0x6f/0x110 [kvm] get_time_ref_counter+0x5d/0x80 [kvm] kvm_hv_process_stimers+0x2a1/0x8a0 [kvm] ? kvm_hv_process_stimers+0x2a1/0x8a0 [kvm] ? kvm_arch_vcpu_ioctl_run+0xac9/0x1ce0 [kvm] kvm_arch_vcpu_ioctl_run+0x5bf/0x1ce0 [kvm] kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? __fget+0xf3/0x210 do_vfs_ioctl+0xa4/0x700 ? __fget+0x114/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 RIP: 0033:0x7f9d164ed357 ? __this_cpu_preempt_check+0x13/0x20 This can be reproduced by run kvm-unit-tests/hyperv_stimer.flat w/ CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT enabled. Safe access to per-CPU data requires a couple of constraints, though: the thread working with the data cannot be preempted and it cannot be migrated while it manipulates per-CPU variables. If the thread is preempted, the thread that replaces it could try to work with the same variables; migration to another CPU could also cause confusion. However there is no preemption disable when reads host per-CPU tsc rate to calculate the current kvmclock timestamp. This patch fixes it by utilizing get_cpu/put_cpu pair to guarantee both __this_cpu_read() and rdtsc() are not preempted. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25KVM: x86: Fix load damaged SSEx MXCSR registerWanpeng Li
commit a575813bfe4bc15aba511a5e91e61d242bff8b9d upstream. Reported by syzkaller: BUG: unable to handle kernel paging request at ffffffffc07f6a2e IP: report_bug+0x94/0x120 PGD 348e12067 P4D 348e12067 PUD 348e14067 PMD 3cbd84067 PTE 80000003f7e87161 Oops: 0003 [#1] SMP CPU: 2 PID: 7091 Comm: kvm_load_guest_ Tainted: G OE 4.11.0+ #8 task: ffff92fdfb525400 task.stack: ffffbda6c3d04000 RIP: 0010:report_bug+0x94/0x120 RSP: 0018:ffffbda6c3d07b20 EFLAGS: 00010202 do_trap+0x156/0x170 do_error_trap+0xa3/0x170 ? kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm] ? mark_held_locks+0x79/0xa0 ? retint_kernel+0x10/0x10 ? trace_hardirqs_off_thunk+0x1a/0x1c do_invalid_op+0x20/0x30 invalid_op+0x1e/0x30 RIP: 0010:kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm] ? kvm_load_guest_fpu.part.175+0x1c/0x170 [kvm] kvm_arch_vcpu_ioctl_run+0xed6/0x1b70 [kvm] kvm_vcpu_ioctl+0x384/0x780 [kvm] ? kvm_vcpu_ioctl+0x384/0x780 [kvm] ? sched_clock+0x13/0x20 ? __do_page_fault+0x2a0/0x550 do_vfs_ioctl+0xa4/0x700 ? up_read+0x1f/0x40 ? __do_page_fault+0x2a0/0x550 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 SDM mentioned that "The MXCSR has several reserved bits, and attempting to write a 1 to any of these bits will cause a general-protection exception(#GP) to be generated". The syzkaller forks' testcase overrides xsave area w/ random values and steps on the reserved bits of MXCSR register. The damaged MXCSR register values of guest will be restored to SSEx MXCSR register before vmentry. This patch fixes it by catching userspace override MXCSR register reserved bits w/ random values and bails out immediately. Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20KVM: x86: fix user triggerable warning in kvm_apic_accept_events()David Hildenbrand
commit 28bf28887976d8881a3a59491896c718fade7355 upstream. If we already entered/are about to enter SMM, don't allow switching to INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events() will report a warning. Same applies if we are already in MP state INIT_RECEIVED and SMM is requested to be turned on. Refuse to set the VCPU events in this case. Fixes: cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14KVM: nVMX: do not leak PML full vmexit to L1Ladi Prosek
commit ab007cc94ff9d82f5a8db8363b3becbd946e58cf upstream. The PML feature is not exposed to guests so we should not be forwarding the vmexit either. This commit fixes BSOD 0x20001 (HYPERVISOR_ERROR) when running Hyper-V enabled Windows Server 2016 in L1 on hardware that supports PML. Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14KVM: nVMX: initialize PML fields in vmcs02Ladi Prosek
commit 1fb883bb827ee8efc1cc9ea0154f953f8a219d38 upstream. L2 was running with uninitialized PML fields which led to incomplete dirty bitmap logging. This manifested as all kinds of subtle erratic behavior of the nested guest. Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14Revert "KVM: nested VMX: disable perf cpuid reporting"Jim Mattson
commit 0b4c208d443ba2af82b4c70f99ca8df31e9a0020 upstream. This reverts commit bc6134942dbbf31c25e9bd7c876be5da81c9e1ce. A CPUID instruction executed in VMX non-root mode always causes a VM-exit, regardless of the leaf being queried. Fixes: bc6134942dbb ("KVM: nested VMX: disable perf cpuid reporting") Signed-off-by: Jim Mattson <jmattson@google.com> [The issue solved by bc6134942dbb has been resolved with ff651cb613b4 ("KVM: nVMX: Add nested msr load/restore algorithm").] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21kvm: fix page struct leak in handle_vmonPaolo Bonzini
commit 06ce521af9558814b8606c0476c54497cf83a653 upstream. handle_vmon gets a reference on VMXON region page, but does not release it. Release the reference. Found by syzkaller; based on a patch by Dmitry. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 3.16: use skip_emulated_instruction()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31KVM: x86: cleanup the page tracking SRCU instancePaolo Bonzini
commit 2beb6dad2e8f95d710159d5befb390e4f62ab5cf upstream. SRCU uses a delayed work item. Skip cleaning it up, and the result is use-after-free in the work item callbacks. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 0eb05bf290cfe8610d9680b49abef37febd1c38a Reviewed-by: Xiao Guangrong <xiaoguangrong.eric@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-15KVM: VMX: use correct vmcs_read/write for guest segment selector/baseChao Peng
commit 96794e4ed4d758272c486e1529e431efb7045265 upstream. Guest segment selector is 16 bit field and guest segment base is natural width field. Fix two incorrect invocations accordingly. Without this patch, build fails when aggressive inlining is used with ICC. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09KVM: x86: do not save guest-unsupported XSAVE stateRadim Krčmář
commit 00c87e9a70a17b355b81c36adedf05e84f54e10d upstream. Saving unsupported state prevents migration when the new host does not support a XSAVE feature of the original host, even if the feature is not exposed to the guest. We've masked host features with guest-visible features before, with 4344ee981e21 ("KVM: x86: only copy XSAVE state for the supported features") and dropped it when implementing XSAVES. Do it again. Fixes: df1daba7d1cb ("KVM: x86: support XSAVES usage in the host") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: Introduce segmented_write_stdSteve Rutherford
commit 129a72a0d3c8e139a04512325384fe5ac119e74d upstream. Introduces segemented_write_std. Switches from emulated reads/writes to standard read/writes in fxsave, fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding kernel memory leak. Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR", 2016-11-09), which is luckily not yet in any final release, this would also be an exploitable kernel memory *write*! Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 96051572c819194c37a8367624b285be10297eca Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62 Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: emulate FXSAVE and FXRSTORRadim Krčmář
commit 283c95d0e3891b64087706b344a4b545d04a6e62 upstream. Internal errors were reported on 16 bit fxsave and fxrstor with ipxe. Old Intels don't have unrestricted_guest, so we have to emulate them. The patch takes advantage of the hardware implementation. AMD and Intel differ in saving and restoring other fields in first 32 bytes. A test wrote 0xff to the fxsave area, 0 to upper bits of MCSXR in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee, and executed fxsave: Intel (Nehalem): 7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00 ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00 Intel (Haswell -- deprecated FPU CS and FPU DS): 7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00 AMD (Opteron 2300-series): 7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00 fxsave/fxrstor will only be emulated on early Intels, so KVM can't do much to improve the situation. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: add asm_safe wrapperRadim Krčmář
commit aabba3c6abd50b05b1fc2c6ec44244aa6bcda576 upstream. Move the existing exception handling for inline assembly into a macro and switch its return values to X86EMUL type. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: add Align16 instruction flagRadim Krčmář
commit d3fe959f81024072068e9ed86b39c2acfd7462a9 upstream. Needed for FXSAVE and FXRSTOR. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: fix NULL deref in vcpu_scan_ioapicWanpeng Li
commit 546d87e5c903a7f3ee7b9f998949a94729fbc65b upstream. Reported by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0 IP: _raw_spin_lock+0xc/0x30 PGD 3e28eb067 PUD 3f0ac6067 PMD 0 Oops: 0002 [#1] SMP CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3 Call Trace: ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm] kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm] ? pick_next_task_fair+0xe1/0x4e0 ? kvm_arch_vcpu_load+0xea/0x260 [kvm] kvm_vcpu_ioctl+0x33a/0x600 [kvm] ? hrtimer_try_to_cancel+0x29/0x130 ? do_nanosleep+0x97/0xf0 do_vfs_ioctl+0xa1/0x5d0 ? __hrtimer_init+0x90/0x90 ? do_nanosleep+0x5b/0xf0 SyS_ioctl+0x79/0x90 do_syscall_64+0x6e/0x180 entry_SYSCALL64_slow_path+0x25/0x25 RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0 The syzkaller folks reported a NULL pointer dereference due to ENABLE_CAP succeeding even without an irqchip. The Hyper-V synthetic interrupt controller is activated, resulting in a wrong request to rescan the ioapic and a NULL pointer dereference. #include <sys/ioctl.h> #include <sys/mman.h> #include <sys/types.h> #include <linux/kvm.h> #include <pthread.h> #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #ifndef KVM_CAP_HYPERV_SYNIC #define KVM_CAP_HYPERV_SYNIC 123 #endif void* thr(void* arg) { struct kvm_enable_cap cap; cap.flags = 0; cap.cap = KVM_CAP_HYPERV_SYNIC; ioctl((long)arg, KVM_ENABLE_CAP, &cap); return 0; } int main() { void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); int kvmfd = open("/dev/kvm", 0); int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0); struct kvm_userspace_memory_region memreg; memreg.slot = 0; memreg.flags = 0; memreg.guest_phys_addr = 0; memreg.memory_size = 0x1000; memreg.userspace_addr = (unsigned long)host_mem; host_mem[0] = 0xf4; ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg); int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0); struct kvm_sregs sregs; ioctl(cpufd, KVM_GET_SREGS, &sregs); sregs.cr0 = 0; sregs.cr4 = 0; sregs.efer = 0; sregs.cs.selector = 0; sregs.cs.base = 0; ioctl(cpufd, KVM_SET_SREGS, &sregs); struct kvm_regs regs = { .rflags = 2 }; ioctl(cpufd, KVM_SET_REGS, &regs); ioctl(vmfd, KVM_CREATE_IRQCHIP, 0); pthread_t th; pthread_create(&th, 0, thr, (void*)(long)cpufd); usleep(rand() % 10000); ioctl(cpufd, KVM_RUN, 0); pthread_join(th, 0); return 0; } This patch fixes it by failing ENABLE_CAP if without an irqchip. Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: flush pending lapic jump label updates on module unloadDavid Matlack
commit cef84c302fe051744b983a92764d3fcca933415d upstream. KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled). These are implemented with delayed_work structs which can still be pending when the KVM module is unloaded. We've seen this cause kernel panics when the kvm_intel module is quickly reloaded. Use the new static_key_deferred_flush() API to flush pending updates on module unload. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19KVM: x86: fix emulation of "MOV SS, null selector"Paolo Bonzini
commit 33ab91103b3415e12457e3104f0e4517ce12d0f3 upstream. This is CVE-2017-2583. On Intel this causes a failed vmentry because SS's type is neither 3 nor 7 (even though the manual says this check is only done for usable SS, and the dmesg splat says that SS is unusable!). On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb. The fix fabricates a data segment descriptor when SS is set to a null selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb. Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3; this in turn ensures CPL < 3 because RPL must be equal to CPL. Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing the bug and deciphering the manuals. Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com> Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12KVM: x86: reset MMU on KVM_SET_VCPU_EVENTSXiao Guangrong
commit 6ef4e07ecd2db21025c446327ecf34414366498b upstream. Otherwise, mismatch between the smm bit in hflags and the MMU role can cause a NULL pointer dereference. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)Jim Mattson
commit ef85b67385436ddc1998f45f1d6a210f935b3388 upstream. When L2 exits to L0 due to "exception or NMI", software exceptions (#BP and #OF) for which L1 has requested an intercept should be handled by L1 rather than L0. Previously, only hardware exceptions were forwarded to L1. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-24KVM: x86: check for pic and ioapic presence before useRadim Krčmář
Split irqchip allows pic and ioapic routes to be used without them being created, which results in NULL access. Check for NULL and avoid it. (The setup is too racy for a nicer solutions.) Found by syzkaller: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: events irqfd_inject task: ffff88006a06c7c0 task.stack: ffff880068638000 RIP: 0010:[...] [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221 RSP: 0000:ffff88006863ea20 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000 RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000 R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0 R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0 Stack: ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000 ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082 0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000 Call Trace: [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746 [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144 [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151 [...] spin_lock include/linux/spinlock.h:302 [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379 [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52 [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101 [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60 [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096 [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230 [...] kthread+0x328/0x3e0 kernel/kthread.c:209 [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433 Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: 49df6397edfc ("KVM: x86: Split the APIC from the rest of IRQCHIP.") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24KVM: x86: fix out-of-bounds accesses of rtc_eoi mapRadim Krčmář
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be bigger that the maximal number of VCPUs, resulting in out-of-bounds access. Found by syzkaller: BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...] Write of size 1 by task a.out/27101 CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [...] Call Trace: [...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905 [...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495 [...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86 [...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360 [...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222 [...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235 [...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670 [...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668 [...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999 [...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099 Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: af1bae5497b9 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24KVM: x86: drop error recovery in em_jmp_far and em_ret_farRadim Krčmář
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64 bit mode, but syzkaller proved otherwise (and SDM agrees). Code segment was restored upon failure, but it was left uninitialized outside of long mode, which could lead to a leak of host kernel stack. We could have fixed that by always saving and restoring the CS, but we take a simpler approach and just break any guest that manages to fail as the error recovery is error-prone and modern CPUs don't need emulator for this. Found by syzkaller: WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480 Kernel panic - not syncing: panic_on_warn set ... CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [...] Call Trace: [...] __dump_stack lib/dump_stack.c:15 [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51 [...] panic+0x1b7/0x3a3 kernel/panic.c:179 [...] __warn+0x1c4/0x1e0 kernel/panic.c:542 [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217 [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227 [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294 [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545 [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116 [...] complete_emulated_io arch/x86/kvm/x86.c:6870 [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934 [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978 [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557 [...] vfs_ioctl fs/ioctl.c:43 [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679 [...] SYSC_ioctl fs/ioctl.c:694 [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685 [...] entry_SYSCALL_64_fastpath+0x1f/0xc2 Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: d1442d85cc30 ("KVM: x86: Handle errors when RIP is set during far jumps") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24KVM: x86: fix out-of-bounds access in lapicRadim Krčmář
Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff. With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a userspace can send an interrupt with dest_id that results in out-of-bounds access. Found by syzkaller: BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750 Read of size 8 by task syz-executor/22923 CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [...] Call Trace: [...] __dump_stack lib/dump_stack.c:15 [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51 [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156 [...] print_address_description mm/kasan/report.c:194 [...] kasan_report_error mm/kasan/report.c:283 [...] kasan_report+0x231/0x500 mm/kasan/report.c:303 [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329 [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824 [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72 [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157 [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74 [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015 [...] vfs_ioctl fs/ioctl.c:43 [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679 [...] SYSC_ioctl fs/ioctl.c:694 [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685 [...] entry_SYSCALL_64_fastpath+0x1f/0xc2 Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: e45115b62f9a ("KVM: x86: use physical LAPIC array for logical x2APIC") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomicPaolo Bonzini
kvm_arch_set_irq is unused since commit b97e6de9c96. Merge its functionality with kvm_arch_set_irq_inatomic. Reported-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addrPaolo Bonzini
Reported by syzkaller: [ INFO: suspicious RCU usage. ] 4.9.0-rc4+ #47 Not tainted ------------------------------- ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage! stack backtrace: CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000 0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9 ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51 [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445 [< inline >] __kvm_memslots include/linux/kvm_host.h:534 [< inline >] kvm_memslots include/linux/kvm_host.h:541 [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941 [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217 Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: fda4e2e85589191b123d31cdc21fd33ee70f50fd Cc: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>