summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2012-03-08KVM: Introduce gfn_to_index() which returns the index for a given levelTakuya Yoshikawa
This patch cleans up the code and removes the "(void)level;" warning suppressor. Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every level uniformly later. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix write protection race during dirty loggingTakuya Yoshikawa
This patch fixes a race introduced by: commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae KVM: Optimize dirty logging by rmap_write_protect() During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen: A (get_dirty_log) B (another thread) lock(mmu_lock) clear pte.w unlock(mmu_lock) lock(mmu_lock) pte.w is already cleared unlock(mmu_lock) skip TLB flush return ... TLB flush Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: remove yield_on_hltRaghavendra K T
yield_on_hlt was introduced for CPU bandwidth capping. Now it is redundant with CFS hardlimit. yield_on_hlt also complicates the scenario in paravirtual environment, that needs to trap halt. for e.g. paravirtualized ticket spinlocks. Acked-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Track TSC synchronization in generationsZachary Amsden
This allows us to track the original nanosecond and counter values at each phase of TSC writing by the guest. This gets us perfect offset matching for stable TSC systems, and perfect software computed TSC matching for machines with unstable TSC. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Dont mark TSC unstable due to S4 suspendZachary Amsden
During a host suspend, TSC may go backwards, which KVM interprets as an unstable TSC. Technically, KVM should not be marking the TSC unstable, which causes the TSC clocksource to go bad, but we need to be adjusting the TSC offsets in such a case. Dealing with this issue is a little tricky as the only place we can reliably do it is before much of the timekeeping infrastructure is up and running. On top of this, we are not in a KVM thread context, so we may not be able to safely access VCPU fields. Instead, we compute our best known hardware offset at power-up and stash it to be applied to all VCPUs when they actually start running. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow adjust_tsc_offset to be in host or guest cyclesMarcelo Tosatti
Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Add last_host_tsc tracking back to KVMZachary Amsden
The variable last_host_tsc was removed from upstream code. I am adding it back for two reasons. First, it is unnecessary to use guest TSC computation to conclude information about the host TSC. The guest may set the TSC backwards (this case handled by the previous patch), but the computation of guest TSC (and fetching an MSR) is significanlty more work and complexity than simply reading the hardware counter. In addition, we don't actually need the guest TSC for any part of the computation, by always recomputing the offset, we can eliminate the need to deal with the current offset and any scaling factors that may apply. The second reason is that later on, we are going to be using the host TSC value to restore TSC offsets after a host S4 suspend, so we need to be reading the host values, not the guest values here. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix last_guest_tsc / tsc_offset semanticsZachary Amsden
The variable last_guest_tsc was being used as an ad-hoc indicator that guest TSC has been initialized and recorded correctly. However, it may not have been, it could be that guest TSC has been set to some large value, the back to a small value (by, say, a software reboot). This defeats the logic and causes KVM to falsely assume that the guest TSC has gone backwards, marking the host TSC unstable, which is undesirable behavior. In addition, rather than try to compute an offset adjustment for the TSC on unstable platforms, just recompute the whole offset. This allows us to get rid of one callsite for adjust_tsc_offset, which is problematic because the units it takes are in guest units, but here, the computation was originally being done in host units. Doing this, and also recording last_guest_tsc when the TSC is written allow us to remove the tricky logic which depended on last_guest_tsc being zero to indicate a reset of uninitialized value. Instead, we now have the guarantee that the guest TSC offset is always at least something which will get us last_guest_tsc. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Leave TSC synchronization window open with each new syncZachary Amsden
Currently, when the TSC is written by the guest, the variable ns is updated to force the current write to appear to have taken place at the time of the first write in this sync phase. This leaves a cliff at the end of the match window where updates will fall of the end. There are two scenarios where this can be a problem in practe - first, on a system with a large number of VCPUs, the sync period may last for an extended period of time. The second way this can happen is if the VM reboots very rapidly and we catch a VCPU TSC synchronization just around the edge. We may be unaware of the reboot, and thus the first VCPU might synchronize with an old set of the timer (at, say 0.97 seconds ago, when first powered on). The second VCPU can come in 0.04 seconds later to try to synchronize, but it misses the window because it is just over the threshold. Instead, stop doing this artificial setback of the ns variable and just update it with every write of the TSC. It may be observed that doing so causes values computed by compute_guest_tsc to diverge slightly across CPUs - note that the last_tsc_ns and last_tsc_write variable are used here, and now they last_tsc_ns will be different for each VCPU, reflecting the actual time of the update. However, compute_guest_tsc is used only for guests which already have TSC stability issues, and further, note that the previous patch has caused last_tsc_write to be incremented by the difference in nanoseconds, converted back into guest cycles. As such, only boundary rounding errors should be visible, which given the resolution in nanoseconds, is going to only be a few cycles and only visible in cross-CPU consistency tests. The problem can be fixed by adding a new set of variables to track the start offset and start write value for the current sync cycle. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Improve TSC offset matchingZachary Amsden
There are a few improvements that can be made to the TSC offset matching code. First, we don't need to call the 128-bit multiply (especially on a constant number), the code works much nicer to do computation in nanosecond units. Second, the way everything is setup with software TSC rate scaling, we currently have per-cpu rates. Obviously this isn't too desirable to use in practice, but if for some reason we do change the rate of all VCPUs at runtime, then reset the TSCs, we will only want to match offsets for VCPUs running at the same rate. Finally, for the case where we have an unstable host TSC, but rate scaling is being done in hardware, we should call the platform code to compute the TSC offset, so the math is reorganized to recompute the base instead, then transform the base into an offset using the existing API. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> KVM: Fix 64-bit division in kvm_write_tsc() Breaks i386 build. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Infrastructure for software and hardware based TSC rate scalingZachary Amsden
This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86: increase recommended max vcpus to 160Marcelo Tosatti
Increase recommended max vcpus from 64 to 160 (tested internally at Red Hat). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05x86: Introduce x86_cpuinit.early_percpu_clock_init hookIgor Mammedov
When kvm guest uses kvmclock, it may hang on vcpu hot-plug. This is caused by an overflow in pvclock_get_nsec_offset, u64 delta = tsc - shadow->tsc_timestamp; which in turn is caused by an undefined values from percpu hv_clock that hasn't been initialized yet. Uninitialized clock on being booted cpu is accessed from start_secondary -> smp_callin -> smp_store_cpu_info -> identify_secondary_cpu -> mtrr_ap_init -> mtrr_restore -> stop_machine_from_inactive_cpu -> queue_stop_cpus_work ... -> sched_clock -> kvm_clock_read which is well before x86_cpuinit.setup_percpu_clockev call in start_secondary, where percpu clock is initialized. This patch introduces a hook that allows to setup/initialize per_cpu clock early and avoid overflow due to reading - undefined values - old values if cpu was offlined and then onlined again Another possible early user of this clock source is ftrace that accesses it to get timestamps for ring buffer entries. So if mtrr_ap_init is moved from identify_secondary_cpu to past x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace may cause the same overflow/hang on cpu hot-plug anyway. More complete description of the problem: https://lkml.org/lkml/2012/2/2/101 Credits to Marcelo Tosatti <mtosatti@redhat.com> for hook idea. Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86: reset edge sense circuit of i8259 on initGleb Natapov
The spec says that during initialization "The edge sense circuit is reset which means that following initialization an interrupt request (IR) input must make a low-to-high transition to generate an interrupt", but currently if edge triggered interrupt is in IRR it is delivered after i8259 initialization. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: reject SYSENTER in compatibility mode on AMD guestsAvi Kivity
If the guest thinks it's an AMD, it will not have prepared the SYSENTER MSRs, and if the guest executes SYSENTER in compatibility mode, it will fails. Detect this condition and #UD instead, like the spec says. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)Julian Stecklina
If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1." Signed-off-by: Julian Stecklina <js@alien8.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: SVM: comment nested paging and virtualization module parametersDavidlohr Bueso
Also use true instead of 1 for enabling by default. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from rmap_next()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm_pte_chainTakuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hackAvi Kivity
Currently we treat MOVSX/MOVZX with a byte source as a byte instruction, and change the destination operand size with a hack. Change it to be a word instruction, so the destination receives its natural size, and change the source to be SrcMem8. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: x86 emulator: add 8-bit memory operandsAvi Kivity
Useful for MOVSX/MOVZX. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: provide synchronous registers in kvm_runChristian Borntraeger
On some cpus the overhead for virtualization instructions is in the same range as a system call. Having to call multiple ioctls to get set registers will make certain userspace handled exits more expensive than necessary. Lets provide a section in kvm_run that works as a shared save area for guest registers. We also provide two 64bit flags fields (architecture specific), that will specify 1. which parts of these fields are valid. 2. which registers were modified by userspace Each bit for these flag fields will define a group of registers (like general purpose) or a single register. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: SVM: Add support for AMD's OSVW feature in guestsBoris Ostrovsky
In some cases guests should not provide workarounds for errata even when the physical processor is affected. For example, because of erratum 400 on family 10h processors a Linux guest will read an MSR (resulting in VMEXIT) before going to idle in order to avoid getting stuck in a non-C0 state. This is not necessary: HLT and IO instructions are intercepted and therefore there is no reason for erratum 400 workaround in the guest. This patch allows us to present a guest with certain errata as fixed, regardless of the state of actual hardware. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: unnecessary NX state assignmentDavidlohr Bueso
We can remove the first ->nx state assignment since it is assigned afterwards anyways. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: ucontrol: export SIE control block to userCarsten Otte
This patch exports the s390 SIE hardware control block to userspace via the mapping of the vcpu file descriptor. In order to do so, a new arch callback named kvm_arch_vcpu_fault is introduced for all architectures. It allows to map architecture specific pages. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: add parameter for KVM_CREATE_VMCarsten Otte
This patch introduces a new config option for user controlled kernel virtual machines. It introduces a parameter to KVM_CREATE_VM that allows to set bits that alter the capabilities of the newly created virtual machine. The parameter is passed to kvm_arch_init_vm for all architectures. The only valid modifier bit for now is KVM_VM_S390_UCONTROL. This requires CAP_SYS_ADMIN privileges and creates a user controlled virtual machine on s390 architectures. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: remove the redundant get_written_sptesXiao Guangrong
get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Add missing large page accounting to drop_large_spte()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove for_each_unsync_children() macroTakuya Yoshikawa
There is only one user of it and for_each_set_bit() does the same. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-27Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce/AMD: Fix UP build error x86: Specify a size for the cmp in the NMI handler x86/nmi: Test saved %cs in NMI to determine nested NMI case x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processors x86/microcode: Remove noisy AMD microcode warning
2012-02-27Merge tag 'stable/for-linus-fixes-3.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Two fixes to fix a memory corruption bug when WC pages never get converted back to WB but end up being recycled in the general memory pool as WC. There is a better way of fixing this, but there is not enough time to do the full benchmarking to pick one of the right options - so picking the one that favors stability for right now. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> * tag 'stable/for-linus-fixes-3.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/pat: Disable PAT support for now. xen/setup: Remove redundant filtering of PTE masks.
2012-02-22x86/mce/AMD: Fix UP build errorBorislav Petkov
141168c36cde ("x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'") removed a bunch of CONFIG_SMP ifdefs around code touching struct cpuinfo_x86 members but also caused the following build error with Randy's randconfigs: mce_amd.c:(.cpuinit.text+0x4723): undefined reference to `cpu_llc_shared_map' Restore the #ifdef in threshold_create_bank() which creates symlinks on the non-BSP CPUs. There's a better patch series being worked on by Kevin Winchester which will solve this in a cleaner fashion, but that series is too ambitious for v3.3 merging - so we first queue up this trivial fix and then do the rest for v3.4. Signed-off-by: Borislav Petkov <bp@alien8.de> Acked-by: Kevin Winchester <kjwinchester@gmail.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Nick Bowler <nbowler@elliptictech.com> Link: http://lkml.kernel.org/r/20120203191801.GA2846@x1.osrc.amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-21i387: export 'fpu_owner_task' per-cpu variableLinus Torvalds
(And define it properly for x86-32, which had its 'current_task' declaration in separate from x86-64) Bitten by my dislike for modules on the machines I use, and the fact that apparently nobody else actually wanted to test the patches I sent out. Snif. Nobody else cares. Anyway, we probably should uninline the 'kernel_fpu_begin()' function that is what modules actually use and that references this, but this is the minimal fix for now. Reported-by: Josh Boyer <jwboyer@gmail.com> Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-21x86: Specify a size for the cmp in the NMI handlerSteven Rostedt
Linus noticed that the cmp used to check if the code segment is __KERNEL_CS or not did not specify a size. Perhaps it does not matter as H. Peter Anvin noted that user space can not set the bottom two bits of the %cs register. But it's best not to let the assembly choose and change things between different versions of gas, but instead just pick the size. Four bytes are used to compare the saved code segment against __KERNEL_CS. Perhaps this might mess up Xen, but we can fix that when the time comes. Also I noticed that there was another non-specified cmp that checks the special stack variable if it is 1 or 0. This too probably doesn't matter what cmp is used, but this patch uses cmpl just to make it non ambiguous. Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-20i387: support lazy restore of FPU stateLinus Torvalds
This makes us recognize when we try to restore FPU state that matches what we already have in the FPU on this CPU, and avoids the restore entirely if so. To do this, we add two new data fields: - a percpu 'fpu_owner_task' variable that gets written any time we update the "has_fpu" field, and thus acts as a kind of back-pointer to the task that owns the CPU. The exception is when we save the FPU state as part of a context switch - if the save can keep the FPU state around, we leave the 'fpu_owner_task' variable pointing at the task whose FP state still remains on the CPU. - a per-thread 'last_cpu' field, that indicates which CPU that thread used its FPU on last. We update this on every context switch (writing an invalid CPU number if the last context switch didn't leave the FPU in a lazily usable state), so we know that *that* thread has done nothing else with the FPU since. These two fields together can be used when next switching back to the task to see if the CPU still matches: if 'fpu_owner_task' matches the task we are switching to, we know that no other task (or kernel FPU usage) touched the FPU on this CPU in the meantime, and if the current CPU number matches the 'last_cpu' field, we know that this thread did no other FP work on any other CPU, so the FPU state on the CPU must match what was saved on last context switch. In that case, we can avoid the 'f[x]rstor' entirely, and just clear the CR0.TS bit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20i387: use 'restore_fpu_checking()' directly in task switching codeLinus Torvalds
This inlines what is usually just a couple of instructions, but more importantly it also fixes the theoretical error case (can that FPU restore really ever fail? Maybe we should remove the checking). We can't start sending signals from within the scheduler, we're much too deep in the kernel and are holding the runqueue lock etc. So don't bother even trying. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20i387: fix up some fpu_counter confusionLinus Torvalds
This makes sure we clear the FPU usage counter for newly created tasks, just so that we start off in a known state (for example, don't try to preload the FPU state on the first task switch etc). It also fixes a thinko in when we increment the fpu_counter at task switch time, introduced by commit 34ddc81a230b ("i387: re-introduce FPU state preloading at context switch time"). We should increment the *new* task fpu_counter, not the old task, and only if we decide to use that state (whether lazily or preloaded). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20xen/pat: Disable PAT support for now.Konrad Rzeszutek Wilk
[Pls also look at https://lkml.org/lkml/2012/2/10/228] Using of PAT to change pages from WB to WC works quite nicely. Changing it back to WB - not so much. The crux of the matter is that the code that does this (__page_change_att_set_clr) has only limited information so when it tries to the change it gets the "raw" unfiltered information instead of the properly filtered one - and the "raw" one tell it that PSE bit is on (while infact it is not). As a result when the PTE is set to be WB from WC, we get tons of: :WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0() :Hardware name: HP xw4400 Workstation .. snip.. :Pid: 27, comm: kswapd0 Tainted: G W 3.2.2-1.fc16.x86_64 #1 :Call Trace: : [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0 : [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20 : [<ffffffff81005a17>] xen_make_pte+0x67/0xa0 : [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e : [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00 : [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0 : [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190 : [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0 : [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0 : [<ffffffff8100a9b2>] ? check_events+0x12/0x20 : [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm] : [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm] : [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm] : [<ffffffff8112ba04>] shrink_slab+0x154/0x310 : [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0 : [<ffffffff8112f4b8>] kswapd+0x178/0x3d0 : [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0 : [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50 : [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0 : [<ffffffff8108fb6c>] kthread+0x8c/0xa0 for every page. The proper fix for this is has been posted and is https://lkml.org/lkml/2012/2/10/228 "x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations." along with a detailed description of the problem and solution. But since that posting has gone nowhere I am proposing this band-aid solution so that at least users don't get the page corruption (the pages that are WC don't get changed to WB and end up being recycled for filesystem or other things causing mysterious crashes). The negative impact of this patch is that users of WC flag (which are InfiniBand, radeon, nouveau drivers) won't be able to set that flag - so they are going to see performance degradation. But stability is more important here. Fixes RH BZ# 742032, 787403, and 745574 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-20xen/setup: Remove redundant filtering of PTE masks.Konrad Rzeszutek Wilk
commit 7347b4082e55ac4a673f06a0a0ce25c37273c9ec "xen: Allow unprivileged Xen domains to create iomap pages" added a redundant line in the early bootup code to filter out the PTE. That filtering is already done a bit earlier so this extra processing is not required. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-20x86/nmi: Test saved %cs in NMI to determine nested NMI caseSteven Rostedt
Currently, the NMI handler tests if it is nested by checking the special variable saved on the stack (set during NMI handling) and whether the saved stack is the NMI stack as well (to prevent the race when the variable is set to zero). But userspace may set their %rsp to any value as long as they do not derefence it, and it may make it point to the NMI stack, which will prevent NMIs from triggering while the userspace app is running. (I tested this, and it is indeed the case) Add another check to determine nested NMIs by looking at the saved %cs (code segment register) and making sure that it is the kernel code segment. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1329687817.1561.27.camel@acer.local.home Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-18i387: re-introduce FPU state preloading at context switch timeLinus Torvalds
After all the FPU state cleanups and finally finding the problem that caused all our FPU save/restore problems, this re-introduces the preloading of FPU state that was removed in commit b3b0870ef3ff ("i387: do not preload FPU state at task switch time"). However, instead of simply reverting the removal, this reimplements preloading with several fixes, most notably - properly abstracted as a true FPU state switch, rather than as open-coded save and restore with various hacks. In particular, implementing it as a proper FPU state switch allows us to optimize the CR0.TS flag accesses: there is no reason to set the TS bit only to then almost immediately clear it again. CR0 accesses are quite slow and expensive, don't flip the bit back and forth for no good reason. - Make sure that the same model works for both x86-32 and x86-64, so that there are no gratuitous differences between the two due to the way they save and restore segment state differently due to architectural differences that really don't matter to the FPU state. - Avoid exposing the "preload" state to the context switch routines, and in particular allow the concept of lazy state restore: if nothing else has used the FPU in the meantime, and the process is still on the same CPU, we can avoid restoring state from memory entirely, just re-expose the state that is still in the FPU unit. That optimized lazy restore isn't actually implemented here, but the infrastructure is set up for it. Of course, older CPU's that use 'fnsave' to save the state cannot take advantage of this, since the state saving also trashes the state. In other words, there is now an actual _design_ to the FPU state saving, rather than just random historical baggage. Hopefully it's easier to follow as a result. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-18i387: move TS_USEDFPU flag from thread_info to task_structLinus Torvalds
This moves the bit that indicates whether a thread has ownership of the FPU from the TS_USEDFPU bit in thread_info->status to a word of its own (called 'has_fpu') in task_struct->thread.has_fpu. This fixes two independent bugs at the same time: - changing 'thread_info->status' from the scheduler causes nasty problems for the other users of that variable, since it is defined to be thread-synchronous (that's what the "TS_" part of the naming was supposed to indicate). So perfectly valid code could (and did) do ti->status |= TS_RESTORE_SIGMASK; and the compiler was free to do that as separate load, or and store instructions. Which can cause problems with preemption, since a task switch could happen in between, and change the TS_USEDFPU bit. The change to TS_USEDFPU would be overwritten by the final store. In practice, this seldom happened, though, because the 'status' field was seldom used more than once, so gcc would generally tend to generate code that used a read-modify-write instruction and thus happened to avoid this problem - RMW instructions are naturally low fat and preemption-safe. - On x86-32, the current_thread_info() pointer would, during interrupts and softirqs, point to a *copy* of the real thread_info, because x86-32 uses %esp to calculate the thread_info address, and thus the separate irq (and softirq) stacks would cause these kinds of odd thread_info copy aliases. This is normally not a problem, since interrupts aren't supposed to look at thread information anyway (what thread is running at interrupt time really isn't very well-defined), but it confused the heck out of irq_fpu_usable() and the code that tried to squirrel away the FPU state. (It also caused untold confusion for us poor kernel developers). It also turns out that using 'task_struct' is actually much more natural for most of the call sites that care about the FPU state, since they tend to work with the task struct for other reasons anyway (ie scheduling). And the FPU data that we are going to save/restore is found there too. Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to the %esp issue. Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-and-tested-by: Raphael Prevost <raphael@buro.asia> Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-17i387: move AMD K7/K8 fpu fxsave/fxrstor workaround from save to restoreLinus Torvalds
The AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is pending. In order to not leak FIP state from one process to another, we need to do a floating point load after the fxsave of the old process, and before the fxrstor of the new FPU state. That resets the state to the (uninteresting) kernel load, rather than some potentially sensitive user information. We used to do this directly after the FPU state save, but that is actually very inconvenient, since it (a) corrupts what is potentially perfectly good FPU state that we might want to lazy avoid restoring later and (b) on x86-64 it resulted in a very annoying ordering constraint, where "__unlazy_fpu()" in the task switch needs to be delayed until after the DS segment has been reloaded just to get the new DS value. Coupling it to the fxrstor instead of the fxsave automatically avoids both of these issues, and also ensures that we only do it when actually necessary (the FP state after a save may never actually get used). It's simply a much more natural place for the leaked state cleanup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: do not preload FPU state at task switch timeLinus Torvalds
Yes, taking the trap to re-load the FPU/MMX state is expensive, but so is spending several days looking for a bug in the state save/restore code. And the preload code has some rather subtle interactions with both paravirtualization support and segment state restore, so it's not nearly as simple as it should be. Also, now that we no longer necessarily depend on a single bit (ie TS_USEDFPU) for keeping track of the state of the FPU, we migth be able to do better. If we are really switching between two processes that keep touching the FP state, save/restore is inevitable, but in the case of having one process that does most of the FPU usage, we may actually be able to do much better than the preloading. In particular, we may be able to keep track of which CPU the process ran on last, and also per CPU keep track of which process' FP state that CPU has. For modern CPU's that don't destroy the FPU contents on save time, that would allow us to do a lazy restore by just re-enabling the existing FPU state - with no restore cost at all! Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: don't ever touch TS_USEDFPU directly, use helper functionsLinus Torvalds
This creates three helper functions that do the TS_USEDFPU accesses, and makes everybody that used to do it by hand use those helpers instead. In addition, there's a couple of helper functions for the "change both CR0.TS and TS_USEDFPU at the same time" case, and the places that do that together have been changed to use those. That means that we have fewer random places that open-code this situation. The intent is partly to clarify the code without actually changing any semantics yet (since we clearly still have some hard to reproduce bug in this area), but also to make it much easier to use another approach entirely to caching the CR0.TS bit for software accesses. Right now we use a bit in the thread-info 'status' variable (this patch does not change that), but we might want to make it a full field of its own or even make it a per-cpu variable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: move TS_USEDFPU clearing out of __save_init_fpu and into callersLinus Torvalds
Touching TS_USEDFPU without touching CR0.TS is confusing, so don't do it. By moving it into the callers, we always do the TS_USEDFPU next to the CR0.TS accesses in the source code, and it's much easier to see how the two go hand in hand. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: fix x86-64 preemption-unsafe user stack save/restoreLinus Torvalds
Commit 5b1cbac37798 ("i387: make irq_fpu_usable() tests more robust") added a sanity check to the #NM handler to verify that we never cause the "Device Not Available" exception in kernel mode. However, that check actually pinpointed a (fundamental) race where we do cause that exception as part of the signal stack FPU state save/restore code. Because we use the floating point instructions themselves to save and restore state directly from user mode, we cannot do that atomically with testing the TS_USEDFPU bit: the user mode access itself may cause a page fault, which causes a task switch, which saves and restores the FP/MMX state from the kernel buffers. This kind of "recursive" FP state save is fine per se, but it means that when the signal stack save/restore gets restarted, it will now take the '#NM' exception we originally tried to avoid. With preemption this can happen even without the page fault - but because of the user access, we cannot just disable preemption around the save/restore instruction. There are various ways to solve this, including using the "enable/disable_page_fault()" helpers to not allow page faults at all during the sequence, and fall back to copying things by hand without the use of the native FP state save/restore instructions. However, the simplest thing to do is to just allow the #NM from kernel space, but fix the race in setting and clearing CR0.TS that this all exposed: the TS bit changes and the TS_USEDFPU bit absolutely have to be atomic wrt scheduling, so while the actual state save/restore can be interrupted and restarted, the act of actually clearing/setting CR0.TS and the TS_USEDFPU bit together must not. Instead of just adding random "preempt_disable/enable()" calls to what is already excessively ugly code, this introduces some helper functions that mostly mirror the "kernel_fpu_begin/end()" functionality, just for the user state instead. Those helper functions should probably eventually replace the other ad-hoc CR0.TS and TS_USEDFPU tests too, but I'll need to think about it some more: the task switching functionality in particular needs to expose the difference between the 'prev' and 'next' threads, while the new helper functions intentionally were written to only work with 'current'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-15i387: fix sense of sanity checkLinus Torvalds
The check for save_init_fpu() (introduced in commit 5b1cbac37798: "i387: make irq_fpu_usable() tests more robust") was the wrong way around, but I hadn't noticed, because my "tests" were bogus: the FPU exceptions are disabled by default, so even doing a divide by zero never actually triggers this code at all unless you do extra work to enable them. So if anybody did enable them, they'd get one spurious warning. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-14Merge tag 'stable/for-linus-fixes-3.3-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Two fixes for VCPU offlining; One to fix the string format exposed by the xen-pci[front|back] to conform to the one used in majority of PCI drivers; Two fixes to make the code more resilient to invalid configurations. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> * tag 'stable/for-linus-fixes-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xenbus_dev: add missing error check to watch handling xen/pci[front|back]: Use %d instead of %1x for displaying PCI devfn. xen pvhvm: do not remap pirqs onto evtchns if !xen_have_vector_callback xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic. xen/bootup: During bootup suppress XENBUS: Unable to read cpu state