summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2012-03-08KVM: Introduce gfn_to_index() which returns the index for a given levelTakuya Yoshikawa
This patch cleans up the code and removes the "(void)level;" warning suppressor. Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every level uniformly later. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: provide control registers via kvm_runChristian Borntraeger
There are several cases were we need the control registers for userspace. Lets also provide those in kvm_run. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: add stop_on_stop flag when doing stop and storeJens Freimann
When we do a stop and store status we need to pass ACTION_STOP_ON_STOP flag to __sigp_stop(). Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: ignore sigp stop overinitiativeJens Freimann
In __inject_sigp_stop() do nothing when the CPU is already in stopped state. Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: make sigp restart return busy when stop pendingJens Freimann
On reboot the guest sends in smp_send_stop() a sigp stop to all CPUs except for current CPU. Then the guest switches to the IPL cpu by sending a restart to the IPL CPU, followed by a sigp stop to the current cpu. Since restart is handled by userspace it's possible that the restart is delivered before the old stop. This means that the IPL CPU isn't restarted and we have no running CPUs. So let's make sure that there is no stop action pending when we do the restart. Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: do store status after handling STOP_ON_STOP bitJens Freimann
In handle_stop() handle the stop bit before doing the store status as described for "Stop and Store Status" in the Principles of Operation. We have to give up the local_int.lock before calling kvm store status since it calls gmap_fault() which might sleep. Since local_int.lock only protects local_int.* and not guest memory we can give up the lock. Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: s390: Sanitize fpc registers for KVM_SET_FPUChristian Borntraeger
commit 7eef87dc99e419b1cc051e4417c37e4744d7b661 (KVM: s390: fix register setting) added a load of the floating point control register to the KVM_SET_FPU path. Lets make sure that the fpc is valid. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix write protection race during dirty loggingTakuya Yoshikawa
This patch fixes a race introduced by: commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae KVM: Optimize dirty logging by rmap_write_protect() During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen: A (get_dirty_log) B (another thread) lock(mmu_lock) clear pte.w unlock(mmu_lock) lock(mmu_lock) pte.w is already cleared unlock(mmu_lock) skip TLB flush return ... TLB flush Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: remove yield_on_hltRaghavendra K T
yield_on_hlt was introduced for CPU bandwidth capping. Now it is redundant with CFS hardlimit. yield_on_hlt also complicates the scenario in paravirtual environment, that needs to trap halt. for e.g. paravirtualized ticket spinlocks. Acked-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Track TSC synchronization in generationsZachary Amsden
This allows us to track the original nanosecond and counter values at each phase of TSC writing by the guest. This gets us perfect offset matching for stable TSC systems, and perfect software computed TSC matching for machines with unstable TSC. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Dont mark TSC unstable due to S4 suspendZachary Amsden
During a host suspend, TSC may go backwards, which KVM interprets as an unstable TSC. Technically, KVM should not be marking the TSC unstable, which causes the TSC clocksource to go bad, but we need to be adjusting the TSC offsets in such a case. Dealing with this issue is a little tricky as the only place we can reliably do it is before much of the timekeeping infrastructure is up and running. On top of this, we are not in a KVM thread context, so we may not be able to safely access VCPU fields. Instead, we compute our best known hardware offset at power-up and stash it to be applied to all VCPUs when they actually start running. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow adjust_tsc_offset to be in host or guest cyclesMarcelo Tosatti
Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Add last_host_tsc tracking back to KVMZachary Amsden
The variable last_host_tsc was removed from upstream code. I am adding it back for two reasons. First, it is unnecessary to use guest TSC computation to conclude information about the host TSC. The guest may set the TSC backwards (this case handled by the previous patch), but the computation of guest TSC (and fetching an MSR) is significanlty more work and complexity than simply reading the hardware counter. In addition, we don't actually need the guest TSC for any part of the computation, by always recomputing the offset, we can eliminate the need to deal with the current offset and any scaling factors that may apply. The second reason is that later on, we are going to be using the host TSC value to restore TSC offsets after a host S4 suspend, so we need to be reading the host values, not the guest values here. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix last_guest_tsc / tsc_offset semanticsZachary Amsden
The variable last_guest_tsc was being used as an ad-hoc indicator that guest TSC has been initialized and recorded correctly. However, it may not have been, it could be that guest TSC has been set to some large value, the back to a small value (by, say, a software reboot). This defeats the logic and causes KVM to falsely assume that the guest TSC has gone backwards, marking the host TSC unstable, which is undesirable behavior. In addition, rather than try to compute an offset adjustment for the TSC on unstable platforms, just recompute the whole offset. This allows us to get rid of one callsite for adjust_tsc_offset, which is problematic because the units it takes are in guest units, but here, the computation was originally being done in host units. Doing this, and also recording last_guest_tsc when the TSC is written allow us to remove the tricky logic which depended on last_guest_tsc being zero to indicate a reset of uninitialized value. Instead, we now have the guarantee that the guest TSC offset is always at least something which will get us last_guest_tsc. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Leave TSC synchronization window open with each new syncZachary Amsden
Currently, when the TSC is written by the guest, the variable ns is updated to force the current write to appear to have taken place at the time of the first write in this sync phase. This leaves a cliff at the end of the match window where updates will fall of the end. There are two scenarios where this can be a problem in practe - first, on a system with a large number of VCPUs, the sync period may last for an extended period of time. The second way this can happen is if the VM reboots very rapidly and we catch a VCPU TSC synchronization just around the edge. We may be unaware of the reboot, and thus the first VCPU might synchronize with an old set of the timer (at, say 0.97 seconds ago, when first powered on). The second VCPU can come in 0.04 seconds later to try to synchronize, but it misses the window because it is just over the threshold. Instead, stop doing this artificial setback of the ns variable and just update it with every write of the TSC. It may be observed that doing so causes values computed by compute_guest_tsc to diverge slightly across CPUs - note that the last_tsc_ns and last_tsc_write variable are used here, and now they last_tsc_ns will be different for each VCPU, reflecting the actual time of the update. However, compute_guest_tsc is used only for guests which already have TSC stability issues, and further, note that the previous patch has caused last_tsc_write to be incremented by the difference in nanoseconds, converted back into guest cycles. As such, only boundary rounding errors should be visible, which given the resolution in nanoseconds, is going to only be a few cycles and only visible in cross-CPU consistency tests. The problem can be fixed by adding a new set of variables to track the start offset and start write value for the current sync cycle. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Improve TSC offset matchingZachary Amsden
There are a few improvements that can be made to the TSC offset matching code. First, we don't need to call the 128-bit multiply (especially on a constant number), the code works much nicer to do computation in nanosecond units. Second, the way everything is setup with software TSC rate scaling, we currently have per-cpu rates. Obviously this isn't too desirable to use in practice, but if for some reason we do change the rate of all VCPUs at runtime, then reset the TSCs, we will only want to match offsets for VCPUs running at the same rate. Finally, for the case where we have an unstable host TSC, but rate scaling is being done in hardware, we should call the platform code to compute the TSC offset, so the math is reorganized to recompute the base instead, then transform the base into an offset using the existing API. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> KVM: Fix 64-bit division in kvm_write_tsc() Breaks i386 build. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Infrastructure for software and hardware based TSC rate scalingZachary Amsden
This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86: increase recommended max vcpus to 160Marcelo Tosatti
Increase recommended max vcpus from 64 to 160 (tested internally at Red Hat). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05x86: Introduce x86_cpuinit.early_percpu_clock_init hookIgor Mammedov
When kvm guest uses kvmclock, it may hang on vcpu hot-plug. This is caused by an overflow in pvclock_get_nsec_offset, u64 delta = tsc - shadow->tsc_timestamp; which in turn is caused by an undefined values from percpu hv_clock that hasn't been initialized yet. Uninitialized clock on being booted cpu is accessed from start_secondary -> smp_callin -> smp_store_cpu_info -> identify_secondary_cpu -> mtrr_ap_init -> mtrr_restore -> stop_machine_from_inactive_cpu -> queue_stop_cpus_work ... -> sched_clock -> kvm_clock_read which is well before x86_cpuinit.setup_percpu_clockev call in start_secondary, where percpu clock is initialized. This patch introduces a hook that allows to setup/initialize per_cpu clock early and avoid overflow due to reading - undefined values - old values if cpu was offlined and then onlined again Another possible early user of this clock source is ftrace that accesses it to get timestamps for ring buffer entries. So if mtrr_ap_init is moved from identify_secondary_cpu to past x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace may cause the same overflow/hang on cpu hot-plug anyway. More complete description of the problem: https://lkml.org/lkml/2012/2/2/101 Credits to Marcelo Tosatti <mtosatti@redhat.com> for hook idea. Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86: reset edge sense circuit of i8259 on initGleb Natapov
The spec says that during initialization "The edge sense circuit is reset which means that following initialization an interrupt request (IR) input must make a low-to-high transition to generate an interrupt", but currently if edge triggered interrupt is in IRR it is delivered after i8259 initialization. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Add HPT preallocatorAlexander Graf
We're currently allocating 16MB of linear memory on demand when creating a guest. That does work some times, but finding 16MB of linear memory available in the system at runtime is definitely not a given. So let's add another command line option similar to the RMA preallocator, that we can use to keep a pool of page tables around. Now, when a guest gets created it has a pretty low chance of receiving an OOM. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Initialize linears with zerosAlexander Graf
RMAs and HPT preallocated spaces should be zeroed, so we don't accidently leak information from previous VM executions. Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Convert RMA allocation into generic codeAlexander Graf
We have code to allocate big chunks of linear memory on bootup for later use. This code is currently used for RMA allocation, but can be useful beyond that extent. Make it generic so we can reuse it for other stuff later. Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: E500: Fail init when not on e500v2Alexander Graf
When enabling the current KVM code on e500mc, I get the following oops: Oops: Exception in kernel mode, sig: 4 [#1] SMP NR_CPUS=8 P2041 RDB Modules linked in: NIP: c067df4c LR: c067df44 CTR: 00000000 REGS: ee055ed0 TRAP: 0700 Not tainted (3.2.0-10391-g36c5afe) MSR: 00029002 <CE,EE,ME> CR: 24042022 XER: 00000000 TASK = ee0429b0[1] 'swapper/0' THREAD: ee054000 CPU: 2 GPR00: c067df44 ee055f80 ee0429b0 00000000 00000058 0000003f ee211600 60c6b864 GPR08: 7cc903a6 0000002c 00000000 00000001 44042082 2d180088 00000000 00000000 GPR16: c0000a00 00000014 3fffffff 03fe9000 00000015 7ff3be68 c06e0000 00000000 GPR24: 00000000 00000000 00001720 c067df1c c06e0000 00000000 ee054000 c06ab51c NIP [c067df4c] kvmppc_e500_init+0x30/0xf8 LR [c067df44] kvmppc_e500_init+0x28/0xf8 Call Trace: [ee055f80] [c067df44] kvmppc_e500_init+0x28/0xf8 (unreliable) [ee055fb0] [c0001d30] do_one_initcall+0x50/0x1f0 [ee055fe0] [c06721dc] kernel_init+0xa4/0x14c [ee055ff0] [c000e910] kernel_thread+0x4c/0x68 Instruction dump: 9421ffd0 7c0802a6 93410018 9361001c 90010034 93810020 93a10024 93c10028 93e1002c 4bfffe7d 2c030000 408200a4 <7c1082a6> 90010008 7c1182a6 9001000c ---[ end trace b8ef4903fcbf9dd3 ]--- Since it doesn't make sense to run the init function on any non-supported platform, we can just call our "is this platform supported?" function and bail out of init() if it's not. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: Move gfn_to_memslot() to kvm_host.hPaul Mackerras
This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to kvm_host.h to reduce the code duplication caused by the need for non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call gfn_to_memslot() in real mode. Rather than putting gfn_to_memslot() itself in a header, which would lead to increased code size, this puts __gfn_to_memslot() in a header. Then, the non-modular uses of gfn_to_memslot() are changed to call __gfn_to_memslot() instead. This way there is only one place in the source code that needs to be changed should the gfn_to_memslot() implementation need to be modified. On powerpc, the Book3S HV style of KVM has code that is called from real mode which needs to call gfn_to_memslot() and thus needs this. (Module code is allocated in the vmalloc region, which can't be accessed in real mode.) With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: reject SYSENTER in compatibility mode on AMD guestsAvi Kivity
If the guest thinks it's an AMD, it will not have prepared the SYSENTER MSRs, and if the guest executes SYSENTER in compatibility mode, it will fails. Detect this condition and #UD instead, like the spec says. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)Julian Stecklina
If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1." Signed-off-by: Julian Stecklina <js@alien8.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: SVM: comment nested paging and virtualization module parametersDavidlohr Bueso
Also use true instead of 1 for enabling by default. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from rmap_next()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm_pte_chainTakuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hackAvi Kivity
Currently we treat MOVSX/MOVZX with a byte source as a byte instruction, and change the destination operand size with a hack. Change it to be a word instruction, so the destination receives its natural size, and change the source to be SrcMem8. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: x86 emulator: add 8-bit memory operandsAvi Kivity
Useful for MOVSX/MOVZX. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: PPC: refer to paravirt docs in header fileScott Wood
Instead of keeping separate copies of struct kvm_vcpu_arch_shared (one in the code, one in the docs) that inevitably fail to be kept in sync (already sr[] is missing from the doc version), just point to the header file as the source of documentation on the contents of the magic page. Signed-off-by: Scott Wood <scottwood@freescale.com> Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Rename MMIO register identifiersAlexander Graf
We need the KVM_REG namespace for generic register settings now, so let's rename the existing users to something different, enabling us to reuse the namespace for more visible interfaces. While at it, also move these private constants to a private header. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Move kvm_vcpu_ioctl_[gs]et_one_reg down to platform-specific codePaul Mackerras
This moves the get/set_one_reg implementation down from powerpc.c into booke.c, book3s_pr.c and book3s_hv.c. This avoids #ifdefs in C code, but more importantly, it fixes a bug on Book3s HV where we were accessing beyond the end of the kvm_vcpu struct (via the to_book3s() macro) and corrupting memory, causing random crashes and file corruption. On Book3s HV we only accept setting the HIOR to zero, since the guest runs in supervisor mode and its vectors are never offset from zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> [agraf update to apply on top of changed ONE_REG patches] Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Add support for explicit HIOR settingAlexander Graf
Until now, we always set HIOR based on the PVR, but this is just wrong. Instead, we should be setting HIOR explicitly, so user space can decide what the initial HIOR value is - just like on real hardware. We keep the old PVR based way around for backwards compatibility, but once user space uses the SET_ONE_REG based method, we drop the PVR logic. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Add generic single register ioctlsAlexander Graf
Right now we transfer a static struct every time we want to get or set registers. Unfortunately, over time we realize that there are more of these than we thought of before and the extensibility and flexibility of transferring a full struct every time is limited. So this is a new approach to the problem. With these new ioctls, we can get and set a single register that is identified by an ID. This allows for very precise and limited transmittal of data. When we later realize that it's a better idea to shove over multiple registers at once, we can reuse most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS interface. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Use the vcpu kmem_cache when allocating new VCPUsSasha Levin
Currently the code kzalloc()s new VCPUs instead of using the kmem_cache which is created when KVM is initialized. Modify it to allocate VCPUs from that kmem_cache. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: booke: Add booke206 TLB traceLiu Yu
The existing kvm_stlb_write/kvm_gtlb_write were a poor match for the e500/book3e MMU -- mas1 was passed as "tid", mas2 was limited to "unsigned int" which will be a problem on 64-bit, mas3/7 got split up rather than treated as a single 64-bit word, etc. Signed-off-by: Liu Yu <yu.liu@freescale.com> [scottwood@freescale.com: made mas2 64-bit, and added mas8 init] Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Book3s HV: Implement get_dirty_log using hardware changed bitPaul Mackerras
This changes the implementation of kvm_vm_ioctl_get_dirty_log() for Book3s HV guests to use the hardware C (changed) bits in the guest hashed page table. Since this makes the implementation quite different from the Book3s PR case, this moves the existing implementation from book3s.c to book3s_pr.c and creates a new implementation in book3s_hv.c. That implementation calls kvmppc_hv_get_dirty_log() to do the actual work by calling kvm_test_clear_dirty on each page. It iterates over the HPTEs, clearing the C bit if set, and returns 1 if any C bit was set (including the saved C bit in the rmap entry). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Book3S HV: Use the hardware referenced bit for kvm_age_hvaPaul Mackerras
This uses the host view of the hardware R (referenced) bit to speed up kvm_age_hva() and kvm_test_age_hva(). Instead of removing all the relevant HPTEs in kvm_age_hva(), we now just reset their R bits if set. Also, kvm_test_age_hva() now scans the relevant HPTEs to see if any of them have R set. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Book3s HV: Maintain separate guest and host views of R and C bitsPaul Mackerras
This allows both the guest and the host to use the referenced (R) and changed (C) bits in the guest hashed page table. The guest has a view of R and C that is maintained in the guest_rpte field of the revmap entry for the HPTE, and the host has a view that is maintained in the rmap entry for the associated gfn. Both view are updated from the guest HPT. If a bit (R or C) is zero in either view, it will be initially set to zero in the HPTE (or HPTEs), until set to 1 by hardware. When an HPTE is removed for any reason, the R and C bits from the HPTE are ORed into both views. We have to be careful to read the R and C bits from the HPTE after invalidating it, but before unlocking it, in case of any late updates by the hardware. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Book3S HV: Keep HPTE locked when invalidatingPaul Mackerras
This reworks the implementations of the H_REMOVE and H_BULK_REMOVE hcalls to make sure that we keep the HPTE locked and in the reverse- mapping chain until we have finished invalidating it. Previously we would remove it from the chain and unlock it before invalidating it, leaving a tiny window when the guest could access the page even though we believe we have removed it from the guest (e.g., kvm_unmap_hva() has been called for the page and has found no HPTEs in the chain). In addition, we'll need this for future patches where we will need to read the R and C bits in the HPTE after invalidating it. Doing this required restructuring kvmppc_h_bulk_remove() substantially. Since we want to batch up the tlbies, we now need to keep several HPTEs locked simultaneously. In order to avoid possible deadlocks, we don't spin on the HPTE bitlock for any except the first HPTE in a batch. If we can't acquire the HPTE bitlock for the second or subsequent HPTE, we terminate the batch at that point, do the tlbies that we have accumulated so far, unlock those HPTEs, and then start a new batch to do the remaining invalidations. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Add KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUSMatt Evans
PPC KVM lacks these two capabilities, and as such a userland system must assume a max of 4 VCPUs (following api.txt). With these, a userland can determine a more realistic limit. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Fix vcpu_create dereference before validity check.Matt Evans
Fix usage of vcpu struct before check that it's actually valid. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Allow for read-only pages backing a Book3S HV guestPaul Mackerras
With this, if a guest does an H_ENTER with a read/write HPTE on a page which is currently read-only, we make the actual HPTE inserted be a read-only version of the HPTE. We now intercept protection faults as well as HPTE not found faults, and for a protection fault we work out whether it should be reflected to the guest (e.g. because the guest HPTE didn't allow write access to usermode) or handled by switching to kernel context and calling kvmppc_book3s_hv_page_fault, which will then request write access to the page and update the actual HPTE. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Implement MMU notifiers for Book3S HV guestsPaul Mackerras
This adds the infrastructure to enable us to page out pages underneath a Book3S HV guest, on processors that support virtualized partition memory, that is, POWER7. Instead of pinning all the guest's pages, we now look in the host userspace Linux page tables to find the mapping for a given guest page. Then, if the userspace Linux PTE gets invalidated, kvm_unmap_hva() gets called for that address, and we replace all the guest HPTEs that refer to that page with absent HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit set, which will cause an HDSI when the guest tries to access them. Finally, the page fault handler is extended to reinstantiate the guest HPTE when the guest tries to access a page which has been paged out. Since we can't intercept the guest DSI and ISI interrupts on PPC970, we still have to pin all the guest pages on PPC970. We have a new flag, kvm->arch.using_mmu_notifiers, that indicates whether we can page guest pages out. If it is not set, the MMU notifier callbacks do nothing and everything operates as before. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Implement MMIO emulation support for Book3S HV guestsPaul Mackerras
This provides the low-level support for MMIO emulation in Book3S HV guests. When the guest tries to map a page which is not covered by any memslot, that page is taken to be an MMIO emulation page. Instead of inserting a valid HPTE, we insert an HPTE that has the valid bit clear but another hypervisor software-use bit set, which we call HPTE_V_ABSENT, to indicate that this is an absent page. An absent page is treated much like a valid page as far as guest hcalls (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that an absent HPTE doesn't need to be invalidated with tlbie since it was never valid as far as the hardware is concerned. When the guest accesses a page for which there is an absent HPTE, it will take a hypervisor data storage interrupt (HDSI) since we now set the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults looks up the hash table and if it finds an absent HPTE mapping the requested virtual address, will switch to kernel mode and handle the fault in kvmppc_book3s_hv_page_fault(), which at present just calls kvmppc_hv_emulate_mmio() to set up the MMIO emulation. This is based on an earlier patch by Benjamin Herrenschmidt, but since heavily reworked. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfnPaul Mackerras
This expands the reverse mapping array to contain two links for each HPTE which are used to link together HPTEs that correspond to the same guest logical page. Each circular list of HPTEs is pointed to by the rmap array entry for the guest logical page, pointed to by the relevant memslot. Links are 32-bit HPT entry indexes rather than full 64-bit pointers, to save space. We use 3 of the remaining 32 bits in the rmap array entries as a lock bit, a referenced bit and a present bit (the present bit is needed since HPTE index 0 is valid). The bit lock for the rmap chain nests inside the HPTE lock bit. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>