summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2011-01-19x86: Unify "numa=" command line option handlingJan Beulich
In order to be able to suppress the use of SRAT tables that 32-bit Linux can't deal with (in one case known to lead to a non-bootable system, unless disabling ACPI altogether), move the "numa=" option handling to common code. Signed-off-by: Jan Beulich <jbeulich@novell.com> Reviewed-by: Thomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Renninger <trenn@suse.de> LKML-Reference: <4D36B581020000780002D0FF@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-14Merge branch 'idle-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer intel_idle: open broadcast clock event cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW cpuidle: delete NOP CPUIDLE_FLAG_POLL ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL ACPI, intel_idle: Cleanup idle= internal variables cpuidle: Make cpuidle_enable_device() call poll_idle_init() intel_idle: update Sandy Bridge core C-state residency targets
2011-01-14Merge branch 'stable/gntdev' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/p2m: Fix module linking error. xen p2m: clear the old pte when adding a page to m2p_override xen gntdev: use gnttab_map_refs and gnttab_unmap_refs xen: introduce gnttab_map_refs and gnttab_unmap_refs xen p2m: transparently change the p2m mappings in the m2p override xen/gntdev: Fix circular locking dependency xen/gntdev: stop using "token" argument xen: gntdev: move use of GNTMAP_contains_pte next to the map_op xen: add m2p override mechanism xen: move p2m handling to separate file xen/gntdev: add VM_PFNMAP to vma xen/gntdev: allow usermode to map granted pages xen: define gnttab_set_map_op/unmap_op Fix up trivial conflict in drivers/xen/Kconfig
2011-01-14thp: mmu_notifier_test_youngAndrea Arcangeli
For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: don't allow transparent hugepage support without PSEAndrea Arcangeli
Archs implementing Transparent Hugepage Support must implement a function called has_transparent_hugepage to be sure the virtual or physical CPU supports Transparent Hugepages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: add pmd_modifyJohannes Weiner
Add pmd_modify() for use with mprotect() on huge pmds. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: add x86 32bit supportJohannes Weiner
Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: transparent hugepage coreAndrea Arcangeli
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL*1024*1024*1024) int main() { char *p = malloc(SIZE), *p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: add pmd mangling functions to x86Andrea Arcangeli
Add needed pmd mangling functions with symmetry with their pte counterparts. pmdp_splitting_flush() is the only new addition on the pmd_ methods and it's needed to serialize the VM against split_huge_page. It simply atomically sets the splitting bit in a similar way pmdp_clear_flush_young atomically clears the accessed bit. pmdp_splitting_flush() also has to flush the tlb to make it effective against gup_fast, but it wouldn't really require to flush the tlb too. Just the tlb flush is the simplest operation we can invoke to serialize pmdp_splitting_flush() against gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: special pmd_trans_* functionsAndrea Arcangeli
These returns 0 at compile time when the config option is disabled, to allow gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions have to be visible to gcc but they won't be required at link time and huge_memory.o can be not built at all). _PAGE_BIT_UNUSED1 is never used for pmd, only on pte. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: no paravirt version of pmd opsAndrea Arcangeli
No paravirt version of set_pmd_at/pmd_update/pmd_update_defer. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: add pmd paravirt opsAndrea Arcangeli
Paravirt ops pmd_update/pmd_update_defer/pmd_set_at. Not all might be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs pmd_update_defer), but this is to keep full simmetry with pte paravirt ops, which looks cleaner and simpler from a common code POV. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14thp: add native_set_pmd_atAndrea Arcangeli
Used by paravirt and not paravirt set_pmd_at. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13Merge branch 'x86-olpc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-olpc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, olpc: Speed up device tree creation during boot x86, olpc: Add OLPC device-tree support x86, of: Define irq functions to allow drivers/of/* to build on x86
2011-01-13Merge branch 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
* 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits) KVM: Initialize fpu state in preemptible context KVM: VMX: when entering real mode align segment base to 16 bytes KVM: MMU: handle 'map_writable' in set_spte() function KVM: MMU: audit: allow audit more guests at the same time KVM: Fetch guest cr3 from hardware on demand KVM: Replace reads of vcpu->arch.cr3 by an accessor KVM: MMU: only write protect mappings at pagetable level KVM: VMX: Correct asm constraint in vmcs_load()/vmcs_clear() KVM: MMU: Initialize base_role for tdp mmus KVM: VMX: Optimize atomic EFER load KVM: VMX: Add definitions for more vm entry/exit control bits KVM: SVM: copy instruction bytes from VMCB KVM: SVM: implement enhanced INVLPG intercept KVM: SVM: enhance mov DR intercept handler KVM: SVM: enhance MOV CR intercept handler KVM: SVM: add new SVM feature bit names KVM: cleanup emulate_instruction KVM: move complete_insn_gp() into x86.c KVM: x86: fix CR8 handling KVM guest: Fix kvm clock initialization when it's configured out ...
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13x86: support XZ-compressed kernelLasse Collin
This integrates the XZ decompression code to the x86 pre-boot code. mkpiggy.c is updated to reserve about 32 KiB more buffer safety margin for kernel decompression. It is done unconditionally for all decompressors to keep the code simpler. The XZ decompressor needs around 30 KiB of heap, so the heap size is increased to 32 KiB on both x86-32 and x86-64. Documentation/x86/boot.txt is updated to list the XZ magic number. With the x86 BCJ filter in XZ, XZ-compressed x86 kernel tends to be a few percent smaller than the equivalent LZMA-compressed kernel. Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13drivers/staging/olpc_dcon: convert to new cs5535 gpio APIAndres Salomon
Drop the old geode_gpio crud, as well as the raw outl() calls; instead, use the Linux GPIO API where possible, and the cs5535_gpio API in other places. Note that we don't actually clean up the driver properly yet (once loaded, it always remains loaded). That'll come later.. This patch is necessary for building the driver. Signed-off-by: Andres Salomon <dilinger@queued.net> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-12Merge branch 'linus' into idle-testLen Brown
2011-01-12ACPI, intel_idle: Cleanup idle= internal variablesThomas Renninger
Having four variables for the same thing: idle_halt, idle_nomwait, force_mwait and boot_option_idle_overrides is rather confusing and unnecessary complex. if idle= boot param is passed, only set up one variable: boot_option_idle_overrides Introduces following functional changes/fixes: - intel_idle driver does not register if any idle=xy boot param is passed. - processor_idle.c will also not register a cpuidle driver and get active if idle=halt is passed. Before a cpuidle driver with one (C1, halt) state got registered Now the default_idle function will be used which finally uses the same idle call to enter sleep state (safe_halt()), but without registering a whole cpuidle driver. That means idle= param will always avoid cpuidle drivers to register with one exception (same behavior as before): idle=nomwait may still register acpi_idle cpuidle driver, but C1 will not use mwait, but hlt. This can be a workaround for IO based deeper sleep states where C1 mwait causes problems. Signed-off-by: Thomas Renninger <trenn@suse.de> cc: x86@kernel.org Signed-off-by: Len Brown <len.brown@intel.com>
2011-01-12KVM: MMU: audit: allow audit more guests at the same timeXiao Guangrong
It only allows to audit one guest in the system since: - 'audit_point' is a glob variable - mmu_audit_disable() is called in kvm_mmu_destroy(), so audit is disabled after a guest exited this patch fix those issues then allow to audit more guests at the same time Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: Fetch guest cr3 from hardware on demandAvi Kivity
Instead of syncing the guest cr3 every exit, which is expensince on vmx with ept enabled, sync it only on demand. [sheng: fix incorrect cr3 seen by Windows XP] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: VMX: Add definitions for more vm entry/exit control bitsAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: SVM: copy instruction bytes from VMCBAndre Przywara
In case of a nested page fault or an intercepted #PF newer SVM implementations provide a copy of the faulting instruction bytes in the VMCB. Use these bytes to feed the instruction emulator and avoid the costly guest instruction fetch in this case. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: SVM: enhance MOV CR intercept handlerAndre Przywara
Newer SVM implementations provide the GPR number in the VMCB, so that the emulation path is no longer necesarry to handle CR register access intercepts. Implement the handling in svm.c and use it when the info is provided. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: cleanup emulate_instructionAndre Przywara
emulate_instruction had many callers, but only one used all parameters. One parameter was unused, another one is now hidden by a wrapper function (required for a future addition anyway), so most callers use now a shorter parameter list. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: move complete_insn_gp() into x86.cAndre Przywara
move the complete_insn_gp() helper function out of the VMX part into the generic x86 part to make it usable by SVM. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: x86: fix CR8 handlingAndre Przywara
The handling of CR8 writes in KVM is currently somewhat cumbersome. This patch makes it look like the other CR register handlers and fixes a possible issue in VMX, where the RIP would be incremented despite an injected #GP. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: SVM: Add xsetbv interceptJoerg Roedel
This patch implements the xsetbv intercept to the AMD part of KVM. This makes AVX usable in a save way for the guest on AVX capable AMD hardware. The patch is tested by using AVX in the guest and host in parallel and checking for data corruption. I also used the KVM xsave unit-tests and they all pass. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: VMX: add module parameter to avoid trapping HLT instructions (v5)Anthony Liguori
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: SVM: Implement Flush-By-Asid featureJoerg Roedel
This patch adds the new flush-by-asid of upcoming AMD processors to the KVM-AMD module. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: MMU: retry #PF for softmmuXiao Guangrong
Retry #PF for softmmu only when the current vcpu has the same cr3 as the time when #PF occurs Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: MMU: rename 'no_apf' to 'prefault'Xiao Guangrong
It's the speculative path if 'no_apf = 1' and we will specially handle this speculative path in the later patch, so 'prefault' is better to fit the sense. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: SVM: Add clean-bits infrastructure codeRoedel, Joerg
This patch adds the infrastructure for the implementation of the individual clean-bits. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: Don't spin on virt instruction faults during rebootAvi Kivity
Since vmx blocks INIT signals, we disable virtualization extensions during reboot. This leads to virtualization instructions faulting; we trap these faults and spin while the reboot continues. Unfortunately spinning on a non-preemptible kernel may block a task that reboot depends on; this causes the reboot to hang. Fix by skipping over the instruction and hoping for the best. Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: SVM: Add manipulation functions for DRx interceptsJoerg Roedel
This patch wraps changes to the DRx intercepts of SVM into seperate functions to abstract nested-svm better and prepare the implementation of the vmcb-clean-bits feature. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: SVM: Add manipulation functions for CRx interceptsRoedel, Joerg
This patch wraps changes to the CRx intercepts of SVM into seperate functions to abstract nested-svm better and prepare the implementation of the vmcb-clean-bits feature. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: X86: Introduce generic guest-mode representationJoerg Roedel
This patch introduces a generic representation of guest-mode fpr a vcpu. This currently only exists in the SVM code. Having this representation generic will help making the non-svm code aware of nesting when this is necessary. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: Pull extra page fault information into struct x86_exceptionAvi Kivity
Currently page fault cr2 and nesting infomation are carried outside the fault data structure. Instead they are placed in the vcpu struct, which results in confusion as global variables are manipulated instead of passing parameters. Fix this issue by adding address and nested fields to struct x86_exception, so this struct can carry all information associated with a fault. Signed-off-by: Avi Kivity <avi@redhat.com> Tested-by: Joerg Roedel <joerg.roedel@amd.com> Tested-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: Push struct x86_exception info the various gva_to_gpa variantsAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: x86 emulator: make emulator memory callbacks return full exceptionAvi Kivity
This way, they can return #GP, not just #PF. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: x86 emulator: introduce struct x86_exception to communicate faultsAvi Kivity
Introduce a structure that can contain an exception to be passed back to main kvm code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: MMU: remove 'clear_unsync' parameterXiao Guangrong
Remove it since we can judge it by using sp->unsync Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: Add instruction-set-specific exit qualifications to kvm_exit traceAvi Kivity
The exit reason alone is insufficient to understand exactly why an exit occured; add ISA-specific trace parameters for additional information. Because fetching these parameters is expensive on vmx, and because these parameters are fetched even if tracing is disabled, we fetch the parameters via a callback instead of as traditional trace arguments. Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: x86 emulator: preserve an operand's segment identityAvi Kivity
Currently the x86 emulator converts the segment register associated with an operand into a segment base which is added into the operand address. This loss of information results in us not doing segment limit checks properly. Replace struct operand's addr.mem field by a segmented_address structure which holds both the effetive address and segment. This will allow us to do the limit check at the point of access. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: MMU: fix apf prefault if nested guest is enabledXiao Guangrong
If apf is generated in L2 guest and is completed in L1 guest, it will prefault this apf in L1 guest's mmu context. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: remove unused function declarationXiao Guangrong
Remove the declaration of kvm_mmu_set_base_ptes() Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12KVM: handle exit due to INVD in VMXGleb Natapov
Currently the exit is unhandled, so guest halts with error if it tries to execute INVD instruction. Call into emulator when INVD instruction is executed by a guest instead. This instruction is not needed by ordinary guests, but firmware (like OpenBIOS) use it and fail. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: x86: Add missing inline tag to kvm_read_and_reset_pf_reasonJan Kiszka
May otherwise generates build warnings about unused kvm_read_and_reset_pf_reason if included without CONFIG_KVM_GUEST enabled. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12KVM: Let host know whether the guest can handle async PF in non-userspace ↵Gleb Natapov
context. If guest can detect that it runs in non-preemptable context it can handle async PFs at any time, so let host know that it can send async PF even if guest cpu is not in userspace. Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>