path: root/arch/x86/kernel
Age  Commit message  Author
2009-07-09  linker script: unify usage of discard definition  (Tejun Heo)

Discarded sections in different archs share some commonality but have considerable differences. This led to the linker script for each arch implementing its own /DISCARD/ definition, which makes maintenance tedious and adding new entries error-prone.

This patch moves the discard definitions in all linker scripts to the end of the linker script and uses the common DISCARDS macro. As ld uses the first matching section definition, archs can include default discarded sections by including them earlier in the linker script.

ia64 is notable because it first throws away some ia64-specific subsections and then includes the rest of the sections in the final image, so those sections must be discarded before the inclusion.

defconfig compile tested for x86, x86-64, powerpc, powerpc64, ia64, alpha, sparc, sparc64 and s390. Michal Simek tested microblaze.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: linux-arch@vger.kernel.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: microblaze-uclinux@itee.uq.edu.au
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Tony Luck <tony.luck@intel.com>
2009-07-03  percpu: teach large page allocator about NUMA  (Tejun Heo)

The large page first chunk allocator is primarily used for NUMA machines; however, its NUMA handling is extremely simplistic. Regardless of their proximity, each cpu is put into a separate large page just to return most of the allocated space back, wasting a large amount of vmalloc space and increasing the cache footprint.

This patch teaches NUMA details to the large page allocator. Given processor proximity information, pcpu_lpage_build_unit_map() will find a fitting cpu -> unit mapping in which cpus within LOCAL_DISTANCE share the same large page and not too much virtual address space is wasted.

This greatly reduces the unit and thus chunk size and wastes much less address space for the first chunk. For example, on a 4/4 NUMA machine, the original code occupied 16MB of virtual space for the first chunk while the new code uses only 4MB - one 2MB page for each node.

[ Impact: much better space efficiency on NUMA machines ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Miller <davem@davemloft.net>
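A rough sketch of the proximity grouping described above, with illustrative names (this is not the actual pcpu_lpage_build_unit_map() code, which also balances wasted space): cpus whose nodes are within LOCAL_DISTANCE of an already-grouped cpu join that cpu's large page; everyone else starts a new group.

    #include <linux/cpumask.h>
    #include <linux/topology.h>

    static int __init sketch_build_group_map(int *group_map)
    {
            int cpu, tcpu, nr_groups = 0;

            for_each_possible_cpu(cpu) {
                    group_map[cpu] = -1;
                    /* reuse the group of a "close" lower-numbered cpu */
                    for_each_possible_cpu(tcpu) {
                            if (tcpu >= cpu)
                                    break;
                            if (node_distance(cpu_to_node(cpu), cpu_to_node(tcpu))
                                <= LOCAL_DISTANCE) {
                                    group_map[cpu] = group_map[tcpu];
                                    break;
                            }
                    }
                    if (group_map[cpu] < 0)
                            group_map[cpu] = nr_groups++;
            }
            return nr_groups;       /* number of large pages needed */
    }

On the 4/4 machine from the changelog this yields four groups, i.e. one 2MB page per node.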
2009-07-03  x86,percpu: generalize lpage first chunk allocator  (Tejun Heo)

Generalize and move the x86 setup_pcpu_lpage() into pcpu_lpage_first_chunk(). setup_pcpu_lpage() is now a simple wrapper around the generalized version. Other than taking size parameters and using arch-supplied callbacks to allocate/free/map memory, pcpu_lpage_first_chunk() is identical to the original implementation. This simplifies arch code and will help convert more archs to the dynamic percpu allocator.

While at it, factor out pcpu_calc_fc_sizes() which is common to pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-03  x86,percpu: generalize 4k first chunk allocator  (Tejun Heo)

Generalize and move the x86 setup_pcpu_4k() into pcpu_4k_first_chunk(). setup_pcpu_4k() is now a simple wrapper around the generalized version. Other than taking size parameters and using arch-supplied callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical to the original implementation. This simplifies arch code and will help convert more archs to the dynamic percpu allocator.

While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for consistency.

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-03  percpu: drop @unit_size from embed first chunk allocator  (Tejun Heo)

The only extra feature @unit_size provides is making dead space at the end of the first chunk, which doesn't have any valid use case. Drop the parameter. This will increase consistency with the generalized 4k allocator.

James Bottomley spotted a missing conversion for the default setup_per_cpu_areas(), which caused build breakage on all archs which use it.

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-03  Merge branch 'master' into for-next  (Tejun Heo)

Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and the alpha build fix changes. As alpha in the percpu tree uses the 'weak' attribute instead of inline assembly, there's no need for the __used attribute.

Conflicts:
	arch/alpha/include/asm/percpu.h
	arch/mn10300/kernel/vmlinux.lds.S
	include/linux/percpu-defs.h
2009-07-02  Merge git://git.infradead.org/iommu-2.6  (Linus Torvalds)

* git://git.infradead.org/iommu-2.6: (38 commits)
  intel-iommu: Don't keep freeing page zero in dma_pte_free_pagetable()
  intel-iommu: Introduce first_pte_in_page() to simplify PTE-setting loops
  intel-iommu: Use cmpxchg64_local() for setting PTEs
  intel-iommu: Warn about unmatched unmap requests
  intel-iommu: Kill superfluous mapping_lock
  intel-iommu: Ensure that PTE writes are 64-bit atomic, even on i386
  intel-iommu: Make iommu=pt work on i386 too
  intel-iommu: Performance improvement for dma_pte_free_pagetable()
  intel-iommu: Don't free too much in dma_pte_free_pagetable()
  intel-iommu: dump mappings but don't die on pte already set
  intel-iommu: Combine domain_pfn_mapping() and domain_sg_mapping()
  intel-iommu: Introduce domain_sg_mapping() to speed up intel_map_sg()
  intel-iommu: Simplify __intel_alloc_iova()
  intel-iommu: Performance improvement for domain_pfn_mapping()
  intel-iommu: Performance improvement for dma_pte_clear_range()
  intel-iommu: Clean up iommu_domain_identity_map()
  intel-iommu: Remove last use of PHYSICAL_PAGE_MASK, for reserving PCI BARs
  intel-iommu: Make iommu_flush_iotlb_psi() take pfn as argument
  intel-iommu: Change aligned_size() to aligned_nrpages()
  intel-iommu: Clean up intel_map_sg(), remove domain_page_mapping()
  ...
2009-07-02  x86: add boundary check for 32bit res before expand e820 resource to alignment  (Yinghai Lu)

Fix a hang with HIGHMEM_64G and a 32bit resource. According to hpa and Linus, use (resource_size_t)-1 to fend off big ranges.

Analyzed by hpa.

Reported-and-tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-01  intel-iommu: Make iommu=pt work on i386 too  (David Woodhouse)

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-07-01  Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (47 commits)
  perf report: Add --symbols parameter
  perf report: Add --comms parameter
  perf report: Add --dsos parameter
  perf_counter tools: Adjust only prelinked symbol's addresses
  perf_counter: Provide a way to enable counters on exec
  perf_counter tools: Reduce perf stat measurement overhead/skew
  perf stat: Use percentages for scaling output
  perf_counter, x86: Update x86_pmu after WARN()
  perf stat: Micro-optimize the code: memcpy is only required if no event is selected and !null_run
  perf stat: Improve output
  perf stat: Fix multi-run stats
  perf stat: Add -n/--null option to run without counters
  perf_counter tools: Remove dead code
  perf_counter: Complete counter swap
  perf report: Print sorted callchains per histogram entries
  perf_counter tools: Prepare a small callchain framework
  perf record: Fix unhandled io return value
  perf_counter tools: Add alias for 'l1d' and 'l1i'
  perf-report: Add bare minimum PERF_EVENT_READ parsing
  perf-report: Add modes for inherited stats and no-samples
  ...
2009-06-29  Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  Revert "x86: cap iomem_resource to addressable physical memory"
2009-06-29  perf_counter, x86: Update x86_pmu after WARN()  (Yinghai Lu)

The printout should read the value before it is changed.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4A487017.4090007@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-28  Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, delay: tsc based udelay should have rdtsc_barrier
  x86, setup: correct include file in <asm/boot.h>
  x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
  x86, mce: percpu mcheck_timer should be pinned
  x86: Add sysctl to allow panic on IOCK NMI error
  x86: Fix uv bau sending buffer initialization
  x86, mce: Fix mce resume on 32bit
  x86: Move init_gbpages() to setup_arch()
  x86: ensure percpu lpage doesn't consume too much vmalloc space
  x86: implement percpu_alloc kernel parameter
  x86: fix pageattr handling for lpage percpu allocator and re-enable it
  x86: reorganize cpa_process_alias()
  x86: prepare setup_pcpu_lpage() for pageattr fix
  x86: rename remap percpu first chunk allocator to lpage
  x86: fix duplicate free in setup_pcpu_remap() failure path
  percpu: fix too lazy vunmap cache flushing
  x86: Set cpu_llc_id on AMD CPUs
2009-06-28  Revert "x86: cap iomem_resource to addressable physical memory"  (H. Peter Anvin)

This reverts commit 95ee14e4379c5e19c0897c872350570402014742.

Mikael Petterson <mikepe@it.uu.se> reported that at least one of his systems will not boot as a result. We have ruled out the detection algorithm malfunctioning, so it is not a matter of producing the incorrect bitmasks; rather, something in the application of them fails.

Revert the commit until we can root-cause and correct this problem.

-stable team: this means the underlying commit should be rejected.

Reported-and-isolated-by: Mikael Petterson <mikpe@it.uu.se>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <200906261559.n5QFxJH8027336@pilspetsen.it.uu.se>
Cc: stable@kernel.org
Cc: Grant Grundler <grundler@parisc-linux.org>
2009-06-25  x86, mce: percpu mcheck_timer should be pinned  (Hidetoshi Seto)

With CONFIG_NO_HZ + CONFIG_SMP, a timer added via add_timer() might be migrated to another cpu. Use add_timer_on() instead. Avoids the following failure:

Maciej Rutecki wrote:
> > After normal boot I try:
> >
> > echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
> >
> > I found this in dmesg:
> >
> > [  141.704025] ------------[ cut here ]------------
> > [  141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> > mcheck_timer+0xf5/0x100()

Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
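A minimal sketch of the fix pattern, with illustrative names ('t', 'next_interval'); the point is that add_timer_on() pins the timer to the cpu whose MCE state it checks:

    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #include <linux/smp.h>

    static void sketch_restart_mce_timer(struct timer_list *t,
                                         unsigned long next_interval)
    {
            t->expires = round_jiffies(jiffies + next_interval);
            /* was: add_timer(t) - could end up on any cpu under NO_HZ+SMP */
            add_timer_on(t, smp_processor_id());
    }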
2009-06-25  x86: Add sysctl to allow panic on IOCK NMI error  (Kurt Garloff)

This patch introduces a new sysctl:

	/proc/sys/kernel/panic_on_io_nmi

which defaults to 0 (off).

When enabled, the kernel panics when it receives an NMI caused by an IO error. An IO-error-triggered NMI indicates a serious system condition, which could result in IO data corruption. Rather than continuing, panicking and dumping might be a better choice, so one can figure out what's causing the IO error. This could be especially important for companies running IO-intensive applications where corruption must be avoided, e.g. a bank's databases.

[ SuSE has been shipping it for a while, it was done at the request of a large database vendor, for their users. ]

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
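A sketch of the knob's effect in the IOCK NMI path; the variable name follows the sysctl above, while the function name and panic text are illustrative:

    #include <linux/kernel.h>

    int panic_on_io_nmi;    /* /proc/sys/kernel/panic_on_io_nmi, default 0 */

    static void sketch_io_check_error(unsigned char reason)
    {
            if (panic_on_io_nmi)
                    panic("NMI IOCK error: Not continuing");

            /* otherwise: log the NMI and try to clear the IOCHK bit */
    }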
2009-06-25  perf_counter, x86: Add mmap counter read support  (Peter Zijlstra)

Update the mmap control page with the needed information to use the userspace RDPMC instruction for self monitoring.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-24  Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6  (Linus Torvalds)

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (72 commits)
  asus-laptop: remove EXPERIMENTAL dependency
  asus-laptop: use pr_fmt and pr_<level>
  eeepc-laptop: cpufv updates
  eeepc-laptop: sync eeepc-laptop with asus_acpi
  asus_acpi: Deprecate in favor of asus-laptop
  acpi4asus: update MAINTAINER and KConfig links
  asus-laptop: platform dev as parent for led and backlight
  eeepc-laptop: enable camera by default
  ACPI: Rename ACPI processor device bus ID
  acerhdf: Acer Aspire One fan control
  ACPI: video: DMI workaround broken Acer 7720 BIOS enabling display brightness
  ACPI: run ACPI device hot removal in kacpi_hotplug_wq
  ACPI: Add the reference count to avoid unloading ACPI video bus twice
  ACPI: DMI to disable Vista compatibility on some Sony laptops
  ACPI: fix a deadlock in hotplug case
  Show the physical device node of backlight class device.
  ACPI: pdc init related memory leak with physical CPU hotplug
  ACPI: pci_root: remove unused dev/fn information
  ACPI: pci_root: simplify list traversals
  ACPI: pci_root: use driver data rather than list lookup
  ...
2009-06-24  x86: Fix uv bau sending buffer initialization  (Cliff Wickman)

The initialization of the UV Broadcast Assist Unit's sending buffers was making an invalid assumption about the initialization of an MMR that defines its address. The BIOS will not be providing that MMR. So uv_activation_descriptor_init() should unconditionally set it.

Tested on UV simulator.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org> # for v2.6.30.x
LKML-Reference: <E1MJTfj-0005i1-W8@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-24  perf_counter, x86: Set global control MSR correctly  (Yong Wang)

Previous code made the assumption that the power-on value of the global control MSR has enabled all fixed and general-purpose counters properly. However, this is not the case for certain Intel processors, such as Atom - and it might also be firmware dependent.

Each enable bit in IA32_PERF_GLOBAL_CTRL is AND'ed with the enable bits for all privilege levels in the respective IA32_PERFEVTSELx or IA32_PERF_FIXED_CTR_CTRL MSRs to start/stop the counting of the respective counters. Counting is enabled if the AND'ed result is true; counting is disabled when the result is false.

The end result is that all fixed counters are always disabled on Atom processors because the assumption is just invalid. Fix this by not initializing the ctrl-mask out of the global MSR, but setting it to perf_counter_mask.

Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090624021324.GA2788@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
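A sketch of the fix, assuming illustrative names: build the enable mask from the counters the kernel knows about, instead of trusting the power-on contents of MSR_CORE_PERF_GLOBAL_CTRL.

    #include <linux/types.h>

    static u64 sketch_perf_counter_mask(int num_counters, int num_counters_fixed)
    {
            u64 mask;

            /* general-purpose counters occupy bits 0..num_counters-1 */
            mask = (1ULL << num_counters) - 1;
            /* fixed-function counters start at bit 32 in GLOBAL_CTRL */
            mask |= ((1ULL << num_counters_fixed) - 1) << 32;

            /* was: rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ...) at init time */
            return mask;
    }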
2009-06-24  percpu: clean up percpu variable definitions  (Tejun Heo)

Percpu variable definition is about to be updated such that all percpu symbols including the static ones must be unique. Update percpu variable definitions accordingly.

* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval

[ Impact: percpu usage cleanups, no duplicate static percpu var names ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
2009-06-24  percpu: cleanup percpu array definitions  (Tejun Heo)

Currently, the following three different ways to define percpu arrays are in use.

1. DEFINE_PER_CPU(elem_type[array_len], array_name);
2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
3. DEFINE_PER_CPU(elem_type, array_name)[array_len];

Unify to #1 which correctly separates the roles of the two parameters and thus allows more flexibility in the way percpu variables are defined.

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
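For example, after this cleanup a percpu array carries the length inside the type parameter; NR_SAMPLES and sample_buf are illustrative names:

    #include <linux/percpu.h>

    #define NR_SAMPLES 16

    /* form #1: the element type carries the array length */
    static DEFINE_PER_CPU(unsigned long [NR_SAMPLES], sample_buf);

    /* previously also seen as (now discouraged):
     *   DEFINE_PER_CPU(unsigned long, sample_buf[NR_SAMPLES]);
     *   DEFINE_PER_CPU(unsigned long, sample_buf)[NR_SAMPLES];
     */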
2009-06-24  Merge branches 'acerhdf', 'acpi-pci-bind', 'bjorn-pci-root', 'bugzilla-12904', 'bugzilla-13121', 'bugzilla-13396', 'bugzilla-13533', 'bugzilla-13612', 'c3_lock', 'hid-cleanups', 'misc-2.6.31', 'pdc-leak-fix', 'pnpacpi', 'power_nocheck', 'thinkpad_acpi', 'video' and 'wmi' into release  (Len Brown)
2009-06-23  Intel-IOMMU, intr-remap: source-id checking  (Weidong Han)

To support domain-isolation usages, the platform hardware must be capable of uniquely identifying the requestor (source-id) for each interrupt message. Without source-id checking for interrupt remapping, a rogue guest/VM with assigned devices can launch interrupt attacks to bring down another guest/VM or the VMM itself.

This patch adds source-id checking for interrupt remapping, and thereby really isolates interrupts for guests/VMs with assigned devices. Because the PCI subsystem is not yet initialized when IOAPIC entries are set up, use read_pci_config_byte to access PCI config space directly.

Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-06-23  x86: Move init_gbpages() to setup_arch()  (Pekka J Enberg)

The init_gbpages() function is conditionally called from the init_memory_mapping() function. There are two call-sites where this 'after_bootmem' condition can be true: setup_arch() and mem_init() via pci_iommu_alloc(). Therefore, it's safe to move the call to init_gbpages() to setup_arch() as it's always called before mem_init().

This removes an after_bootmem use - paving the way to remove all uses of that state variable.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <Pine.LNX.4.64.0906221731210.19474@melkki.cs.Helsinki.FI>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
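A heavily elided sketch of the call-site move (only the placement is the point; the surrounding contents of the x86 setup_arch() are omitted):

    void __init setup_arch(char **cmdline_p)
    {
            /* ... early memory setup ... */

            init_gbpages();     /* moved here from init_memory_mapping() */

            /* init_memory_mapping() runs later in this function and no
             * longer needs the after_bootmem check to decide this */
            /* ... */
    }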
2009-06-23  Merge git://git.infradead.org/~dwmw2/iommu-2.6.31  (Linus Torvalds)

* git://git.infradead.org/~dwmw2/iommu-2.6.31:
  intel-iommu: Fix one last ia64 build problem in Pass Through Support
  VT-d: support the device IOTLB
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: add device IOTLB invalidation support
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  PCI: handle Virtual Function ATS enabling
  PCI: support the ATS capability
  intel-iommu: dmar_set_interrupt return error value
  intel-iommu: Tidy up iommu->gcmd handling
  intel-iommu: Fix tiny theoretical race in write-buffer flush.
  intel-iommu: Clean up handling of "caching mode" vs. IOTLB flushing.
  intel-iommu: Clean up handling of "caching mode" vs. context flushing.
  VT-d: fix invalid domain id for KVM context flush
  Fix !CONFIG_DMAR build failure introduced by Intel IOMMU Pass Through Support
  Intel IOMMU Pass Through Support

Fix up trivial conflicts in drivers/pci/{intel-iommu.c,intr_remapping.c}
2009-06-22  Merge branch 'for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into x86/urgent  (Ingo Molnar)
2009-06-22  x86: ensure percpu lpage doesn't consume too much vmalloc space  (Tejun Heo)

On extreme configurations (e.g. a 32bit 32-way NUMA machine), the lpage percpu first chunk allocator can consume too much vmalloc space. Make it fall back to the 4k allocator if the consumption goes over 20%.

[ Impact: add sanity check for lpage percpu first chunk allocator ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
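A sketch of the 20% sanity check, with illustrative names; the real check sits in the lpage first-chunk setup path:

    #include <asm/pgtable.h>    /* VMALLOC_START / VMALLOC_END */

    static bool sketch_lpage_too_greedy(size_t unit_size, int nr_units)
    {
            size_t vm_size = VMALLOC_END - VMALLOC_START;

            /* fall back to the 4k allocator beyond 20% of vmalloc space */
            return (size_t)nr_units * unit_size > vm_size / 5;
    }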
2009-06-22  x86: implement percpu_alloc kernel parameter  (Tejun Heo)

According to Andi, it isn't clear whether the lpage allocator is worth the trouble as there are many processors where PMD TLB is far scarcer than PTE TLB. The advantage or disadvantage probably depends on the actual size of the percpu area and the specific processor. As performance degradation due to TLB pressure tends to be highly workload-specific and subtle, it is difficult to decide which way to go without more data.

This patch implements the percpu_alloc kernel parameter to allow selecting which first chunk allocator to use, to ease debugging and testing. While at it, make sure all the failure paths report why something failed, to help determine why a certain allocator isn't working. Also, kill the "Great future plan" comment which had already been realized quite some time ago.

[ Impact: allow explicit percpu first chunk allocator selection ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
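A sketch of the kernel-parameter hook; the accepted strings follow this and the following changelog, while the enum and warning text are illustrative:

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    enum { FC_AUTO, FC_EMBED, FC_4K, FC_LPAGE };
    static int pcpu_chosen_fc = FC_AUTO;

    static int __init percpu_alloc_setup(char *str)
    {
            if (!strcmp(str, "embed"))
                    pcpu_chosen_fc = FC_EMBED;
            else if (!strcmp(str, "4k"))
                    pcpu_chosen_fc = FC_4K;
            else if (!strcmp(str, "lpage"))
                    pcpu_chosen_fc = FC_LPAGE;
            else
                    pr_warning("PERCPU: unknown allocator %s specified\n", str);

            return 0;
    }
    early_param("percpu_alloc", percpu_alloc_setup);

Booting with e.g. percpu_alloc=4k then forces the 4k first chunk allocator regardless of the default choice.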
2009-06-22  x86: fix pageattr handling for lpage percpu allocator and re-enable it  (Tejun Heo)

The lpage allocator aliases a PMD page for each cpu and returns whatever is unused to the page allocator. When the pageattr of the recycled pages is changed, this makes the two aliases point to overlapping regions with different attributes, which isn't allowed and is known to cause subtle data corruption in certain cases.

This can be handled in a similar manner to the x86_64 highmap alias: pageattr code should detect if the target pages have a PMD alias and should split the PMD alias and synchronize the attributes.

The pcpur allocator is updated to keep a map of the allocated PMD pages sorted in ascending address order and to provide the pcpu_lpage_remapped() function, which binary-searches the array to determine whether the given address is aliased and, if so, to which address. pageattr is updated to use pcpu_lpage_remapped() to detect the PMD alias and split it up as necessary from cpa_process_alias().

Jan Beulich spotted the original problem and incorrect usage of vaddr instead of laddr for lookup.

With this, the lpage percpu allocator should work correctly. Re-enable it.

[ Impact: fix subtle lpage pageattr bug and re-enable lpage ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
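A sketch of the alias lookup described above, with an illustrative array layout (not the actual pcpu_lpage_remapped() code): the recycled PMD pages are kept sorted by original address, so a binary search decides whether a linear address falls into a remapped large page.

    #include <asm/pgtable.h>    /* PMD_MASK */

    struct sketch_pcpul_ent {
            void            *ptr;       /* PMD-aligned original address */
            unsigned long   lpaddr;     /* address it was remapped to */
    };

    static void *sketch_lpage_remapped(struct sketch_pcpul_ent *map,
                                       int nr, void *kaddr)
    {
            void *pmd_addr = (void *)((unsigned long)kaddr & PMD_MASK);
            int lo = 0, hi = nr;

            while (lo < hi) {
                    int mid = (lo + hi) / 2;

                    if (map[mid].ptr < pmd_addr)
                            lo = mid + 1;
                    else if (map[mid].ptr > pmd_addr)
                            hi = mid;
                    else    /* alias found: translate into the remapped area */
                            return (void *)(map[mid].lpaddr +
                                            ((unsigned long)kaddr & ~PMD_MASK));
            }
            return NULL;    /* not aliased; nothing extra to split */
    }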
2009-06-22  x86: prepare setup_pcpu_lpage() for pageattr fix  (Tejun Heo)

Make the following changes in preparation of the coming pageattr updates.

* Define and use an array of struct pcpul_ent instead of an array of pointers. The only difference is the ->cpu field, which is set but unused yet.

* Rename variables according to the above change.

* Rename the local variable vm to pcpul_vm and move it out of the function.

[ Impact: no functional difference ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22  x86: rename remap percpu first chunk allocator to lpage  (Tejun Heo)

The "remap" allocator remaps large pages to build the first chunk; however, the name isn't very good because the 4k allocator remaps too and the whole point of the remap allocator is using large page mapping. The allocator will be generalized and exported outside of x86, so rename it to lpage before that happens.

The percpu_alloc kernel parameter is updated to accept both "remap" and "lpage" for the lpage allocator.

[ Impact: code cleanup, kernel parameter argument updated ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22  x86: fix duplicate free in setup_pcpu_remap() failure path  (Tejun Heo)

In the failure path, setup_pcpu_remap() tries to free the area which has already been freed to make holes in the large page. Fix it.

[ Impact: fix duplicate free in failure path ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-21  perf_counter, x86: Fix L1-data-Cache-Store-Referencees for AMD  (Jaswinder Singh Rajput)

Fix AMD's Data Cache Refills from System event.

After this patch:

 ./tools/perf/perf stat -e l1d -e l1d-misses -e l1d-write -e l1d-prefetch -e l1d-prefetch-miss -e l1i -e l1i-misses -e l1i-prefetch -e l2 -e l2-misses -e l2-write -e dtlb -e dtlb-misses -e itlb -e itlb-misses -e bpu -e bpu-misses ls /dev/ > /dev/null

 Performance counter stats for 'ls /dev/':

        2499484  L1-data-Cache-Load-Referencees             (scaled from 3.97%)
          70347  L1-data-Cache-Load-Misses                  (scaled from 7.30%)
           9360  L1-data-Cache-Store-Referencees            (scaled from 8.64%)
          32804  L1-data-Cache-Prefetch-Referencees         (scaled from 17.72%)
           7693  L1-data-Cache-Prefetch-Misses              (scaled from 22.97%)
        2180945  L1-instruction-Cache-Load-Referencees      (scaled from 28.48%)
          14518  L1-instruction-Cache-Load-Misses           (scaled from 35.00%)
           2405  L1-instruction-Cache-Prefetch-Referencees  (scaled from 34.89%)
          71387  L2-Cache-Load-Referencees                  (scaled from 34.94%)
          18732  L2-Cache-Load-Misses                       (scaled from 34.92%)
          79918  L2-Cache-Store-Referencees                 (scaled from 36.02%)
        1295294  Data-TLB-Cache-Load-Referencees            (scaled from 35.99%)
          30896  Data-TLB-Cache-Load-Misses                 (scaled from 33.36%)
        1222030  Instruction-TLB-Cache-Load-Referencees     (scaled from 29.46%)
            357  Instruction-TLB-Cache-Load-Misses          (scaled from 20.46%)
         530888  Branch-Cache-Load-Referencees              (scaled from 11.48%)
           8638  Branch-Cache-Load-Misses                   (scaled from 5.09%)

   0.011295149  seconds time elapsed.

Earlier it always showed the value 0.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1245484165.3102.6.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-21  x86: Set cpu_llc_id on AMD CPUs  (Andreas Herrmann)

This matters when building sched domains in case NUMA information is not available. (See cpu_coregroup_mask(), which uses llc_shared_map, which in turn is created based on cpu_llc_id.)

Currently Linux builds domains as follows (example from a dual socket quad-core system):

 CPU0 attaching sched-domain:
  domain 0: span 0-7 level CPU
   groups: 0 1 2 3 4 5 6 7
 ...
 CPU7 attaching sched-domain:
  domain 0: span 0-7 level CPU
   groups: 7 0 1 2 3 4 5 6

Ever since, this has been borked for multi-core AMD CPU systems. This patch fixes that and now we get a proper:

 CPU0 attaching sched-domain:
  domain 0: span 0-3 level MC
   groups: 0 1 2 3
  domain 1: span 0-7 level CPU
   groups: 0-3 4-7
 ...
 CPU7 attaching sched-domain:
  domain 0: span 4-7 level MC
   groups: 7 4 5 6
  domain 1: span 0-7 level CPU
   groups: 4-7 0-3

This allows the scheduler to assign tasks to cores on different sockets (i.e. that don't share the last level cache) for performance reasons.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20090619085909.GJ5218@alberich.amd.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
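A sketch of the assignment (illustrative; the real code sits in the AMD branch of CPU setup): cores on one socket share the last-level cache here, so the socket id can serve as cpu_llc_id when no finer information is available.

    static void sketch_set_llc_id(struct cpuinfo_x86 *c, int cpu)
    {
            /* assumption: per-socket LLC on these parts */
            per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
    }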
2009-06-20  Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
  perfcounter: Handle some IO return values
  perf_counter: Push perf_sample_data through the swcounter code
  perf_counter tools: Define and use our own u64, s64 etc. definitions
  perf_counter: Close race in perf_lock_task_context()
  perf_counter, x86: Improve interactions with fast-gup
  perf_counter: Simplify and fix task migration counting
  perf_counter tools: Add a data file header
  perf_counter: Update userspace callchain sampling uses
  perf_counter: Make callchain samples extensible
  perf report: Filter to parent set by default
  perf_counter tools: Handle lost events
  perf_counter: Add event overlow handling
  fs: Provide empty .set_page_dirty() aop for anon inodes
  perf_counter: tools: Makefile tweaks for 64-bit powerpc
  perf_counter: powerpc: Add processor back-end for MPC7450 family
  perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
  perf_counter: powerpc: Change how processor-specific back-ends get selected
  perf_counter: powerpc: Use unsigned long for register and constraint values
  perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
  perf_counter tools: Add and use isprint()
  ...
2009-06-20  Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix out of scope variable access in sched_slice()
  sched: Hide runqueues from direct refer at source code level
  sched: Remove unneeded __ref tag
  sched, x86: Fix cpufreq + sched_clock() TSC scaling
2009-06-20  Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
  tracing/urgent: warn in case of ftrace_start_up inbalance
  tracing/urgent: fix unbalanced ftrace_start_up
  function-graph: add stack frame test
  function-graph: disable when both x86_32 and optimize for size are configured
  ring-buffer: have benchmark test print to trace buffer
  ring-buffer: do not grab locks in nmi
  ring-buffer: add locks around rb_per_cpu_empty
  ring-buffer: check for less than two in size allocation
  ring-buffer: remove useless compile check for buffer_page size
  ring-buffer: remove useless warn on check
  ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
  tracing: update sample event documentation
  tracing/filters: fix race between filter setting and module unload
  tracing/filters: free filter_string in destroy_preds()
  ring-buffer: use commit counters for commit pointer accounting
  ring-buffer: remove unused variable
  ring-buffer: have benchmark test handle discarded events
  ring-buffer: prevent adding write in discarded area
  tracing/filters: strloc should be unsigned short
  tracing/filters: operand can be negative
  ...

Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually
2009-06-20  Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (45 commits)
  x86, mce: fix error path in mce_create_device()
  x86: use zalloc_cpumask_var for mce_dev_initialized
  x86: fix duplicated sysfs attribute
  x86: de-assembler-ize asm/desc.h
  i386: fix/simplify espfix stack switching, move it into assembly
  i386: fix return to 16-bit stack from NMI handler
  x86, ioapic: Don't call disconnect_bsp_APIC if no APIC present
  x86: Remove duplicated #include's
  x86: msr.h linux/types.h is only required for __KERNEL__
  x86: nmi: Add Intel processor 0x6f4 to NMI perfctr1 workaround
  x86, mce: mce_intel.c needs <asm/apic.h>
  x86: apic/io_apic.c: dmar_msi_type should be static
  x86, io_apic.c: Work around compiler warning
  x86: mce: Don't touch THERMAL_APIC_VECTOR if no active APIC present
  x86: mce: Handle banks == 0 case in K7 quirk
  x86, boot: use .code16gcc instead of .code16
  x86: correct the conversion of EFI memory types
  x86: cap iomem_resource to addressable physical memory
  x86, mce: rename _64.c files which are no longer 64-bit-specific
  x86, mce: mce.h cleanup
  ...

Manually fix up trivial conflict in arch/x86/mm/fault.c
2009-06-20  Merge branch 'x86/mce3' into x86/urgent  (Ingo Molnar)
2009-06-20  ACPI: pdc init related memory leak with physical CPU hotplug  (Pallipadi, Venkatesh)

arch_acpi_processor_cleanup_pdc() in x86 and ia64 results in memory allocated for _PDC objects that is never freed and will cause a memory leak in case of physical CPU removal and re-add. This patch fixes the memory leak by freeing the objects soon after _PDC is evaluated.

Reported-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-06-19  perf_counter: Make callchain samples extensible  (Peter Zijlstra)

Before exposing upstream tools to a callchain-samples ABI, tidy it up to make it more extensible in the future:

Use markers in the IP chain to denote context, and use the (u64)-1..-4095 range for these context markers because we use that range for ERR_PTR(), so these addresses are unlikely to be mapped.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
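A sketch of the marker scheme: values in the top (u64)-1..-4095 range are interleaved with real instruction pointers to label the entries that follow. The specific constants below are assumptions for illustration, not the committed ABI values.

    #include <linux/types.h>

    enum sketch_callchain_context {
            SKETCH_CONTEXT_HV       = (__u64)-32,
            SKETCH_CONTEXT_KERNEL   = (__u64)-128,
            SKETCH_CONTEXT_USER     = (__u64)-512,
            SKETCH_CONTEXT_MAX      = (__u64)-4095,
    };

    static inline int sketch_is_context_marker(__u64 ip)
    {
            return ip >= SKETCH_CONTEXT_MAX;    /* reserved, never mapped */
    }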
2009-06-18  function-graph: add stack frame test  (Steven Rostedt)

In case gcc does something funny with the stack frames, or the return-from-function code, we would like to detect that. An arch may implement passing of a variable that is unique to the function, can be saved on entering a function and can be tested when exiting the function. Usually the frame pointer can be used for this purpose.

This patch also implements this for x86, where it passes in the stack frame of the parent function and tests that frame on exit.

There was a case in x86_32 with optimize for size (-Os) where, for a few functions, gcc would align the stack frame and place a copy of the return address into it. The function graph tracer modified the copy and not the actual return address. On return from the function, it did not go to the tracer hook, but returned to the parent. This broke the function graph tracer, because the return of the parent (where gcc did not do this funky manipulation) returned to the location that the child function was supposed to. This caused strange kernel crashes. This test detected the problem and pointed out where the issue was.

This modifies the parameters of one of the functions that the arch-specific code calls, so it includes changes to arch code to accommodate the new prototype.

Note, I notice that the parisc arch implements its own push_return_trace. This is now a generic function and ftrace_push_return_trace should be used instead. This patch does not touch that code.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
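A sketch of the frame test, with illustrative names: the entry hook records the parent frame pointer next to the saved return address; the exit hook refuses to use the saved address if the frame pointer no longer matches.

    #include <linux/kernel.h>

    struct sketch_ret_stack {
            unsigned long ret;      /* original return address */
            unsigned long fp;       /* frame pointer at function entry */
    };

    static int sketch_pop_return_trace(struct sketch_ret_stack *entry,
                                       unsigned long cur_fp)
    {
            if (entry->fp != cur_fp) {
                    /* gcc rearranged the frame (e.g. x86_32 -Os): bail
                     * out rather than return through a stale copy */
                    WARN(1, "function graph: frame pointer mismatch");
                    return -1;
            }
            return 0;
    }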
2009-06-18  gcov: enable GCOV_PROFILE_ALL for x86_64  (Peter Oberparleiter)

Enable gcov profiling of the entire kernel on x86_64. Required changes include disabling profiling for:

* arch/kernel/acpi/realmode and arch/kernel/boot/compressed: not linked to main kernel
* arch/vdso, arch/kernel/vsyscall_64 and arch/kernel/hpet: profiling causes segfaults during boot (incompatible context)

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18  x86, mce: fix error path in mce_create_device()  (Hidetoshi Seto)

Don't skip removing mce_attrs on the error path starting at error2.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-18  x86: use zalloc_cpumask_var for mce_dev_initialized  (Yinghai Lu)

We need a cleared cpu_mask to record whether mce is initialized, especially when MAXSMP is used. Use zalloc_cpumask_var() instead.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
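The fix in miniature: zalloc_cpumask_var() hands back a zeroed mask, which matters with CONFIG_CPUMASK_OFFSTACK (e.g. MAXSMP) where alloc_cpumask_var() memory is uninitialized. The variable and function names are illustrative.

    #include <linux/cpumask.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    static cpumask_var_t sketch_initialized_mask;

    static int __init sketch_init(void)
    {
            if (!zalloc_cpumask_var(&sketch_initialized_mask, GFP_KERNEL))
                    return -ENOMEM;
            /* every bit is guaranteed clear here */
            return 0;
    }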
2009-06-18  x86: fix duplicated sysfs attribute  (Yinghai Lu)

The sysfs attribute cmci_disabled was accidentally turned into a duplicate of ignore_ce, breaking all other attributes.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-18  x86: de-assembler-ize asm/desc.h  (Alexander van Heukelum)

asm/desc.h is included in three assembly files, but the only macro it defines, GET_DESC_BASE, is never used. This patch removes the includes, removes the macro GET_DESC_BASE and the ASSEMBLY guard from asm/desc.h.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-18  i386: fix/simplify espfix stack switching, move it into assembly  (Alexander van Heukelum)

The espfix code triggers if we have a protected-mode userspace application with a 16-bit stack. On returning to userspace, with iret, the CPU doesn't restore the high word of the stack pointer. This is an "official" bug, and the work-around used in the kernel is to temporarily switch to a 32-bit stack segment/pointer pair where the high word of the pointer is equal to the high word of the userspace stack pointer.

The current implementation uses THREAD_SIZE to determine the cut-off, but there is no good reason not to use the more natural 64kb... However, implementing this by simply substituting THREAD_SIZE with 65536 in patch_espfix_desc crashed the test application.

patch_espfix_desc tries to do what is described above, but gets it subtly wrong if the userspace stack pointer is just below a multiple of THREAD_SIZE: an overflow occurs to bit 13... With a bit of luck, when the kernelspace stack pointer is just below a 64kb boundary, the overflow then ripples through to bit 16 and userspace will see its stack pointer changed by 65536.

This patch moves all espfix code into entry_32.S. Selecting a 16-bit cut-off simplifies the code. The game with changing the limit dynamically is removed too. It complicates matters and I see no value in it. Changing only the top 16-bit word of ESP is one instruction, and it also implies that only two bytes of the ESPFIX GDT entry need to be changed, and this can be implemented in just a handful of simple-to-understand instructions. As a side effect, the operation to compute the original ESP from the ESPFIX ESP and the GDT entry simplifies a bit too, and the remaining three instructions have been expanded inline in entry_32.S.

impact: can now reliably run userspace with ESP=xxxxfffc on 16-bit stack segment

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-18  i386: fix return to 16-bit stack from NMI handler  (Alexander van Heukelum)

Returning to a task with a 16-bit stack requires special care: the iret instruction does not restore the high word of esp in that case. The espfix code fixes this, but currently is not invoked on NMIs. This means that a running task gets the upper word of esp clobbered due to intervening NMIs. To reproduce, compile and run the following program with the nmi watchdog enabled (nmi_watchdog=2 on the command line). Using gdb you can see that the high bits of esp contain garbage, while the low bits are still correct.

This patch puts the espfix code back into the NMI code path.

The patch is slightly complicated due to the irqtrace infrastructure not being NMI-safe. The NMI return path cannot call TRACE_IRQS_IRET. Otherwise, the tail of the normal iret-code is correct for the nmi code path too. To be able to share this code path, the TRACE_IRQS_IRET was moved up a bit. The espfix code exists after the TRACE_IRQS_IRET, but this code explicitly disables interrupts. This short interrupts-off section is now not traced anymore. The return-to-kernel path now always includes the preliminary test to decide if the espfix code should be called. This is never the case, but doing it this way keeps the patch as simple as possible and the few extra instructions should not affect timing in any significant way.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>

int modify_ldt(int func, void *ptr, unsigned long bytecount)
{
	return syscall(SYS_modify_ldt, func, ptr, bytecount);
}

/* this is assumed to be usable */
#define SEGBASEADDR 0x10000
#define SEGLIMIT 0x20000

/* 16-bit segment */
struct user_desc desc = {
	.entry_number = 0,
	.base_addr = SEGBASEADDR,
	.limit = SEGLIMIT,
	.seg_32bit = 0,
	.contents = 0, /* ??? */
	.read_exec_only = 0,
	.limit_in_pages = 0,
	.seg_not_present = 0,
	.useable = 1
};

int main(void)
{
	setvbuf(stdout, NULL, _IONBF, 0);

	/* map a 64 kb segment */
	char *pointer = mmap((void *)SEGBASEADDR, SEGLIMIT+1,
			PROT_EXEC|PROT_READ|PROT_WRITE,
			MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (pointer == NULL) {
		printf("could not map space\n");
		return 0;
	}

	/* write ldt, new mode */
	int err = modify_ldt(0x11, &desc, sizeof(desc));
	if (err) {
		printf("error modifying ldt: %i\n", err);
		return 0;
	}

	for (int i=0; i<1000; i++) {
		asm volatile (
			"pusha\n\t"
			"mov %ss, %eax\n\t"		/* preserve ss:esp */
			"mov %esp, %ebp\n\t"
			"push $7\n\t"			/* index 0, ldt, user mode */
			"push $65536-4096\n\t"		/* esp */
			"lss (%esp), %esp\n\t"		/* switch to new stack */
			"push %eax\n\t"			/* save old ss:esp on new stack */
			"push %ebp\n\t"
			"add $17*65536, %esp\n\t"	/* set high bits */
			"mov %esp, %edx\n\t"
			"mov $10000000, %ecx\n\t"	/* wait... */
			"1: loop 1b\n\t"		/* ... a bit */
			"cmp %esp, %edx\n\t"
			"je 1f\n\t"
			"ud2\n\t"			/* esp changed inexplicably! */
			"1:\n\t"
			"sub $17*65536, %esp\n\t"	/* restore high bits */
			"lss (%esp), %esp\n\t"		/* restore old ss:esp */
			"popa\n\t");

		printf("\rx%ix", i);
	}
	return 0;
}

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Returning to a task with a 16-bit stack requires special care: the iret instruction does not restore the high word of esp in that case. The espfix code fixes this, but currently is not invoked on NMIs. This means that a running task gets the upper word of esp clobbered due intervening NMIs. To reproduce, compile and run the following program with the nmi watchdog enabled (nmi_watchdog=2 on the command line). Using gdb you can see that the high bits of esp contain garbage, while the low bits are still correct. This patch puts the espfix code back into the NMI code path. The patch is slightly complicated due to the irqtrace infrastructure not being NMI-safe. The NMI return path cannot call TRACE_IRQS_IRET. Otherwise, the tail of the normal iret-code is correct for the nmi code path too. To be able to share this code-path, the TRACE_IRQS_IRET was move up a bit. The espfix code exists after the TRACE_IRQS_IRET, but this code explicitly disables interrupts. This short interrupts-off section is now not traced anymore. The return-to-kernel path now always includes the preliminary test to decide if the espfix code should be called. This is never the case, but doing it this way keeps the patch as simple as possible and the few extra instructions should not affect timing in any significant way. #define _GNU_SOURCE #include <stdio.h> #include <sys/types.h> #include <sys/mman.h> #include <unistd.h> #include <sys/syscall.h> #include <asm/ldt.h> int modify_ldt(int func, void *ptr, unsigned long bytecount) { return syscall(SYS_modify_ldt, func, ptr, bytecount); } /* this is assumed to be usable */ #define SEGBASEADDR 0x10000 #define SEGLIMIT 0x20000 /* 16-bit segment */ struct user_desc desc = { .entry_number = 0, .base_addr = SEGBASEADDR, .limit = SEGLIMIT, .seg_32bit = 0, .contents = 0, /* ??? */ .read_exec_only = 0, .limit_in_pages = 0, .seg_not_present = 0, .useable = 1 }; int main(void) { setvbuf(stdout, NULL, _IONBF, 0); /* map a 64 kb segment */ char *pointer = mmap((void *)SEGBASEADDR, SEGLIMIT+1, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0); if (pointer == NULL) { printf("could not map space\n"); return 0; } /* write ldt, new mode */ int err = modify_ldt(0x11, &desc, sizeof(desc)); if (err) { printf("error modifying ldt: %i\n", err); return 0; } for (int i=0; i<1000; i++) { asm volatile ( "pusha\n\t" "mov %ss, %eax\n\t" /* preserve ss:esp */ "mov %esp, %ebp\n\t" "push $7\n\t" /* index 0, ldt, user mode */ "push $65536-4096\n\t" /* esp */ "lss (%esp), %esp\n\t" /* switch to new stack */ "push %eax\n\t" /* save old ss:esp on new stack */ "push %ebp\n\t" "add $17*65536, %esp\n\t" /* set high bits */ "mov %esp, %edx\n\t" "mov $10000000, %ecx\n\t" /* wait... */ "1: loop 1b\n\t" /* ... a bit */ "cmp %esp, %edx\n\t" "je 1f\n\t" "ud2\n\t" /* esp changed inexplicably! */ "1:\n\t" "sub $17*65536, %esp\n\t" /* restore high bits */ "lss (%esp), %esp\n\t" /* restore old ss:esp */ "popa\n\t"); printf("\rx%ix", i); } return 0; } Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> Acked-by: Stas Sergeev <stsp@aknet.ru> Signed-off-by: H. Peter Anvin <hpa@zytor.com>