summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2013-07-01powerpc/pseries: Inform the hypervisor we are using EBB regsMichael Ellerman
On LPAR systems we need to inform the hypervisor that we are using the EBB registers. We do this by setting a bit in the Virtual Processor Area (VPA) - formerly known as the lppaca. For now we do this always, ie. we do not dynamically enable/disable. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Add power8 EBB supportMichael Ellerman
Add logic to the power8 PMU code to support EBB. Future processors would also be expected to implement similar constraints. At that time we could possibly factor these out into common code. Finally mark the power8 PMU as supporting EBB, which is the actual enable switch which allows EBBs to be configured. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Core EBB support for 64-bit book3sMichael Ellerman
Add support for EBB (Event Based Branches) on 64-bit book3s. See the included documentation for more details. EBBs are a feature which allows the hardware to branch directly to a specified user space address when a PMU event overflows. This can be used by programs for self-monitoring with no kernel involvement in the inner loop. Most of the logic is in the generic book3s code, primarily to avoid a proliferation of PMU callbacks. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Drop MMCRA from thread_structMichael Ellerman
In commit 59affcd "Context switch more PMU related SPRs" I added more PMU SPRs to thread_struct, later modified in commit b11ae95. To add insult to injury it turns out we don't need to switch MMCRA as it's only user readable, and the value is recomputed by the PMU code. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Don't enable if we have zero eventsMichael Ellerman
In power_pmu_enable() we still enable the PMU even if we have zero events. This should have no effect but doesn't make much sense. Instead just return after telling the hypervisor that we are not using the PMCs. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Use existing out label in power_pmu_enable()Michael Ellerman
In power_pmu_enable() we can use the existing out label to reduce the number of return paths. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Freeze PMC5/6 if we're not using themMichael Ellerman
On Power8 we can freeze PMC5 and 6 if we're not using them. Normally they run all the time. As noticed by Anshuman, we should unfreeze them when we disable the PMU as there are legacy tools which expect them to run all the time. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Rework disable logic in pmu_disable()Michael Ellerman
In pmu_disable() we disable the PMU by setting the FC (Freeze Counters) bit in MMCR0. In order to do this we have to read/modify/write MMCR0. It's possible that we read a value from MMCR0 which has PMAO (PMU Alert Occurred) set. When we write that value back it will cause an interrupt to occur. We will then end up in the PMU interrupt handler even though we are supposed to have just disabled the PMU. We can avoid this by making sure we never write PMAO back. We should not lose interrupts because when the PMU is re-enabled the overflowed values will cause another interrupt. We also reorder the clearing of SAMPLE_ENABLE so that is done after the PMU is frozen. Otherwise there is a small window between the clearing of SAMPLE_ENABLE and the setting of FC where we could take an interrupt and incorrectly see SAMPLE_ENABLE not set. This would for example change the logic in perf_read_regs(). Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/perf: Check that events only include valid bits on Power8Michael Ellerman
A mistake we have made in the past is that we pull out the fields we need from the event code, but don't check that there are no unknown bits set. This means that we can't ever assign meaning to those unknown bits in future. Although we have once again failed to do this at release, it is still early days for Power8 so I think we can still slip this in and get away with it. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Wire up the HV facility unavailable exceptionMichael Ellerman
Similar to the facility unavailble exception, except the facilities are controlled by HFSCR. Adapt the facility_unavailable_exception() so it can be called for either the regular or Hypervisor facility unavailable exceptions. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Rename and flesh out the facility unavailable exception handlerMichael Ellerman
The exception at 0xf60 is not the TM (Transactional Memory) unavailable exception, it is the "Facility Unavailable Exception", rename it as such. Flesh out the handler to acknowledge the fact that it can be called for many reasons, one of which is TM being unavailable. Use STD_EXCEPTION_COMMON() for the exception body, for some reason we had it open-coded, I've checked the generated code is identical. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.10] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Remove KVMTEST from RELON exception handlersMichael Ellerman
KVMTEST is a macro which checks whether we are taking an exception from guest context, if so we branch out of line and eventually call into the KVM code to handle the switch. When running real guests on bare metal (HV KVM) the hardware ensures that we never take a relocation on exception when transitioning from guest to host. For PR KVM we disable relocation on exceptions ourself in kvmppc_core_init_vm(), as of commit a413f47 "Disable relocation on exceptions whenever PR KVM is active". So convert all the RELON macros to use NOTEST, and drop the remaining KVM_HANDLER() definitions we have for 0xe40 and 0xe80. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.9+] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Remove unreachable relocation on exception handlersMichael Ellerman
We have relocation on exception handlers defined for h_data_storage and h_instr_storage. However we will never take relocation on exceptions for these because they can only come from a guest, and we never take relocation on exceptions when we transition from guest to host. We also have a handler for hmi_exception (Hypervisor Maintenance) which is defined in the architecture to never be delivered with relocation on, see see v2.07 Book III-S section 6.5. So remove the handlers, leaving a branch to self just to be double extra paranoid. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> CC: <stable@vger.kernel.org> [v3.9+] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/numa: Do not update sysfs cpu registration from invalid contextNathan Fontenot
The topology update code that updates the cpu node registration in sysfs should not be called while in stop_machine(). The register/unregister calls take a lock and may sleep. This patch moves these calls outside of the call to stop_machine(). Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/smp: Section mismatch from smp_release_cpus to __initdata ↵Chen Gang
spinning_secondaries the smp_release_cpus is a normal funciton and called in normal environments, but it calls the __initdata spinning_secondaries. need modify spinning_secondaries to match smp_release_cpus. the related warning: (the linker report boot_paca.33377, but it should be spinning_secondaries) ----------------------------------------------------------------------------- WARNING: arch/powerpc/kernel/built-in.o(.text+0x23176): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377 The function .smp_release_cpus() references the variable __initdata boot_paca.33377. This is often because .smp_release_cpus lacks a __initdata annotation or the annotation of boot_paca.33377 is wrong. WARNING: arch/powerpc/kernel/built-in.o(.text+0x231fe): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377 The function .smp_release_cpus() references the variable __initdata boot_paca.33377. This is often because .smp_release_cpus lacks a __initdata annotation or the annotation of boot_paca.33377 is wrong. ----------------------------------------------------------------------------- Signed-off-by: Chen Gang <gang.chen@asianux.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/nvram64: Need return the related error code on failure occursChen Gang
When error occurs, need return the related error code to let upper caller know about it. ppc_md.nvram_size() can return the error code (e.g. core99_nvram_size() in 'arch/powerpc/platforms/powermac/nvram.c'). Also set ret value when only need it, so can save structions for normal cases. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Set cpu sibling mask before online cpuLi Zhong
It seems following race is possible: cpu0 cpux smp_init->cpu_up->_cpu_up __cpu_up kick_cpu(1) ------------------------------------------------------------------------- waiting online ... ... notify CPU_STARTING set cpux active set cpux online ------------------------------------------------------------------------- finish waiting online ... sched_init_smp init_sched_domains(cpu_active_mask) build_sched_domains set cpux sibling info ------------------------------------------------------------------------- Execution of cpu0 and cpux could be concurrent between two separator lines. So if the cpux sibling information was set too late (normally impossible, but could be triggered by adding some delay in start_secondary, after setting cpu online), build_sched_domains() running on cpu0 might see cpux active, with an empty sibling mask, then cause some bad address accessing like following: [ 0.099855] Unable to handle kernel paging request for data at address 0xc00000038518078f [ 0.099868] Faulting instruction address: 0xc0000000000b7a64 [ 0.099883] Oops: Kernel access of bad area, sig: 11 [#1] [ 0.099895] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries [ 0.099922] Modules linked in: [ 0.099940] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-rc1-00120-gb973425-dirty #16 [ 0.099956] task: c0000001fed80000 ti: c0000001fed7c000 task.ti: c0000001fed7c000 [ 0.099971] NIP: c0000000000b7a64 LR: c0000000000b7a40 CTR: c0000000000b4934 [ 0.099985] REGS: c0000001fed7f760 TRAP: 0300 Not tainted (3.10.0-rc1-00120-gb973425-dirty) [ 0.099997] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24272828 XER: 20000003 [ 0.100045] SOFTE: 1 [ 0.100053] CFAR: c000000000445ee8 [ 0.100064] DAR: c00000038518078f, DSISR: 40000000 [ 0.100073] GPR00: 0000000000000080 c0000001fed7f9e0 c000000000c84d48 0000000000000010 GPR04: 0000000000000010 0000000000000000 c0000001fc55e090 0000000000000000 GPR08: ffffffffffffffff c000000000b80b30 c000000000c962d8 00000003845ffc5f GPR12: 0000000000000000 c00000000f33d000 c00000000000b9e4 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000000 GPR20: c000000000ccf750 0000000000000000 c000000000c94d48 c0000001fc504000 GPR24: c0000001fc504000 c0000001fecef848 c000000000c94d48 c000000000ccf000 GPR28: c0000001fc522090 0000000000000010 c0000001fecef848 c0000001fed7fae0 [ 0.100293] NIP [c0000000000b7a64] .get_group+0x84/0xc4 [ 0.100307] LR [c0000000000b7a40] .get_group+0x60/0xc4 [ 0.100318] Call Trace: [ 0.100332] [c0000001fed7f9e0] [c0000000000dbce4] .lock_is_held+0xa8/0xd0 (unreliable) [ 0.100354] [c0000001fed7fa70] [c0000000000bf62c] .build_sched_domains+0x728/0xd14 [ 0.100375] [c0000001fed7fbe0] [c000000000af67bc] .sched_init_smp+0x4fc/0x654 [ 0.100394] [c0000001fed7fce0] [c000000000adce24] .kernel_init_freeable+0x17c/0x30c [ 0.100413] [c0000001fed7fdb0] [c00000000000ba08] .kernel_init+0x24/0x12c [ 0.100431] [c0000001fed7fe30] [c000000000009f74] .ret_from_kernel_thread+0x5c/0x68 [ 0.100445] Instruction dump: [ 0.100456] 38800010 38a00000 4838e3f5 60000000 7c6307b4 2fbf0000 419e0040 3d220001 [ 0.100496] 78601f24 39491590 e93e0008 7d6a002a <7d69582a> f97f0000 7d4a002a e93e0010 [ 0.100559] ---[ end trace 31fd0ba7d8756001 ]--- This patch tries to move the sibling maps updating before notify_cpu_starting() and cpu online, and a write barrier there to make sure sibling maps are updated before active and online mask. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Delete __cpuinit usage from all usersPaul Gortmaker
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the powerpc uses of the __cpuinit macros. There are no __CPUINIT users in assembly files in powerpc. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Josh Boyer <jwboyer@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/idle: Convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/iommu: Remove unused pci_iommu_init() and pci_direct_iommu_init()Bjorn Helgaas
pci_iommu_init() and pci_direct_iommu_init() are not referenced anywhere, so remove them. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Don't flush/invalidate the d/icache for an unknown relocation typeKevin Hao
For an unknown relocation type since the value of r4 is just the 8bit relocation type, the sum of r4 and r7 may yield an invalid memory address. For example: In normal case: r4 = c00xxxxx r7 = 40000000 r4 + r7 = 000xxxxx For an unknown relocation type: r4 = 000000xx r7 = 40000000 r4 + r7 = 400000xx 400000xx is an invalid memory address for a board which has just 512M memory. And for operations such as dcbst or icbi may cause bus error for an invalid memory address on some platforms and then cause the board reset. So we should skip the flush/invalidate the d/icache for an unknown relocation type. Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: Suzuki K. Poulose <suzuki@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/powernv: Use dev-node in PCI config accessorsGavin Shan
Currently, we're using the combo (PCI bus + devfn) in the PCI config accessors and PCI config accessors in EEH depends on them. However, it's not safe to refer the PCI bus which might have been removed during hotplug. So we're using device node in the PCI config accessors and the corresponding backends just reuse them. The patch also fix one potential risk: We possiblly have frozen PE during the early PCI probe time, but we haven't setup the PE mapping yet. So the errors should be counted to PE#0. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/eeh: Avoid build warningsGavin Shan
The patch is for avoiding following build warnings: The function .pnv_pci_ioda_fixup() references the function __init .eeh_init(). This is often because .pnv_pci_ioda_fixup lacks a __init The function .pnv_pci_ioda_fixup() references the function __init .eeh_addr_cache_build(). This is often because .pnv_pci_ioda_fixup lacks a __init Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/eeh: Refactor the output messageGavin Shan
We needn't the the whole backtrace other than one-line message in the error reporting interrupt handler. For errors triggered by access PCI config space or MMIO, we replace "WARN(1, ...)" with pr_err() and dump_stack(). The patch also adds more output messages to indicate what EEH core is doing. Besides, some printk() are replaced with pr_warning(). Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/eeh: Fix address catch for PowerNVGavin Shan
On the PowerNV platform, the EEH address cache isn't built correctly because we skipped the EEH devices without binding PE. The patch fixes that. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/powernv: Replace variables with flagsGavin Shan
We have 2 fields in "struct pnv_phb" to trace the states. The patch replace the fields with one and introduces flags for that. The patch doesn't impact the logic. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/eeh: Check PCIe link after resetGavin Shan
After reset (e.g. complete reset) in order to bring the fenced PHB back, the PCIe link might not be ready yet. The patch intends to make sure the PCIe link is ready before accessing its subordinate PCI devices. The patch also fixes that wrong values restored to PCI_COMMAND register for PCI bridges. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/eeh: Don't collect PCI-CFG data on PHBGavin Shan
When the PHB is fenced or dead, it's pointless to collect the data from PCI config space of subordinate PCI devices since it should return 0xFF's. The patch also fixes overwritten buffer while getting PCI config data. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-30powerpc/tm: Clear MSR RI in non-recoverable TM codeMichael Neuling
When we treclaim and trecheckpoint there's an unavoidable period when r1 will not be a valid kernel stack pointer. This patch clears the MSR recoverable interrupt (RI) bit over these regions to indicate we have an invalid kernel stack pointer. For treclaim, the region over which we clear MSR RI is larger than required to avoid the need for an extra costly mtmsrd. Thanks to Paulus for suggesting this change. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-30powerpc: Fix string instr. emulation for 32-bit processes on ppc64James Yang
String instruction emulation would erroneously result in a segfault if the upper bits of the EA are set and is so high that it fails access check. Truncate the EA to 32 bits if the process is 32-bit. Signed-off-by: James Yang <James.Yang@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-30trivial: powerpc: Fix typo in ioei_interrupt() descriptionSebastien Bessiere
Signed-off-by: Sebastien Bessiere <sebastien.bessiere@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/eeh: Use interruptible sleep in keehdGavin Shan
To replace down() with down_interrutible() to avoid following warning: [c00000007ba7b710] [c000000000014410] .__switch_to+0x1b0/0x380 [c00000007ba7b7c0] [c0000000007b408c] .__schedule+0x3ec/0x970 [c00000007ba7ba50] [c0000000007b1f24] .schedule_timeout+0x1a4/0x2b0 [c00000007ba7bb30] [c0000000007b34a4] .__down+0xa4/0x104 [c00000007ba7bbf0] [c0000000000b9230] .down+0x60/0x70 [c00000007ba7bc80] [c0000000000336d0] .eeh_event_handler+0x70/0x190 [c00000007ba7bd30] [c0000000000b1a58] .kthread+0xe8/0xf0 [c00000007ba7be30] [c00000000000a05c] .ret_from_kernel_thread+0x5c/0x8 This also avoids keeping the load average up while doing nothing. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/eeh: Remove eeh_mutexGavin Shan
Originally, eeh_mutex was introduced to protect the PE hierarchy tree and the attached EEH devices because EEH core was possiblly running with multiple threads to access the PE hierarchy tree. However, we now have only one kthread in EEH core. So we needn't the eeh_mutex and just remove it. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/mm: Fix build warnings with CONFIG_TRANSPARENT_HUGEPAGE disabledNathan Fontenot
Building with CONFIG_TRANSPARENT_HUGEPAGE disabled causes the following build wearnings; powerpc/arch/powerpc/include/asm/mmu-hash64.h: In function ‘__hash_page_thp’: powerpc/arch/powerpc/include/asm/mmu-hash64.h:354: warning: no return statement in function returning non-void This patch adds a return -1 to the static inline for __hash_page_thp() to correct the warnings. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/pseries: Enable PSTORE in pseries_defconfigAruna Balakrishnaiah
Since now we have pstore support for nvram in pseries, enable it in the default config. With this config option enabled, pstore infra-structure will be used to read/write the messages from/to nvram. Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/hw_brk: Fix clearing of extraneous IRQMichael Neuling
In 9422de3 "powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint registers" we changed the way we mark extraneous irqs with this: - info->extraneous_interrupt = !((bp->attr.bp_addr <= dar) && - (dar - bp->attr.bp_addr < bp->attr.bp_len)); + if (!((bp->attr.bp_addr <= dar) && + (dar - bp->attr.bp_addr < bp->attr.bp_len))) + info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ; Unfortunately this is bogus as it never clears extraneous IRQ if it's already set. This correctly clears extraneous IRQ before possibly setting it. Signed-off-by: Michael Neuling <mikey@neuling.org> Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-25powerpc/hw_brk: Fix setting of length for exact mode breakpointsMichael Neuling
The smallest match region for both the DABR and DAWR is 8 bytes, so the kernel needs to filter matches when users want to look at regions smaller than this. Currently we set the length of PPC_BREAKPOINT_MODE_EXACT breakpoints to 8. This is wrong as in exact mode we should only match on 1 address, hence the length should be 1. This ensures that the kernel will filter out any exact mode hardware breakpoint matches on any addresses other than the requested one. Signed-off-by: Michael Neuling <mikey@neuling.org> Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Optimize hugepage invalidateAneesh Kumar K.V
Hugepage invalidate involves invalidating multiple hpte entries. Optimize the operation using H_BULK_REMOVE on lpar platforms. On native, reduce the number of tlb flush. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/THP: Enable THP on PPC64Aneesh Kumar K.V
We enable only if the we support 16MB page size. Reviewed-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: split hugepage when using subpage protectionAneesh Kumar K.V
We find all the overlapping vma and mark them such that we don't allocate hugepage in that range. Also we split existing huge page so that the normal page hash can be invalidated and new page faulted in with new protection bits. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: disable assert_pte_locked for collapse_huge_pageAneesh Kumar K.V
With THP we set pmd to none, before we do pte_clear. Hence we can't walk page table to get the pte lock ptr and verify whether it is locked. THP do take pte lock before calling pte_clear. So we don't change the locking rules here. It is that we can't use page table walking to check whether pte locks are held with THP. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Prevent gcc to re-read the pagetablesAneesh Kumar K.V
GCC is very likely to read the pagetables just once and cache them in the local stack or in a register, but it is can also decide to re-read the pagetables. The problem is that the pagetable in those places can change from under gcc. With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can change under gup_fast. The pages won't be freed untill we finish gup fast because we have irq disabled and we free these pages via rcu callback. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Make linux pagetable walk safe with THP enabledAneesh Kumar K.V
We need to have irqs disabled to handle all the possible parallel update for linux page table without holding locks. Events that we are intersted in while walking page tables are 1) Page fault 2) umap 3) THP split 4) THP collapse A) local_irq_disabled: ------------------------ 1) page fault: A none to valid transition via page fault is not an issue because we would either see a none or valid. If it is none, we would error out the page table walk. We may need to use on stack values when checking for type of page table elements, because if we do if (!is_hugepd()) { if (!pmd_none() { if (pmd_bad() { We could take that bad condition because the pmd got converted to a hugepd after the !is_hugepd check via a hugetlb fault. The right way would be to check for pmd_none higher up or use on stack value. 2) A valid to none conversion via unmap: We can safely walk the upper level table, because we don't remove the the page table entries until rcu grace period. So even if we followed a wrong pointer we still have the pointer valid till the grace period. A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and _PAGE_BUSY. A valid pointer returned could becoming none later. To prevent pte_clear we take _PAGE_BUSY. 3) THP split: A valid transparent hugepage is converted to nomal page. Before we split we do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING So when walking page table we need to check for pmd_trans_splitting and handle that. The pte returned should also need to be checked for _PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save the value of PTE on stack and check for the flag in the local pte value. If we don't have the value set we can safely operate on the local pte value and we atomicaly set _PAGE_BUSY. 4) THP collapse: A normal page gets converted to hugepage. In the collapse path, we mark the pmd none early (pmdp_clear_flush). With irq disabled, if we are aleady walking page table we would see the pmd_none and won't continue. If we see a valid PMD, we should still check for _PAGE_PRESENT before setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/THP: Add code to handle HPTE faults for hugepagesAneesh Kumar K.V
The deposted PTE page in the second half of the PMD table is used to track the state on hash PTEs. After updating the HPTE, we mark the coresponding slot in the deposted PTE page valid. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Update gup_pmd_range to handle transparent hugepagesAneesh Kumar K.V
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/kvm: Handle transparent hugepage in KVMAneesh Kumar K.V
We can find pte that are splitting while walking page tables. Return None pte in that case. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Replace find_linux_pte with find_linux_pte_or_hugepteAneesh Kumar K.V
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly document why we don't need to handle transparent hugepages at callsites. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepagesAneesh Kumar K.V
Reviewed-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common codeAneesh Kumar K.V
We will use this in the later patch for handling THP pages Reviewed-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/THP: Implement transparent hugepages for ppc64Aneesh Kumar K.V
We now have pmd entries covering 16MB range and the PMD table double its original size. We use the second half of the PMD table to deposit the pgtable (PTE page). The depoisted PTE page is further used to track the HPTE information. The information include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry. With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need 4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk the PTE page and invalidate all valid HPTEs. This patch implements necessary arch specific functions for THP support and also hugepage invalidate logic. These PMD related functions are intentionally kept similar to their PTE counter-part. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>