summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)Author
2015-02-13x86/mce: Defer mce wakeups to threads for PREEMPT_RTSteven Rostedt
We had a customer report a lockup on a 3.0-rt kernel that had the following backtrace: [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113 [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56 [ffff88107fca3f50] __wake_up at ffffffff81043379 [ffff88107fca3f80] mce_notify_irq at ffffffff81017328 [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508 [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1 [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853 It actually bugged because the lock was taken by the same owner that already had that lock. What happened was the thread that was setting itself on a wait queue had the lock when an MCE triggered. The MCE interrupt does a wake up on its wait list and grabs the same lock. NOTE: THIS IS NOT A BUG ON MAINLINE Sorry for yelling, but as I Cc'd mainline maintainers I want them to know that this is an PREEMPT_RT bug only. I only Cc'd them for advice. On PREEMPT_RT the wait queue locks are converted from normal "spin_locks" into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be taken by hard interrupt context. This usually isn't a problem as most all interrupts in PREEMPT_RT are converted into schedulable threads. Unfortunately that's not the case with the MCE irq. As wait queue locks are notorious for long hold times, we can not convert them to raw_spin_locks without causing issues with -rt. But Thomas has created a "simple-wait" structure that uses raw spin locks which may have been a good fit. Unfortunately, wait queues are not the only issue, as the mce_notify_irq also does a schedule_work(), which grabs the workqueue spin locks that have the exact same issue. Thus, this patch I'm proposing is to move the actual work of the MCE interrupt into a helper thread that gets woken up on the MCE interrupt and does the work in a schedulable context. NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET Oops, sorry for yelling again, but I want to stress that I keep the same behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the MCE behavior when PREEMPT_RT is configured. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13x86/mce: fix mce timer intervalMike Galbraith
Seems mce timer fire at the wrong frequency in -rt kernels since roughly forever due to 32 bit overflow. 3.8-rt is also missing a multiplier. Add missing us -> ns conversion and 32 bit overflow prevention. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <bitbucket@online.de> [bigeasy: use ULL instead of u64 cast] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13x86: Convert mce timer to hrtimerThomas Gleixner
mce_timer is started in atomic contexts of cpu bringup. This results in might_sleep() warnings on RT. Convert mce_timer to a hrtimer to avoid this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13Reset to 3.12.37Scott Wood
2014-05-14Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"Sebastian Andrzej Siewior
where do I start. Let me explain what is going on here. The code sequence | pushf | pop %edx | or $0x1,%dh | push %edx | mov $0xe0,%eax | popf | sysenter triggers the bug. On 64bit kernel we see the double fault (with 32bit and 64bit userland) and on 32bit kernel there is no problem. The reporter said that double fault does not happen on 64bit kernel with 64bit userland and this is because in that case the VDSO uses the "syscall" interface instead of "sysenter". The bug. "popf" loads the flags with the TF bit set which enables "single stepping" and this leads to a debug exception. Usually on 64bit we have a special IST stack for the debug exception. Due to patch [0] we do not use the IST stack but the kernel stack instead. On 64bit the sysenter instruction starts in kernel with the stack address NULL. The code sequence above enters the debug exception (TF flag) after the sysenter instruction was executed which sets the stack pointer to NULL and we have a fault (it seems that the debug exception saves some bytes on the stack). To fix the double fault I'm going to drop patch [0]. It is completely pointless. In do_debug() and do_stack_segment() we disable preemption which means the task can't leave the CPU. So it does not matter if we run on IST or on kernel stack. There is a patch [1] which drops preempt_disable() call for a 32bit kernel but not for 64bit so there should be no regression. And [1] seems valid even for this code sequence. We enter the debug exception with a 256bytes long per cpu stack and migrate to the kernel stack before calling do_debug(). [0] x86-disable-debug-stack.patch [1] fix-rt-int3-x86_32-3.2-rt.patch Cc: stable-rt@vger.kernel.org Reported-by: Brian Silverman <bsilver16384@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RTAndi Kleen
Normally the x86-64 trap handlers for debug/int 3/stack fault run on a special interrupt stack to make them more robust when dealing with kernel code. The PREEMPT_RT kernel can sleep in locks even while allocating GFP_ATOMIC memory. When one of these trap handlers needs to send real time signals for ptrace it allocates memory and could then try to to schedule. But it is not allowed to schedule on a IST stack. This can cause warnings and hangs. This patch disables the IST stacks for these handlers for PREEMPT_RT kernel. Instead let them run on the normal process stack. The kernel only really needs the ISTs here to make kernel debuggers more robust in case someone sets a break point somewhere where the stack is invalid. But there are no kernel debuggers in the standard kernel that do this. It also means kprobes cannot be set in situations with invalid stack; but that sounds like a reasonable restriction. The stack fault change could minimally impact oops quality, but not very much because stack faults are fairly rare. A better solution would be to use similar logic as the NMI "paranoid" path: check if signal is for user space, if yes go back to entry.S, switch stack, call sync_regs, then do the signal sending etc. But this patch is much simpler and should work too with minimal impact. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14x86/mce: Defer mce wakeups to threads for PREEMPT_RTSteven Rostedt
We had a customer report a lockup on a 3.0-rt kernel that had the following backtrace: [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113 [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56 [ffff88107fca3f50] __wake_up at ffffffff81043379 [ffff88107fca3f80] mce_notify_irq at ffffffff81017328 [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508 [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1 [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853 It actually bugged because the lock was taken by the same owner that already had that lock. What happened was the thread that was setting itself on a wait queue had the lock when an MCE triggered. The MCE interrupt does a wake up on its wait list and grabs the same lock. NOTE: THIS IS NOT A BUG ON MAINLINE Sorry for yelling, but as I Cc'd mainline maintainers I want them to know that this is an PREEMPT_RT bug only. I only Cc'd them for advice. On PREEMPT_RT the wait queue locks are converted from normal "spin_locks" into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be taken by hard interrupt context. This usually isn't a problem as most all interrupts in PREEMPT_RT are converted into schedulable threads. Unfortunately that's not the case with the MCE irq. As wait queue locks are notorious for long hold times, we can not convert them to raw_spin_locks without causing issues with -rt. But Thomas has created a "simple-wait" structure that uses raw spin locks which may have been a good fit. Unfortunately, wait queues are not the only issue, as the mce_notify_irq also does a schedule_work(), which grabs the workqueue spin locks that have the exact same issue. Thus, this patch I'm proposing is to move the actual work of the MCE interrupt into a helper thread that gets woken up on the MCE interrupt and does the work in a schedulable context. NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET Oops, sorry for yelling again, but I want to stress that I keep the same behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the MCE behavior when PREEMPT_RT is configured. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14x86/mce: fix mce timer intervalMike Galbraith
Seems mce timer fire at the wrong frequency in -rt kernels since roughly forever due to 32 bit overflow. 3.8-rt is also missing a multiplier. Add missing us -> ns conversion and 32 bit overflow prevention. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <bitbucket@online.de> [bigeasy: use ULL instead of u64 cast] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14x86: Convert mce timer to hrtimerThomas Gleixner
mce_timer is started in atomic contexts of cpu bringup. This results in might_sleep() warnings on RT. Convert mce_timer to a hrtimer to avoid this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14Reset to 3.12.19Scott Wood
2014-04-10Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"Sebastian Andrzej Siewior
where do I start. Let me explain what is going on here. The code sequence | pushf | pop %edx | or $0x1,%dh | push %edx | mov $0xe0,%eax | popf | sysenter triggers the bug. On 64bit kernel we see the double fault (with 32bit and 64bit userland) and on 32bit kernel there is no problem. The reporter said that double fault does not happen on 64bit kernel with 64bit userland and this is because in that case the VDSO uses the "syscall" interface instead of "sysenter". The bug. "popf" loads the flags with the TF bit set which enables "single stepping" and this leads to a debug exception. Usually on 64bit we have a special IST stack for the debug exception. Due to patch [0] we do not use the IST stack but the kernel stack instead. On 64bit the sysenter instruction starts in kernel with the stack address NULL. The code sequence above enters the debug exception (TF flag) after the sysenter instruction was executed which sets the stack pointer to NULL and we have a fault (it seems that the debug exception saves some bytes on the stack). To fix the double fault I'm going to drop patch [0]. It is completely pointless. In do_debug() and do_stack_segment() we disable preemption which means the task can't leave the CPU. So it does not matter if we run on IST or on kernel stack. There is a patch [1] which drops preempt_disable() call for a 32bit kernel but not for 64bit so there should be no regression. And [1] seems valid even for this code sequence. We enter the debug exception with a 256bytes long per cpu stack and migrate to the kernel stack before calling do_debug(). [0] x86-disable-debug-stack.patch [1] fix-rt-int3-x86_32-3.2-rt.patch Cc: stable-rt@vger.kernel.org Reported-by: Brian Silverman <bsilver16384@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RTAndi Kleen
Normally the x86-64 trap handlers for debug/int 3/stack fault run on a special interrupt stack to make them more robust when dealing with kernel code. The PREEMPT_RT kernel can sleep in locks even while allocating GFP_ATOMIC memory. When one of these trap handlers needs to send real time signals for ptrace it allocates memory and could then try to to schedule. But it is not allowed to schedule on a IST stack. This can cause warnings and hangs. This patch disables the IST stacks for these handlers for PREEMPT_RT kernel. Instead let them run on the normal process stack. The kernel only really needs the ISTs here to make kernel debuggers more robust in case someone sets a break point somewhere where the stack is invalid. But there are no kernel debuggers in the standard kernel that do this. It also means kprobes cannot be set in situations with invalid stack; but that sounds like a reasonable restriction. The stack fault change could minimally impact oops quality, but not very much because stack faults are fairly rare. A better solution would be to use similar logic as the NMI "paranoid" path: check if signal is for user space, if yes go back to entry.S, switch stack, call sync_regs, then do the signal sending etc. But this patch is much simpler and should work too with minimal impact. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10x86/mce: Defer mce wakeups to threads for PREEMPT_RTSteven Rostedt
We had a customer report a lockup on a 3.0-rt kernel that had the following backtrace: [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113 [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56 [ffff88107fca3f50] __wake_up at ffffffff81043379 [ffff88107fca3f80] mce_notify_irq at ffffffff81017328 [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508 [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1 [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853 It actually bugged because the lock was taken by the same owner that already had that lock. What happened was the thread that was setting itself on a wait queue had the lock when an MCE triggered. The MCE interrupt does a wake up on its wait list and grabs the same lock. NOTE: THIS IS NOT A BUG ON MAINLINE Sorry for yelling, but as I Cc'd mainline maintainers I want them to know that this is an PREEMPT_RT bug only. I only Cc'd them for advice. On PREEMPT_RT the wait queue locks are converted from normal "spin_locks" into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be taken by hard interrupt context. This usually isn't a problem as most all interrupts in PREEMPT_RT are converted into schedulable threads. Unfortunately that's not the case with the MCE irq. As wait queue locks are notorious for long hold times, we can not convert them to raw_spin_locks without causing issues with -rt. But Thomas has created a "simple-wait" structure that uses raw spin locks which may have been a good fit. Unfortunately, wait queues are not the only issue, as the mce_notify_irq also does a schedule_work(), which grabs the workqueue spin locks that have the exact same issue. Thus, this patch I'm proposing is to move the actual work of the MCE interrupt into a helper thread that gets woken up on the MCE interrupt and does the work in a schedulable context. NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET Oops, sorry for yelling again, but I want to stress that I keep the same behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the MCE behavior when PREEMPT_RT is configured. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10x86/mce: fix mce timer intervalMike Galbraith
Seems mce timer fire at the wrong frequency in -rt kernels since roughly forever due to 32 bit overflow. 3.8-rt is also missing a multiplier. Add missing us -> ns conversion and 32 bit overflow prevention. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <bitbucket@online.de> [bigeasy: use ULL instead of u64 cast] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10x86: Convert mce timer to hrtimerThomas Gleixner
mce_timer is started in atomic contexts of cpu bringup. This results in might_sleep() warnings on RT. Convert mce_timer to a hrtimer to avoid this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-05perf/x86: Fix event schedulingPeter Zijlstra
commit 26e61e8939b1fe8729572dabe9a9e97d930dd4f6 upstream. Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures, with perf WARN_ON()s triggering. He also provided traces of the failures. This is I think the relevant bit: > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1 > pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800) So we add the insn:p event (fd[23]). At this point we should have: n_events = 2, n_added = 1, n_txn = 1 > pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800) > pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00) We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]). These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so that's not visible. group_sched_in() pmu->start_txn() /* nop - BP pmu */ event_sched_in() event->pmu->add() So here we should end up with: 0: n_events = 3, n_added = 2, n_txn = 2 4: n_events = 4, n_added = 3, n_txn = 3 But seeing the below state on x86_pmu_enable(), the must have failed, because the 0 and 4 events aren't there anymore. Looking at group_sched_in(), since the BP is the leader, its event_sched_in() must have succeeded, for otherwise we would not have seen the sibling adds. But since neither 0 or 4 are in the below state; their event_sched_in() must have failed; but I don't see why, the complete state: 0,0,1:p,4 fits perfectly fine on a core2. However, since we try and schedule 4 it means the 0 event must have succeeded! Therefore the 4 event must have failed, its failure will have put group_sched_in() into the fail path, which will call: event_sched_out() event->pmu->del() on 0 and the BP event. Now x86_pmu_del() will reduce n_events; but it will not reduce n_added; giving what we see below: n_event = 2, n_added = 2, n_txn = 2 > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2 > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0 So the problem is that x86_pmu_del(), when called from a group_sched_in() that fails (for whatever reason), and without x86_pmu TXN support (because the leader is !x86_pmu), will corrupt the n_added state. Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stephane Eranian <eranian@google.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-22x86, smap: Don't enable SMAP if CONFIG_X86_SMAP is disabledH. Peter Anvin
commit 03bbd596ac04fef47ce93a730b8f086d797c3021 upstream. If SMAP support is not compiled into the kernel, don't enable SMAP in CR4 -- in fact, we should clear it, because the kernel doesn't contain the proper STAC/CLAC instructions for SMAP support. Found by Fengguang Wu's test system. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20x86: mm: change tlb_flushall_shift for IvyBridgeMel Gorman
commit f98b7a772ab51b52ca4d2a14362fc0e0c8a2e0f3 upstream. There was a large performance regression that was bisected to commit 611ae8e3 ("x86/tlb: enable tlb flush range support for x86"). This patch simply changes the default balance point between a local and global flush for IvyBridge. In the interest of allowing the tests to be reproduced, this patch was tested using mmtests 0.15 with the following configurations configs/config-global-dhp__tlbflush-performance configs/config-global-dhp__scheduler-performance configs/config-global-dhp__network-performance Results are from two machines Ivybridge 4 threads: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz Ivybridge 8 threads: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz Page fault microbenchmark showed nothing interesting. Ebizzy was configured to run multiple iterations and threads. Thread counts ranged from 1 to NR_CPUS*2. For each thread count, it ran 100 iterations and each iteration lasted 10 seconds. Ivybridge 4 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 6395.44 ( 0.00%) 6789.09 ( 6.16%) Mean 2 7012.85 ( 0.00%) 8052.16 ( 14.82%) Mean 3 6403.04 ( 0.00%) 6973.74 ( 8.91%) Mean 4 6135.32 ( 0.00%) 6582.33 ( 7.29%) Mean 5 6095.69 ( 0.00%) 6526.68 ( 7.07%) Mean 6 6114.33 ( 0.00%) 6416.64 ( 4.94%) Mean 7 6085.10 ( 0.00%) 6448.51 ( 5.97%) Mean 8 6120.62 ( 0.00%) 6462.97 ( 5.59%) Ivybridge 8 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 7336.65 ( 0.00%) 7787.02 ( 6.14%) Mean 2 8218.41 ( 0.00%) 9484.13 ( 15.40%) Mean 3 7973.62 ( 0.00%) 8922.01 ( 11.89%) Mean 4 7798.33 ( 0.00%) 8567.03 ( 9.86%) Mean 5 7158.72 ( 0.00%) 8214.23 ( 14.74%) Mean 6 6852.27 ( 0.00%) 7952.45 ( 16.06%) Mean 7 6774.65 ( 0.00%) 7536.35 ( 11.24%) Mean 8 6510.50 ( 0.00%) 6894.05 ( 5.89%) Mean 12 6182.90 ( 0.00%) 6661.29 ( 7.74%) Mean 16 6100.09 ( 0.00%) 6608.69 ( 8.34%) Ebizzy hits the worst case scenario for TLB range flushing every time and it shows for these Ivybridge CPUs at least that the default choice is a poor on. The patch addresses the problem. Next was a tlbflush microbenchmark written by Alex Shi at http://marc.info/?l=linux-kernel&m=133727348217113 . It measures access costs while the TLB is being flushed. The expectation is that if there are always full TLB flushes that the benchmark would suffer and it benefits from range flushing There are 320 iterations of the test per thread count. The number of entries is randomly selected with a min of 1 and max of 512. To ensure a reasonably even spread of entries, the full range is broken up into 8 sections and a random number selected within that section. iteration 1, random number between 0-64 iteration 2, random number between 64-128 etc This is still a very weak methodology. When you do not know what are typical ranges, random is a reasonable choice but it can be easily argued that the opimisation was for smaller ranges and an even spread is not representative of any workload that matters. To improve this, we'd need to know the probability distribution of TLB flush range sizes for a set of workloads that are considered "common", build a synthetic trace and feed that into this benchmark. Even that is not perfect because it would not account for the time between flushes but there are limits of what can be reasonably done and still be doing something useful. If a representative synthetic trace is provided then this benchmark could be revisited and the shift values retuned. Ivybridge 4 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 10.50 ( 0.00%) 10.50 ( 0.03%) Mean 2 17.59 ( 0.00%) 17.18 ( 2.34%) Mean 3 22.98 ( 0.00%) 21.74 ( 5.41%) Mean 5 47.13 ( 0.00%) 46.23 ( 1.92%) Mean 8 43.30 ( 0.00%) 42.56 ( 1.72%) Ivybridge 8 threads 3.13.0-rc7 3.13.0-rc7 vanilla altshift-v3 Mean 1 9.45 ( 0.00%) 9.36 ( 0.93%) Mean 2 9.37 ( 0.00%) 9.70 ( -3.54%) Mean 3 9.36 ( 0.00%) 9.29 ( 0.70%) Mean 5 14.49 ( 0.00%) 15.04 ( -3.75%) Mean 8 41.08 ( 0.00%) 38.73 ( 5.71%) Mean 13 32.04 ( 0.00%) 31.24 ( 2.49%) Mean 16 40.05 ( 0.00%) 39.04 ( 2.51%) For both CPUs, average access time is reduced which is good as this is the benchmark that was used to tune the shift values in the first place albeit it is now known *how* the benchmark was used. The scheduler benchmarks were somewhat inconclusive. They showed gains and losses and makes me reconsider how stable those benchmarks really are or if something else might be interfering with the test results recently. Network benchmarks were inconclusive. Almost all results were flat except for netperf-udp tests on the 4 thread machine. These results were unstable and showed large variations between reboots. It is unknown if this is a recent problems but I've noticed before that netperf-udp results tend to vary. Based on these results, changing the default for Ivybridge seems like a logical choice. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Alex Shi <alex.shi@linaro.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-cqnadffh1tiqrshthRj3Esge@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06x86, cpu, amd: Add workaround for family 16h, erratum 793Borislav Petkov
commit 3b56496865f9f7d9bcb2f93b44c63f274f08e3b6 upstream. This adds the workaround for erratum 793 as a precaution in case not every BIOS implements it. This addresses CVE-2013-6885. Erratum text: [Revision Guide for AMD Family 16h Models 00h-0Fh Processors, document 51810 Rev. 3.04 November 2013] 793 Specific Combination of Writes to Write Combined Memory Types and Locked Instructions May Cause Core Hang Description Under a highly specific and detailed set of internal timing conditions, a locked instruction may trigger a timing sequence whereby the write to a write combined memory type is not flushed, causing the locked instruction to stall indefinitely. Potential Effect on System Processor core hang. Suggested Workaround BIOS should set MSR C001_1020[15] = 1b. Fix Planned No fix planned [ hpa: updated description, fixed typo in MSR name ] Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10hRobert Richter
commit bee09ed91cacdbffdbcd3b05de8409c77ec9fcd6 upstream. On AMD family 10h we see following error messages while waking up from S3 for all non-boot CPUs leading to a failed IBS initialization: Enabling non-boot CPUs ... smpboot: Booting Node 0 Processor 1 APIC 0x1 [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu perf: IBS APIC setup failed on cpu #1 process: Switch to broadcast mode on CPU1 CPU1 is up ... ACPI: Waking up from system sleep state S3 Reason for this is that during suspend the LVT offset for the IBS vector gets lost and needs to be reinialized while resuming. The offset is read from the IBSCTL msr. On family 10h the offset needs to be 1 as offset 0 is used for the MCE threshold interrupt, but firmware assings it for IBS to 0 too. The kernel needs to reprogram the vector. The msr is a readonly node msr, but a new value can be written via pci config space access. The reinitialization is implemented for family 10h in setup_ibs_ctl() which is forced during IBS setup. This patch fixes IBS setup after waking up from S3 by adding resume/supend hooks for the boot cpu which does the offset reinitialization. Marking it as stable to let distros pick up this fix. Signed-off-by: Robert Richter <rric@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09x86 idle: Repair large-server 50-watt idle-power regressionLen Brown
commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream. Linux 3.10 changed the timing of how thread_info->flags is touched: x86: Use generic idle loop (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) This caused Intel NHM-EX and WSM-EX servers to experience a large number of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote from deep C-states to shallow C-states, which caused these platforms to experience a significant increase in idle power. Note that this issue was already present before the commit above, however, it wasn't seen often enough to be noticed in power measurements. Here we extend an errata workaround from the Core2 EX "Dunnington" to extend to NHM-EX and WSM-EX, to prevent these immediate returns from MWAIT, reducing idle power on these platforms. While only acpi_idle ran on Dunnington, intel_idle may also run on these two newer systems. As of today, there are no other models that are known to need this tweak. Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-29perf/x86: Fix NMI measurementsPeter Zijlstra
OK, so what I'm actually seeing on my WSM is that sched/clock.c is 'broken' for the purpose we're using it for. What triggered it is that my WSM-EP is broken :-( [ 0.001000] tsc: Fast TSC calibration using PIT [ 0.002000] tsc: Detected 2533.715 MHz processor [ 0.500180] TSC synchronization [CPU#0 -> CPU#6]: [ 0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock. [ 0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed For some reason it consistently detects TSC skew, even though NHM+ should have a single clock domain for 'reasonable' systems. This marks sched_clock_stable=0, which means that we do fancy stuff to try and get a 'sane' clock. Part of this fancy stuff relies on the tick, clearly that's gone when NOHZ=y. So for idle cpus time gets stuck, until it either wakes up or gets kicked by another cpu. While this is perfectly fine for the scheduler -- it only cares about actually running stuff, and when we're running stuff we're obviously not idle. This does somewhat break down for perf which can trigger events just fine on an otherwise idle cpu. So I've got NMIs get get 'measured' as taking ~1ms, which actually don't last nearly that long: <idle>-0 [013] d.h. 886.311970: rcu_nmi_enter <-do_nmi ... <idle>-0 [013] d.h. 886.311997: perf_sample_event_took: HERE!!! : 1040990 So ftrace (which uses sched_clock(), not the fancy bits) only sees ~27us, but we measure ~1ms !! Now since all this measurement stuff lives in x86 code, we can actually fix it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mingo@kernel.org Cc: dave.hansen@linux.intel.com Cc: eranian@google.com Cc: Don Zickus <dzickus@redhat.com> Cc: jmario@redhat.com Cc: acme@infradead.org Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-04perf/x86: Clean up cap_user_time* settingPeter Zijlstra
Currently the cap_user_time_zero capability has different tests than cap_user_time; even though they expose the exact same data. Switch from CONSTANT && NONSTOP to sched_clock_stable to also deal with multi cabinet machines and drop the tsc_disabled() check.. non of this will work sanely without tsc anyway. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nmgn0j0muo1r4c94vlfh23xy@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-28Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "A couple of tooling fixlets and a PMU detection printout fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix PMU detection printout when no PMU is detected perf symbols: Demangle cloned functions perf machine: Fix path unpopulated in machine__create_modules() perf tools: Explicitly add libdl dependency perf probe: Fix probing symbols with optimization suffix perf trace: Add mmap2 handler perf kmem: Make it work again on non NUMA machines
2013-09-28perf/x86: Fix PMU detection printout when no PMU is detectedIngo Molnar
Ran into this cryptic PMU bootup log recently: [ 0.124047] Performance Events: [ 0.125000] smpboot: ... Turns out we print this if no PMU is detected. Fall back to the right condition so that the following is printed: [ 0.122381] Performance Events: no PMU driver, software events only. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/n/tip-u2fwaUffakjp0qkpRfqljgsn@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Assorted standalone fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add model number for Avoton Silvermont perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page' perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group() perf: Update ABI comment tools lib lk: Uninclude linux/magic.h in debugfs.c perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms() perf probe: Fix finder to find lines of given function perf session: Check for SIGINT in more loops perf tools: Fix compile with libelf without get_phdrnum perf tools: Fix buildid cache handling of kallsyms with kcore perf annotate: Fix objdump line parsing offset validation perf tools: Fill in new definitions for madvise()/mmap() flags perf tools: Sharpen the libaudit dependencies test
2013-09-23perf/x86/intel: Add model number for Avoton SilvermontYan, Zheng
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Cc: a.p.zijlstra@chello.nl Cc: eranian@google.com Cc: ak@linux.intel.com Link: http://lkml.kernel.org/r/1379837953-17755-1-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'Peter Zijlstra
Solve the problems around the broken definition of perf_event_mmap_page:: cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially fixed by: 860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'") The problem with the fix (merged in v3.12-rc1 and not yet released officially), noticed by Vince Weaver is that the new behavior is not detectable by new user-space, and that due to the reuse of the field names it's easy to mis-compile a binary if old headers are used on a new kernel or new headers are used on an old kernel. To solve all that make this change explicit, detectable and self-contained, by iterating the ABI the following way: - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not confuse old user-space binaries. RDPMC will be marked as unavailable to old binaries but that's within the ABI, this is a capability bit. - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new libraries can reliably detect that bit 0 is deprecated and perma-zero without having to check the kernel version. - Use bits 2, 3, 4 for the newly defined, correct functionality: cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */ cap_user_time : 1, /* The time_* fields are used */ cap_user_time_zero : 1, /* The time_zero field is used */ - Rename all the bitfield names in perf_event.h to be different from the old names, to make sure it's not possible to mis-compile it accidentally with old assumptions. The 'size' field can then be used in the future to add new fields and it will act as a natural ABI version indicator as well. Also adjust tools/perf/ userspace for the new definitions, noticed by Adrian Hunter. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Also-Fixed-by: Adrian Hunter <adrian.hunter@intel.com> Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()Yan, Zheng
uncore_validate_group() can't call smp_processor_id() because it is in preemptible context. Pass NUMA_NO_NODE to the allocator instead. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379400493-11505-1-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-18Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/intel/lpss: Add pin control support to Intel low power subsystem perf/x86/intel: Mark MEM_LOAD_UOPS_MISS_RETIRED as precise on SNB x86: Remove now-unused save_rest() x86/smpboot: Fix announce_cpu() to printk() the last "OK" properly
2013-09-18Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two small fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix UAPI export of PERF_EVENT_IOC_ID perf/x86/intel: Fix Silvermont offcore masks
2013-09-14perf/x86/intel: Mark MEM_LOAD_UOPS_MISS_RETIRED as precise on SNBStephane Eranian
On Intel SNB (SNB, SNB-EP), the event MEM_LOAD_UOPS_MISS_RETIRED supports PEBS. It was missing for the SNB PEBS event constraint table thereby preventing any measurement with PEBS for it. This patch adds the event to the PEBS table for SNB. WARNING: it should be noted that this event like a few others are subject to the erratum BT241 for Xeon E5 (SNB-EP). As such, the event may undercount when used with PEBS unless the workaround is implemented. But without this patch and just the workaround, the kernel would not allow precise sampling on this event. BT241 is documented in: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: zheng.z.yan@intel.com Link: http://lkml.kernel.org/r/20130913201646.GA23981@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes. The -g perf report lockup you reported is only partially addressed, patches that fix the excessive runtime are still being worked on" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix uncore PCI fixed counter handling uprobes: Fix utask->depth accounting in handle_trampoline() perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING perf: Fix up MMAP2 buffer space reservation perf tools: Add attr->mmap2 support perf kvm: Fix sample_type manipulation perf evlist: Fix id pos in perf_evlist__open() perf trace: Handle perf.data files with no tracepoints perf session: Separate progress bar update when processing events perf trace: Check if MAP_32BIT is defined perf hists: Fix formatting of long symbol names perf evlist: Fix parsing with no sample_id_all bit set perf tools: Add test for parsing with no sample_id_all bit perf trace: Check control+C more often
2013-09-12perf/x86/intel: Fix Silvermont offcore masksPeter Zijlstra
Fengguang Wu reported: > sparse warnings: (new ones prefixed by >>) > > >> arch/x86/kernel/cpu/perf_event_intel.c:901:9: sparse: constant 0x768005ffff is so big it is long > >> arch/x86/kernel/cpu/perf_event_intel.c:902:9: sparse: constant 0x768005ffff is so big it is long > > vim +901 arch/x86/kernel/cpu/perf_event_intel.c > > 895 }, > 896 }; > 897 > 898 static struct extra_reg intel_slm_extra_regs[] __read_mostly = > 899 { > 900 /* must define OFFCORE_RSP_X first, see intel_fixup_er() */ > > 901 INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x768005ffff, RSP_0), > > 902 INTEL_UEVENT_EXTRA_REG(0x02b7, MSR_OFFCORE_RSP_1, 0x768005ffff, RSP_1), > 903 EVENT_EXTRA_END > 904 }; > 905 Extend those constants to 64 bits. Reported-by: fengguang.wu@intel.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130909112636.GQ31370@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12perf/x86: Fix uncore PCI fixed counter handlingStephane Eranian
There was a bug in the handling of SNB-EP/IVB-EP uncore PCI fixed counters, e.g., IMC. It would cause erratic values to be returned for the IMC clockticks event. This was due to a bogus hwc->config value which was then written to PCI config space. The erratic values can be seen via: $ perf stat -a -C 0 -e uncore_imc_0/clockticks/ -I 1000 sleep 10 The fixed counter has most fields marked as reserved with hw reset values of 0. Yet the kernel was defaulting to a hwc->config = ~0 and that was causing the issues. This patch sets the hwc->config values for fixed uncore event to 0. Now, the values of IMC clockticks is correct. Signed-off-by: Stephane Eranian <eranian@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: peterz@infradead.org Cc: zheng.z.yan@intel.com Link: http://lkml.kernel.org/r/20130909195350.GA17643@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDINGStephane Eranian
The IvyBridge event CYCLE_ACTIVITY:CYCLES_LDM_PENDING can only be measured on counters 0-3 when HT is off. When HT is on, you only have counters 0-3. If you program it on the eight counters for 1s on a 3GHz IVB laptop running a noploop, you see: 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING 3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING Clearly the last 4 values are bogus. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: zheng.z.yan@intel.com Cc: dhsharp@google.com Link: http://lkml.kernel.org/r/20130911152222.GA28761@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-11mm: vmstats: track TLB flush stats on UP tooDave Hansen
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-04Merge branch 'x86-ras-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS changes from Ingo Molnar: "[ The reason for drivers/ updates is that Boris asked for the drivers/edac/ changes to go via x86/ras in this cycle ] Main changes: - AMD CPUs: . Add ECC event decoding support for new F15h models . Various erratum fixes . Fix single-channel on dual-channel-controllers bug. - Intel CPUs: . UC uncorrectable memory error parsing fix . Add support for CMC (Corrected Machine Check) 'FF' (Firmware First) flag in the APEI HEST - Various cleanups and fixes" * 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: amd64_edac: Fix incorrect wraparounds amd64_edac: Correct erratum 505 range cpc925_edac: Use proper array termination x86/mce, acpi/apei: Only disable banks listed in HEST if mce is configured amd64_edac: Get rid of boot_cpu_data accesses amd64_edac: Add ECC decoding support for newer F15h models x86, amd_nb: Clarify F15h, model 30h GART and L3 support pci_ids: Add PCI device ID functions 3 and 4 for newer F15h models. x38_edac: Make a local function static i3200_edac: Make a local function static x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errors APEI/ERST: Fix error message formatting amd64_edac: Fix single-channel setups EDAC: Replace strict_strtol() with kstrtol() mce: acpi/apei: Soft-offline a page on firmware GHES notification mce: acpi/apei: Add a boot option to disable ff mode for corrected errors mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC
2013-09-04Merge branch 'x86-paravirt-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt changes from Ingo Molnar: "Hypervisor signature detection cleanup and fixes - the goal is to make KVM guests run better on MS/Hyperv and to generalize and factor out the code a bit" * 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Correctly detect hypervisor x86, kvm: Switch to use hypervisor_cpuid_base() xen: Switch to use hypervisor_cpuid_base() x86: Introduce hypervisor_cpuid_base()
2013-09-04Merge branch 'x86-asmlinkage-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/asmlinkage changes from Ingo Molnar: "As a preparation for Andi Kleen's LTO patchset (link time optimizations using GCC's -flto which build time optimization has steadily increased in quality over the past few years and might eventually be usable for the kernel too) this tree includes a handful of preparatory patches that make function calling convention annotations consistent again: - Mark every function without arguments (or 64bit only) that is used by assembly code with asmlinkage() - Mark every function with parameters or variables that is used by assembly code as __visible. For the vanilla kernel this has documentation, consistency and debuggability advantages, for the time being" * 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asmlinkage: Fix warning in xen asmlinkage change x86, asmlinkage, vdso: Mark vdso variables __visible x86, asmlinkage, power: Make various symbols used by the suspend asm code visible x86, asmlinkage: Make dump_stack visible x86, asmlinkage: Make 64bit checksum functions visible x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops x86, asmlinkage, apm: Make APM data structure used from assembler visible x86, asmlinkage: Make syscall tables visible x86, asmlinkage: Make several variables used from assembler/linker script visible x86, asmlinkage: Make kprobes code visible and fix assembler code x86, asmlinkage: Make various syscalls asmlinkage x86, asmlinkage: Make 32bit/64bit __switch_to visible x86, asmlinkage: Make _*_start_kernel visible x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible x86, asmlinkage: Change dotraplinkage into __visible on 32bit x86: Fix sys_call_table type in asm/syscall.h
2013-09-02perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()Joe Perches
Use the convenience function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/f58599ae1a8d7b32d37e9cf283e95fba6452f7f6.1377809875.git.joe@perches.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02perf/x86: Add Silvermont (22nm Atom) supportYan, Zheng
Compared to old atom, Silvermont has offcore and has more events that support PEBS. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1374138144-17278-2-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_XYan, Zheng
Silvermont (22nm Atom) has two offcore response configuration MSRs, unlike other Intel CPU, its event code for MSR_OFFCORE_RSP_1 is 0x02b7. To avoid complicating intel_fixup_er(), use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X. So intel_fixup_er() can find the event code for OFFCORE_RSP_N by x86_pmu.extra_regs[N].event. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1374138144-17278-1-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-29Merge branch 'linus' into perf/coreIngo Molnar
Pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-19Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Two AMD microcode loader fixes and an OLPC firmware support fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode, AMD: Fix early microcode loading x86, microcode, AMD: Make cpu_has_amd_erratum() use the correct struct cpuinfo_x86 x86: Don't clear olpc_ofw_header when sentinel is detected
2013-08-16perf/x86/intel/uncore: Enable EV_SEL_EXT bit for PCUYan, Zheng
This patch adds support for the SNB-EP PCU uncore PMU extra_sel_bit (bit 21) which is missing from the documentation in Table-2.75 of Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide. It is referred to later in Table-2.81. Without this selection bit explicitly enabled by the kernel, some events such as COREx_TRANSITION_CYCLES do not count correctly. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1376375382-21350-4-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-16perf/x86/intel/uncore: Add filter support for QPI boxesYan, Zheng
The QPI uncore boxes have two pairs of MATCH/MASK registers that user to filter packet traffic serviced by QPI link layer. These registers are in auxiliary PCI devices. This patch adds the auxiliary PCI devices to snbep_uncore_pci_ids and adds field definitions for the MATCH/MASK registers. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375856245-10717-2-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-16perf/x86/intel/uncore: Add auxiliary pci device supportYan, Zheng
The QPI uncore boxes have two pairs of MATCH/MASK registers that user to filter packet traffic serviced by QPI link layer. These registers are in auxiliary PCI devices. This patch changes the meaning of (struct pci_device_id)->driver_data. The first 8 bits are device index of the same uncore type, the second 8 bytes are uncore type index. Auxiliary PCI device's type is defined as UNCORE_EXTRA_PCI_DEV(0xff) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375856245-10717-1-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-15Merge tag 'v3.11-rc5' into perf/coreIngo Molnar
Merge Linux 3.11-rc5, to sync up with the latest upstream fixes since -rc1. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12Merge tag 'please-pull-mce-f-bit' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull MCE-uncorrected-error fix from Tony Luck: "Bit 12 may or may not be set in MCi_STATUS.MCACOD when an uncorrected error is reported. Ignore it when checking error signatures." Signed-off-by: Ingo Molnar <mingo@kernel.org>