Age | Commit message (Collapse) | Author |
|
commit 26e61e8939b1fe8729572dabe9a9e97d930dd4f6 upstream.
Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures,
with perf WARN_ON()s triggering. He also provided traces of the failures.
This is I think the relevant bit:
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1
> pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800)
So we add the insn:p event (fd[23]).
At this point we should have:
n_events = 2, n_added = 1, n_txn = 1
> pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800)
> pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00)
We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]).
These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so
that's not visible.
group_sched_in()
pmu->start_txn() /* nop - BP pmu */
event_sched_in()
event->pmu->add()
So here we should end up with:
0: n_events = 3, n_added = 2, n_txn = 2
4: n_events = 4, n_added = 3, n_txn = 3
But seeing the below state on x86_pmu_enable(), the must have failed,
because the 0 and 4 events aren't there anymore.
Looking at group_sched_in(), since the BP is the leader, its
event_sched_in() must have succeeded, for otherwise we would not have
seen the sibling adds.
But since neither 0 or 4 are in the below state; their event_sched_in()
must have failed; but I don't see why, the complete state: 0,0,1:p,4
fits perfectly fine on a core2.
However, since we try and schedule 4 it means the 0 event must have
succeeded! Therefore the 4 event must have failed, its failure will
have put group_sched_in() into the fail path, which will call:
event_sched_out()
event->pmu->del()
on 0 and the BP event.
Now x86_pmu_del() will reduce n_events; but it will not reduce n_added;
giving what we see below:
n_event = 2, n_added = 2, n_txn = 2
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0
So the problem is that x86_pmu_del(), when called from a
group_sched_in() that fails (for whatever reason), and without x86_pmu
TXN support (because the leader is !x86_pmu), will corrupt the n_added
state.
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit c091c71ad2218fc50a07b3d1dab85783f3b77efd upstream.
GFP_ATOMIC is not a single gfp flag, but a macro which expands to the other
flags, where meaningful is the LACK of __GFP_WAIT flag. To check if caller
wants to perform an atomic allocation, the code must test for a lack of the
__GFP_WAIT flag. This patch fixes the issue introduced in v3.5-rc1.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit a08d3b3b99efd509133946056531cdf8f3a0c09b upstream.
The problem occurs when the guest performs a pusha with the stack
address pointing to an mmio address (or an invalid guest physical
address) to start with, but then extending into an ordinary guest
physical address. When doing repeated emulated pushes
emulator_read_write sets mmio_needed to 1 on the first one. On a
later push when the stack points to regular memory,
mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.
As a result, KVM exits to userspace, and then returns to
complete_emulated_mmio. In complete_emulated_mmio
vcpu->mmio_cur_fragment is incremented. The termination condition of
vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
The code bounces back and fourth to userspace incrementing
mmio_cur_fragment past it's buffer. If the guest does nothing else it
eventually leads to a a crash on a memcpy from invalid memory address.
However if a guest code can cause the vm to be destroyed in another
vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
can be used by the guest to control the data that's pointed to by the
call to cancel_work_item, which can be used to gain execution.
Fixes: f78146b0f9230765c6315b2e14f56112513389ad
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 87fbb2ac6073a7039303517546a76074feb14c84 upstream.
When the conversion was made to remove stop machine and use the breakpoint
logic instead, the modification of the function graph caller is still
done directly as though it was being done under stop machine.
As it is not converted via stop machine anymore, there is a possibility
that the code could be layed across cache lines and if another CPU is
accessing that function graph call when it is being updated, it could
cause a General Protection Fault.
Convert the update of the function graph caller to use the breakpoint
method as well.
Cc: H. Peter Anvin <hpa@zytor.com>
Fixes: 08d636b6d4fb "ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4640c7ee9b8953237d05a61ea3ea93981d1bc961 upstream.
If CONFIG_X86_SMAP is disabled, smap_violation() tests for conditions
which are incorrect (as the AC flag doesn't matter), causing spurious
faults.
The dynamic disabling of SMAP (nosmap on the command line) is fine
because it disables X86_FEATURE_SMAP, therefore causing the
static_cpu_has() to return false.
Found by Fengguang Wu's test system.
[ v3: move all predicates into smap_violation() ]
[ v2: use IS_ENABLED() instead of #ifdef ]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 03bbd596ac04fef47ce93a730b8f086d797c3021 upstream.
If SMAP support is not compiled into the kernel, don't enable SMAP in
CR4 -- in fact, we should clear it, because the kernel doesn't contain
the proper STAC/CLAC instructions for SMAP support.
Found by Fengguang Wu's test system.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a9c8e4beeeb64c22b84c803747487857fe424b68 upstream.
Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:
BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
page flags: 0x2ffc0000000014(referenced|dirty)
addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
Call Trace:
dump_stack+0x45/0x56
print_bad_pte+0x22e/0x250
unmap_single_vma+0x583/0x890
unmap_vmas+0x65/0x90
exit_mmap+0xc5/0x170
mmput+0x65/0x100
do_exit+0x393/0x9e0
do_group_exit+0xcc/0x140
SyS_exit_group+0x14/0x20
system_call_fastpath+0x1a/0x1f
Disabling lock debugging due to kernel taint
BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests. He
bisected the problem to commit 1667918b6483 ("mm: numa: clear numa
hinting information on mprotect") that was also included in 3.12-stable.
The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit. This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.
[mgorman@suse.de: wrote changelog, proposed minor modifications]
[akpm@linux-foundation.org: fix typo in comment]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f98b7a772ab51b52ca4d2a14362fc0e0c8a2e0f3 upstream.
There was a large performance regression that was bisected to
commit 611ae8e3 ("x86/tlb: enable tlb flush range support for
x86"). This patch simply changes the default balance point
between a local and global flush for IvyBridge.
In the interest of allowing the tests to be reproduced, this
patch was tested using mmtests 0.15 with the following
configurations
configs/config-global-dhp__tlbflush-performance
configs/config-global-dhp__scheduler-performance
configs/config-global-dhp__network-performance
Results are from two machines
Ivybridge 4 threads: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
Ivybridge 8 threads: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Page fault microbenchmark showed nothing interesting.
Ebizzy was configured to run multiple iterations and threads.
Thread counts ranged from 1 to NR_CPUS*2. For each thread count,
it ran 100 iterations and each iteration lasted 10 seconds.
Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 6395.44 ( 0.00%) 6789.09 ( 6.16%)
Mean 2 7012.85 ( 0.00%) 8052.16 ( 14.82%)
Mean 3 6403.04 ( 0.00%) 6973.74 ( 8.91%)
Mean 4 6135.32 ( 0.00%) 6582.33 ( 7.29%)
Mean 5 6095.69 ( 0.00%) 6526.68 ( 7.07%)
Mean 6 6114.33 ( 0.00%) 6416.64 ( 4.94%)
Mean 7 6085.10 ( 0.00%) 6448.51 ( 5.97%)
Mean 8 6120.62 ( 0.00%) 6462.97 ( 5.59%)
Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 7336.65 ( 0.00%) 7787.02 ( 6.14%)
Mean 2 8218.41 ( 0.00%) 9484.13 ( 15.40%)
Mean 3 7973.62 ( 0.00%) 8922.01 ( 11.89%)
Mean 4 7798.33 ( 0.00%) 8567.03 ( 9.86%)
Mean 5 7158.72 ( 0.00%) 8214.23 ( 14.74%)
Mean 6 6852.27 ( 0.00%) 7952.45 ( 16.06%)
Mean 7 6774.65 ( 0.00%) 7536.35 ( 11.24%)
Mean 8 6510.50 ( 0.00%) 6894.05 ( 5.89%)
Mean 12 6182.90 ( 0.00%) 6661.29 ( 7.74%)
Mean 16 6100.09 ( 0.00%) 6608.69 ( 8.34%)
Ebizzy hits the worst case scenario for TLB range flushing every
time and it shows for these Ivybridge CPUs at least that the
default choice is a poor on. The patch addresses the problem.
Next was a tlbflush microbenchmark written by Alex Shi at
http://marc.info/?l=linux-kernel&m=133727348217113 . It
measures access costs while the TLB is being flushed. The
expectation is that if there are always full TLB flushes that
the benchmark would suffer and it benefits from range flushing
There are 320 iterations of the test per thread count. The
number of entries is randomly selected with a min of 1 and max
of 512. To ensure a reasonably even spread of entries, the full
range is broken up into 8 sections and a random number selected
within that section.
iteration 1, random number between 0-64
iteration 2, random number between 64-128 etc
This is still a very weak methodology. When you do not know
what are typical ranges, random is a reasonable choice but it
can be easily argued that the opimisation was for smaller ranges
and an even spread is not representative of any workload that
matters. To improve this, we'd need to know the probability
distribution of TLB flush range sizes for a set of workloads
that are considered "common", build a synthetic trace and feed
that into this benchmark. Even that is not perfect because it
would not account for the time between flushes but there are
limits of what can be reasonably done and still be doing
something useful. If a representative synthetic trace is
provided then this benchmark could be revisited and the shift values retuned.
Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 10.50 ( 0.00%) 10.50 ( 0.03%)
Mean 2 17.59 ( 0.00%) 17.18 ( 2.34%)
Mean 3 22.98 ( 0.00%) 21.74 ( 5.41%)
Mean 5 47.13 ( 0.00%) 46.23 ( 1.92%)
Mean 8 43.30 ( 0.00%) 42.56 ( 1.72%)
Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 9.45 ( 0.00%) 9.36 ( 0.93%)
Mean 2 9.37 ( 0.00%) 9.70 ( -3.54%)
Mean 3 9.36 ( 0.00%) 9.29 ( 0.70%)
Mean 5 14.49 ( 0.00%) 15.04 ( -3.75%)
Mean 8 41.08 ( 0.00%) 38.73 ( 5.71%)
Mean 13 32.04 ( 0.00%) 31.24 ( 2.49%)
Mean 16 40.05 ( 0.00%) 39.04 ( 2.51%)
For both CPUs, average access time is reduced which is good as
this is the benchmark that was used to tune the shift values in
the first place albeit it is now known *how* the benchmark was
used.
The scheduler benchmarks were somewhat inconclusive. They
showed gains and losses and makes me reconsider how stable those
benchmarks really are or if something else might be interfering
with the test results recently.
Network benchmarks were inconclusive. Almost all results were
flat except for netperf-udp tests on the 4 thread machine.
These results were unstable and showed large variations between
reboots. It is unknown if this is a recent problems but I've
noticed before that netperf-udp results tend to vary.
Based on these results, changing the default for Ivybridge seems
like a logical choice.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Alex Shi <alex.shi@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-cqnadffh1tiqrshthRj3Esge@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 24f91eba18bbfdb27e71a1aae5b3a61b67fcd091 upstream.
The SOFT_DIRTY bit shows that the content of memory was changed after a
defined point in the past. mprotect() doesn't change the content of
memory, so it must not change the SOFT_DIRTY bit.
This bug causes a malfunction: on the first iteration all pages are
dumped. On other iterations only pages with the SOFT_DIRTY bit are
dumped. So if the SOFT_DIRTY bit is cleared from a page by mistake, the
page is not dumped and its content will be restored incorrectly.
This patch does nothing with _PAGE_SWP_SOFT_DIRTY, becase pte_modify()
is called only for present pages.
Fixes commit 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes
tracking").
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 51c71a3bbaca868043cc45b3ad3786dd48a90235 upstream.
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:
xen_platform_pci=0
(in the guest config file)
or
xen_emul_unplug=never
(on the Linux command line)
except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:
input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>] [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
[<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
[<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
[<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
[<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
[<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
[<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
[<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
[<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
[<ffffffff8145e7d9>] driver_attach+0x19/0x20
[<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
[<ffffffff8145f1ff>] driver_register+0x5f/0xf0
[<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
[<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
[<ffffffffa0015000>] ? 0xffffffffa0014fff
[<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
[<ffffffff81002049>] do_one_initcall+0x49/0x170
.. snip..
which is hardly nice. This patch fixes this by having each
PV driver check for:
- if running in PV, then it is fine to execute (as that is their
native environment).
- if running in HVM, check if user wanted 'xen_emul_unplug=never',
in which case bail out and don't load any PV drivers.
- if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
does not exist, then bail out and not load PV drivers.
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
then bail out for all PV devices _except_ the block one.
Ditto for the network one ('nics').
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
then load block PV driver, and also setup the legacy IDE paths.
In (v3) make it actually load PV drivers.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3b56496865f9f7d9bcb2f93b44c63f274f08e3b6 upstream.
This adds the workaround for erratum 793 as a precaution in case not
every BIOS implements it. This addresses CVE-2013-6885.
Erratum text:
[Revision Guide for AMD Family 16h Models 00h-0Fh Processors,
document 51810 Rev. 3.04 November 2013]
793 Specific Combination of Writes to Write Combined Memory Types and
Locked Instructions May Cause Core Hang
Description
Under a highly specific and detailed set of internal timing
conditions, a locked instruction may trigger a timing sequence whereby
the write to a write combined memory type is not flushed, causing the
locked instruction to stall indefinitely.
Potential Effect on System
Processor core hang.
Suggested Workaround
BIOS should set MSR
C001_1020[15] = 1b.
Fix Planned
No fix planned
[ hpa: updated description, fixed typo in MSR name ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic
Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit aee636c4809fa54848ff07a899b326eb1f9987a2 ]
At first Jakub Zawadzki noticed that some divisions by reciprocal_divide
were not correct. (off by one in some cases)
http://www.wireshark.org/~darkjames/reciprocal-buggy.c
He could also show this with BPF:
http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
The reciprocal divide in linux kernel is not generic enough,
lets remove its use in BPF, as it is not worth the pain with
current cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Daniel Borkmann <dxchgb@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 77f01bdfa5e55dc19d3eb747181d2730a9bb3ca8 upstream.
When Hyper-V hypervisor leaves are present, KVM must relocate
its own leaves at 0x40000100, because Windows does not look for
Hyper-V leaves at indices other than 0x40000000. In this case,
the KVM features are at 0x40000101, but the old code would always
look at 0x40000001.
Fix by using kvm_cpuid_base(). This also requires making the
function non-inline, since kvm_cpuid_base() is static.
Fixes: 1085ba7f552d84aa8ac0ae903fa8d0cc2ff9f79d
Cc: mtosatti@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c300a40772dae829b91dad634999a6a522c0829 upstream.
It is unnecessary to go through hypervisor_cpuid_base every time
a leaf is found (which will be every time a feature is requested
after the next patch).
Fixes: 1085ba7f552d84aa8ac0ae903fa8d0cc2ff9f79d
Cc: mtosatti@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ed96e87c5748de4c2807ef17e81287c7304186c upstream.
Limit PIT timer frequency similarly to the limit applied by
LAPIC timer.
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a7f84f03f660d93574ac88835d056c0d6468aebe upstream.
Current code check boot service region with kernel text region by:
start+size >= __pa_symbol(_text)
The end of the above region should be start + size - 1 instead.
I see this problem in ovmf + Fedora 19 grub boot:
text start: 1000000 md start: 800000 md size: 800000
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0dce7cd67fd9055c4a2ff278f8af1431e646d346 upstream.
Commit e66d2ae7c67bd moved the assignment
vcpu->arch.apic_base = value above a condition with
(vcpu->arch.apic_base ^ value), causing that check
to always fail. Use old_value, vcpu->arch.apic_base's
old value, in the condition instead.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1739f09e33d8f66bf48ddbc3eca615574da6c4f6 upstream.
Function tracing callbacks expect to have the ftrace_ops that registered it
passed to them, not the address of the variable that holds the ftrace_ops
that registered it.
Use a mov instead of a lea to store the ftrace_ops into the parameter
of the function tracing callback.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bee09ed91cacdbffdbcd3b05de8409c77ec9fcd6 upstream.
On AMD family 10h we see following error messages while waking up from
S3 for all non-boot CPUs leading to a failed IBS initialization:
Enabling non-boot CPUs ...
smpboot: Booting Node 0 Processor 1 APIC 0x1
[Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
perf: IBS APIC setup failed on cpu #1
process: Switch to broadcast mode on CPU1
CPU1 is up
...
ACPI: Waking up from system sleep state S3
Reason for this is that during suspend the LVT offset for the IBS
vector gets lost and needs to be reinialized while resuming.
The offset is read from the IBSCTL msr. On family 10h the offset needs
to be 1 as offset 0 is used for the MCE threshold interrupt, but
firmware assings it for IBS to 0 too. The kernel needs to reprogram
the vector. The msr is a readonly node msr, but a new value can be
written via pci config space access. The reinitialization is
implemented for family 10h in setup_ibs_ctl() which is forced during
IBS setup.
This patch fixes IBS setup after waking up from S3 by adding
resume/supend hooks for the boot cpu which does the offset
reinitialization.
Marking it as stable to let distros pick up this fix.
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 26bef1318adc1b3a530ecc807ef99346db2aa8b0 upstream.
Before we do an EMMS in the AMD FXSAVE information leak workaround we
need to clear any pending exceptions, otherwise we trap with a
floating-point exception inside this code.
Reported-by: halfdog <me@halfdog.net>
Tested-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/CA%2B55aFxQnY_PCG_n4=0w-VG=YLXL-yr7oMxyy0WU2gCBAf3ydg@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 20841405940e7be0617612d521e206e4b6b325db upstream.
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2b4847e73004c10ae6666c2e27b5c5430aed8698 upstream.
Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.
THP does not unmap pages due to a lack of support for migration entries
at a PMD level. This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().
This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side the page table lock is taken for each PTE update.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e66d2ae7c67bd9ac982a3d1890564de7f7eabf4b upstream.
Update arch.apic_base before triggering recalculate_apic_map. Otherwise
the recalculation will work against the previous state of the APIC and
will fail to build the correct map when an APIC is hardware-enabled
again.
This fixes a regression of 1e08ec4a13.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 29bf08f12b2fd72b882da0d85b7385e4a438a297 upstream.
Three reasons for doing this: 1. arch.walk_mmu points to arch.mmu anyway
in case nested EPT wasn't in use. 2. this aligns VMX with SVM. But 3. is
most important: nested_cpu_has_ept(vmcs12) queries the VMCS page, and if
one guest VCPU manipulates the page of another VCPU in L2, we may be
fooled to skip over the nested_ept_uninit_mmu_context, leaving mmu in
nested state. That can crash the host later on if nested_ept_get_cr3 is
invoked while L1 already left vmxon and nested.current_vmcs12 became
NULL therefore.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream.
Linux 3.10 changed the timing of how thread_info->flags is touched:
x86: Use generic idle loop
(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.
Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.
Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.
While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.
Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8b3b005d675726e38bc504d2e35a991e55819155 upstream.
In checkin
5551a34e5aea x86-64, build: Always pass in -mno-sse
we unconditionally added -mno-sse to the main build, to keep newer
compilers from generating SSE instructions from autovectorization.
However, this did not extend to the special environments
(arch/x86/boot, arch/x86/boot/compressed, and arch/x86/realmode/rm).
Add -mno-sse to the compiler command line for these environments, and
add -mno-mmx to all the environments as well, as we don't want a
compiler to generate MMX code either.
This patch also removes a $(cc-option) call for -m32, since we have
long since stopped supporting compilers too old for the -m32 option,
and in fact hardcode it in other places in the Makefiles.
Reported-by: Kevin B. Smith <kevin.b.smith@intel.com>
Cc: Sunil K. Pandey <sunil.k.pandey@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Link: http://lkml.kernel.org/n/tip-j21wzqv790q834n7yc6g80j1@git.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
boot option is used
commit 3eae49ca8954f958b2001ab5643ef302cb7b67c7 upstream.
The SGI UV tlb shootdown code panics the system with a NULL
pointer deference if 'nobau' is specified on the boot
commandline.
uv_flush_tlb_other() gets called for every flush, whether the
BAU is disabled or not. It should not be keeping the s_enters
statistic while the BAU is disabled.
The panic occurs because during initialization
init_per_cpu_tunables() does not set the bcp->statp pointer if
'nobau' was specified.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Link: http://lkml.kernel.org/r/E1VnzBi-0005yF-MU@eag09.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 04bf9ba720fcc4fa313fa122b799ae0989b6cd50 upstream.
UEFI time services are often broken once we're in virtual mode. We were
already refusing to use them on 64-bit systems, but it turns out that
they're also broken on some 32-bit firmware, including the Dell Venue.
Disable them for now, we can revisit once we have the 1:1 mappings code
incorporated.
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Link: http://lkml.kernel.org/r/1385754283-2464-1-git-send-email-matthew.garrett@nebula.com
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 17d68b763f09a9ce824ae23eb62c9efc57b69271 upstream.
A guest can cause a BUG_ON() leading to a host kernel crash.
When the guest writes to the ICR to request an IPI, while in x2apic
mode the following things happen, the destination is read from
ICR2, which is a register that the guest can control.
kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the
cluster id. A BUG_ON is triggered, which is a protection against
accessing map->logical_map with an out-of-bounds access and manages
to avoid that anything really unsafe occurs.
The logic in the code is correct from real HW point of view. The problem
is that KVM supports only one cluster with ID 0 in clustered mode, but
the code that has the bug does not take this into account.
Reported-by: Lars Bull <larsbull@google.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fda4e2e85589191b123d31cdc21fd33ee70f50fd upstream.
In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
potential to corrupt kernel memory if userspace provides an address that
is at the end of a page. This patches concerts those functions to use
kvm_write_guest_cached and kvm_read_guest_cached. It also checks the
vapic_address specified by userspace during ioctl processing and returns
an error to userspace if the address is not a valid GPA.
This is generally not guest triggerable, because the required write is
done by firmware that runs before the guest. Also, it only affects AMD
processors and oldish Intel that do not have the FlexPriority feature
(unless you disable FlexPriority, of course; then newer processors are
also affected).
Fixes: b93463aa59d6 ('KVM: Accelerated apic support')
Reported-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b963a22e6d1a266a67e9eecc88134713fd54775c upstream.
Under guest controllable circumstances apic_get_tmcct will execute a
divide by zero and cause a crash. If the guest cpuid support
tsc deadline timers and performs the following sequence of requests
the host will crash.
- Set the mode to periodic
- Set the TMICT to 0
- Set the mode bits to 11 (neither periodic, nor one shot, nor tsc deadline)
- Set the TMICT to non-zero.
Then the lapic_timer.period will be 0, but the TMICT will not be. If the
guest then reads from the TMCCT then the host will perform a divide by 0.
This patch ensures that if the lapic_timer.period is 0, then the division
does not occur.
Reported-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5551a34e5aeab868f8d37f70d8754868921b4ee5 upstream.
Always pass in the -mno-sse argument, regardless if
-preferred-stack-boundary is supported. We never want to generate SSE
instructions in the kernel unless we *really* know what we're doing.
According to H. J. Lu, any version of gcc new enough that we support
it at all should handle the -mno-sse option, so just add it
unconditionally.
Reported-by: Kevin B. Smith <kevin.b.smith@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Link: http://lkml.kernel.org/n/tip-j21wzqv790q834n7yc6g80j1@git.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 98bbc06aabac5a2dcc46580d20c59baf8ebe479f ]
sk_filter isn't freed if bpf_func is equal to sk_run_filter.
This memory leak was introduced by v3.12-rc3-224-gd45ed4a4
"net: fix unsafe set_memory_rw from softirq".
Before this patch sk_filter was freed in sk_filter_release_rcu,
now it should be freed in bpf_jit_free.
Here is output of kmemleak:
unreferenced object 0xffff8800b774eab0 (size 128):
comm "systemd", pid 1, jiffies 4294669014 (age 124.062s)
hex dump (first 32 bytes):
00 00 00 00 0b 00 00 00 20 63 7f b7 00 88 ff ff ........ c......
60 d4 55 81 ff ff ff ff 30 d9 55 81 ff ff ff ff `.U.....0.U.....
backtrace:
[<ffffffff816444be>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811845af>] __kmalloc+0xef/0x260
[<ffffffff81534028>] sock_kmalloc+0x38/0x60
[<ffffffff8155d4dd>] sk_attach_filter+0x5d/0x190
[<ffffffff815378a1>] sock_setsockopt+0x991/0x9e0
[<ffffffff81531bd6>] SyS_setsockopt+0xb6/0xd0
[<ffffffff8165f3e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
v2: add extra { } after else
Fixes: d45ed4a4e33a ("net: fix unsafe set_memory_rw from softirq")
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ab4ead02ec235d706d0611d8741964628291237e upstream.
In commit 8a4d0a687a59 "ftrace: Use breakpoint method to update ftrace
caller", we choose to use breakpoint method to update the ftrace
caller. But we also need to skip over the breakpoint in function
ftrace_int3_handler() for them. Otherwise weird things would happen.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit daf727225b8abfdfe424716abac3d15a3ac5626a upstream.
When I was looking at RHEL5.9's failure to start with
unrestricted_guest=0/emulate_invalid_guest_state=1, I got it working with a
slightly older tree than kvm.git. I now debugged the remaining failure,
which was introduced by commit 660696d1 (KVM: X86 emulator: fix
source operand decoding for 8bit mov[zs]x instructions, 2013-04-24)
introduced a similar mis-emulation to the one in commit 8acb4207 (KVM:
fix sil/dil/bpl/spl in the mod/rm fields, 2013-05-30). The incorrect
decoding occurs in 8-bit movzx/movsx instructions whose 8-bit operand
is sil/dil/bpl/spl.
Needless to say, "movzbl %bpl, %eax" does occur in RHEL5.9's decompression
prolog, just a handful of instructions before finally giving control to
the decompressed vmlinux and getting out of the invalid guest state.
Because OpMem8 bypasses decode_modrm, the same handling of the REX prefix
must be applied to OpMem8.
Reported-by: Michele Baldessari <michele@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
as an error
commit 11f918d3e2d3861b6931e97b3aa778e4984935aa upstream.
Do it the same way as done in microcode_intel.c: use pr_debug()
for missing firmware files.
There seem to be CPUs out there for which no microcode update
has been submitted to kernel-firmware repo yet resulting in
scary sounding error messages in dmesg:
microcode: failed to load file amd-ucode/microcode_amd_fam16h.bin
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1384274383-43510-1-git-send-email-trenn@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 522e66464467543c0d88d023336eec4df03ad40b upstream.
In reboot and crash path, when we shut down the local APIC, the I/O APIC is
still active. This may cause issues because external interrupts
can still come in and disturb the local APIC during shutdown process.
To quiet external interrupts, disable I/O APIC before shutdown local APIC.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1382578212-4677-1-git-send-email-fenghua.yu@intel.com
[ I suppose the 'issue' is a hang during shutdown. It's a fine change nevertheless. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ea8117478918a4734586d35ff530721b682425be upstream.
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.
The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).
Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).
Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: lenb@kernel.org
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Two fixes:
- Fix 'NMI handler took too long to run' false positives
[ Genuine NMI overhead speedups will come for v3.13, this commit
only fixes a measurement bug ]
- Fix perf ring-buffer missed barrier causing (rare) ring-buffer data
corruption on ppc64"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix NMI measurements
perf: Fix perf ring buffer memory ordering
|
|
Merge three fixes from Andrew Morton.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
percpu: fix this_cpu_sub() subtrahend casting for unsigneds
mm/pagewalk.c: fix walk_page_range() access of wrong PTEs
|
|
this_cpu_sub() is implemented as negation and addition.
This patch casts the adjustment to the counter type before negation to
sign extend the adjustment. This helps in cases where the counter type
is wider than an unsigned adjustment. An alternative to this patch is
to declare such operations unsupported, but it seemed useful to avoid
surprises.
This patch specifically helps the following example:
unsigned int delta = 1
preempt_disable()
this_cpu_write(long_counter, 0)
this_cpu_sub(long_counter, delta)
preempt_enable()
Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff. This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
long_counter = 0 + 0xffffffff
Also apply the same cast to:
__this_cpu_sub()
__this_cpu_sub_return()
this_cpu_sub_return()
All percpu_test.ko passes, especially the following cases which
previously failed:
l -= ui_one;
__this_cpu_sub(long_counter, ui_one);
CHECK(l, long_counter, -1);
l -= ui_one;
this_cpu_sub(long_counter, ui_one);
CHECK(l, long_counter, -1);
CHECK(l, long_counter, 0xffffffffffffffff);
ul -= ui_one;
__this_cpu_sub(ulong_counter, ui_one);
CHECK(ul, ulong_counter, -1);
CHECK(ul, ulong_counter, 0xffffffffffffffff);
ul = this_cpu_sub_return(ulong_counter, ui_one);
CHECK(ul, ulong_counter, 2);
ul = __this_cpu_sub_return(ulong_counter, ui_one);
CHECK(ul, ulong_counter, 1);
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The x86 specific kvm init creates a new conflicting
debugfs directory which causes modprobe issues
with kvm_intel and kvm_amd. For example,
sudo modprobe kvm_amd
modprobe: ERROR: could not insert 'kvm_amd': Bad address
The simplest fix is to just rename the directory. The following
KVM config options are set:
CONFIG_KVM_GUEST=y
CONFIG_KVM_DEBUG_FS=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
CONFIG_KVM_DEVICE_ASSIGNMENT=y
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[Change debugfs directory name. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
OK, so what I'm actually seeing on my WSM is that sched/clock.c is
'broken' for the purpose we're using it for.
What triggered it is that my WSM-EP is broken :-(
[ 0.001000] tsc: Fast TSC calibration using PIT
[ 0.002000] tsc: Detected 2533.715 MHz processor
[ 0.500180] TSC synchronization [CPU#0 -> CPU#6]:
[ 0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
For some reason it consistently detects TSC skew, even though NHM+
should have a single clock domain for 'reasonable' systems.
This marks sched_clock_stable=0, which means that we do fancy stuff to
try and get a 'sane' clock. Part of this fancy stuff relies on the tick,
clearly that's gone when NOHZ=y. So for idle cpus time gets stuck, until
it either wakes up or gets kicked by another cpu.
While this is perfectly fine for the scheduler -- it only cares about
actually running stuff, and when we're running stuff we're obviously not
idle. This does somewhat break down for perf which can trigger events
just fine on an otherwise idle cpu.
So I've got NMIs get get 'measured' as taking ~1ms, which actually
don't last nearly that long:
<idle>-0 [013] d.h. 886.311970: rcu_nmi_enter <-do_nmi
...
<idle>-0 [013] d.h. 886.311997: perf_sample_event_took: HERE!!! : 1040990
So ftrace (which uses sched_clock(), not the fancy bits) only sees
~27us, but we measure ~1ms !!
Now since all this measurement stuff lives in x86 code, we can actually
fix it.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org
Cc: dave.hansen@linux.intel.com
Cc: eranian@google.com
Cc: Don Zickus <dzickus@redhat.com>
Cc: jmario@redhat.com
Cc: acme@infradead.org
Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull networking fixes from David Miller:
"Sorry I let so much accumulate, I was in Buffalo and wanted a few
things to cook in my tree for a while before sending to you. Anyways,
it's a lot of little things as usual at this stage in the game"
1) Make bonding MAINTAINERS entry reflect reality, from Andy
Gospodarek.
2) Fix accidental sock_put() on timewait mini sockets, from Eric
Dumazet.
3) Fix crashes in l2tp due to mis-handling of ipv4 mapped ipv6
addresses, from François CACHEREUL.
4) Fix heap overflow in __audit_sockaddr(), from the eagle eyed Dan
Carpenter.
5) tcp_shifted_skb() doesn't take handle FINs properly, from Eric
Dumazet.
6) SFC driver bug fixes from Ben Hutchings.
7) Fix TX packet scheduling wedge after channel change in ath9k driver,
from Felix Fietkau.
8) Fix user after free in BPF JIT code, from Alexei Starovoitov.
9) Source address selection test is reversed in
__ip_route_output_key(), fix from Jiri Benc.
10) VLAN and CAN layer mis-size netlink attributes, from Marc
Kleine-Budde.
11) Fix permission checks in sysctls to use current_euid() instead of
current_uid(). From Eric W Biederman.
12) IPSEC policies can go away while a timer is still pending for them,
add appropriate ref-counting to fix, from Steffen Klassert.
13) Fix mis-programming of FDR and RMCR registers on R8A7740 sh_eth
chips, from Nguyen Hong Ky and Simon Horman.
14) MLX4 forgets to DMA unmap pages on RX, fix from Amir Vadai.
15) IPV6 GRE tunnel MTU upper limit is miscalculated, from Oussama
Ghorbel.
16) Fix typo in fq_change(), we were assigning "initial quantum" to
"quantum". From Eric Dumazet.
17) Set a more appropriate sk_pacing_rate for non-TCP sockets, otherwise
FQ packet scheduler does not pace those flows properly. Also from
Eric Dumazet.
18) rtlwifi miscalculates packet pointers, from Mark Cave-Ayland.
19) l2tp_xmit_skb() can be called from process context, not just softirq
context, so we must always make sure to BH disable around it. From
Eric Dumazet.
20) On qdisc reset, we forget to purge the RB tree of SKBs in netem
packet scheduler. From Stephen Hemminger.
21) Fix info leak in farsync WAN driver ioctl() handler, from Dan
Carpenter and Salva Peiró.
22) Fix PHY reset and other issues in dm9000 driver, from Nikita
Kiryanov and Michael Abbott.
23) When hardware can do SCTP crc32 checksums, we accidently don't
disable the csum offload when IPSEC transformations have been
applied. From Fan Du and Vlad Yasevich.
24) Tail loss probing in TCP leaves the socket in the wrong congestion
avoidance state. From Yuchung Cheng.
25) In CPSW driver, enable NAPI before interrupts are turned on, from
Markus Pargmann.
26) Integer underflow and dual-assignment in YAM hamradio driver, from
Dan Carpenter.
27) If we are going to mangle a packet in tcp_set_skb_tso_segs() we must
unclone it. This fixes various hard to track down crashes in
drivers where the SKBs ->gso_segs was changing right from underneath
the driver during TX queueing. From Eric Dumazet.
28) Fix the handling of VLAN IDs, and in particular the special IDs 0
and 4095, in the bridging layer. From Toshiaki Makita.
29) Another info leak, this time in wanxl WAN driver, from Salva Peiró.
30) Fix race in socket credential passing, from Daniel Borkmann.
31) WHen NETLABEL is disabled, we don't validate CIPSO packets properly,
from Seif Mazareeb.
32) Fix identification of fragmented frames in ipv4/ipv6 UDP
Fragmentation Offload output paths, from Jiri Pirko.
33) Virtual Function fixes in bnx2x driver from Yuval Mintz and Ariel
Elior.
34) When we removed the explicit neighbour pointer from ipv6 routes a
slight regression was introduced for users such as IPVS, xt_TEE, and
raw sockets. We mix up the users requested destination address with
the routes assigned nexthop/gateway. From Julian Anastasov and
Simon Horman.
35) Fix stack overruns in rt6_probe(), the issue is that can end up
doing two full packet xmit paths at the same time when emitting
neighbour discovery messages. From Hannes Frederic Sowa.
36) davinci_emac driver doesn't handle IFF_ALLMULTI correctly, from
Mariusz Ceier.
37) Make sure to set TCP sk_pacing_rate after the first legitimate RTT
sample, from Neal Cardwell.
38) Wrong netlink attribute passed to xfrm_replay_verify_len(), from
Steffen Klassert.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
ax88179_178a: Add VID:DID for Samsung USB Ethernet Adapter
ax88179_178a: Correct the RX error definition in RX header
Revert "bridge: only expire the mdb entry when query is received"
tcp: initialize passive-side sk_pacing_rate after 3WHS
davinci_emac.c: Fix IFF_ALLMULTI setup
mac802154: correct a typo in ieee802154_alloc_device() prototype
ipv6: probe routes asynchronous in rt6_probe
netfilter: nf_conntrack: fix rt6i_gateway checks for H.323 helper
ipv6: fill rt6i_gateway with nexthop address
ipv6: always prefer rt6i_gateway if present
bnx2x: Set NETIF_F_HIGHDMA unconditionally
bnx2x: Don't pretend during register dump
bnx2x: Lock DMAE when used by statistic flow
bnx2x: Prevent null pointer dereference on error flow
bnx2x: Fix config when SR-IOV and iSCSI are enabled
bnx2x: Fix Coalescing configuration
bnx2x: Unlock VF-PF channel on MAC/VLAN config error
bnx2x: Prevent an illegal pointer dereference during panic
bnx2x: Fix Maximum CoS estimation for VFs
drivers: net: cpsw: fix kernel warn during iperf test with interrupt pacing
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Two fixlets:
- fix a (rare-config) build bug
- fix a next-gen SGI/UV hw/firmware enumeration bug"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Update UV3 hub revision ID
x86/microcode: Correct Kconfig dependencies
|
|
Pull kvm fix from Gleb Natapov.
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Enable pvspinlock after jump_label_init() to avoid VM hang
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen fixes from Stefano Stabellini:
"A small fix for Xen on x86_32 and a build fix for xen-tpmfront on
arm64"
* tag 'stable/for-linus-3.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Fix possible user space selector corruption
tpm: xen-tpmfront: fix missing declaration of xen_domain
|
|
We use jump label to enable pv-spinlock. With the changes in (442e0973e927
Merge branch 'x86/jumplabel'), the jump label behaviour has changed
that would result in eventual hang of the VM since we would end up in a
situation where slow path locks would halt the vcpus but we will not be
able to wakeup the vcpu by lock releaser using unlock kick.
Similar problem in Xen and more detailed description is available in
a945928ea270 (xen: Do not enable spinlocks before jump_label_init()
has executed)
This patch splits kvm_spinlock_init to separate jump label changes with
pvops patching and also make jump label enabling after jump_label_init().
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
|
The UV3 hub revision ID is different than expected. The first
revision was supposed to start at 1 but instead will start at 0.
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: <stable@kernel.org> # v3.9, v3.10, v3.11
Link: http://lkml.kernel.org/r/20131014161733.GA6274@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
I have a randconfig here which has enabled only
CONFIG_MICROCODE=y
CONFIG_MICROCODE_OLD_INTERFACE=y
with both
# CONFIG_MICROCODE_INTEL is not set
# CONFIG_MICROCODE_AMD is not set
off. Which makes building the microcode functionality a little
pointless. Don't do that in such cases then.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1381682189-14470-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|