Age | Commit message (Collapse) | Author |
|
Otherwise we get a deadlock like below:
[ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003
[ 1044.042752] INFO: lockdep is turned off.
[ 1044.042754] Modules linked in:
[ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G W 3.4.0-rc2-rt3-23676-ga723175-dirty #29
[ 1044.042759] Call Trace:
[ 1044.042761] <IRQ> [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80
[ 1044.042770] [<ffffffff8168978c>] __schedule+0x83c/0xa70
[ 1044.042775] [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0
[ 1044.042779] [<ffffffff81689a5e>] schedule+0x2e/0xa0
[ 1044.042782] [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0
[ 1044.042786] [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40
[ 1044.042790] [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40
[ 1044.042794] [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50
[ 1044.042798] [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40
[ 1044.042802] [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10
[ 1044.042805] [<ffffffff8111c568>] event_sched_out+0x118/0x1d0
[ 1044.042809] [<ffffffff8111c649>] group_sched_out+0x29/0x90
[ 1044.042813] [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200
[ 1044.042817] [<ffffffff8111c343>] remote_function+0x63/0x70
[ 1044.042821] [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120
[ 1044.042826] [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40
[ 1044.042831] [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80
[ 1044.042833] <EOI> [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042840] [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70
[ 1044.042844] [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70
[ 1044.042848] [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200
[ 1044.042853] [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042857] [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0
[ 1044.042862] [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200
[ 1044.042865] [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210
[ 1044.042869] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042873] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042877] [<ffffffff8106b596>] kthread+0xb6/0xc0
[ 1044.042881] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042886] [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10
[ 1044.042889] [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110
[ 1044.042894] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042897] [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe
[ 1044.042900] [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0
[ 1044.042902] [<ffffffff8168d990>] ? gs_change+0xb/0xb
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
rwlocks and rwsems on RT do not allow multiple readers. Annotate the
lockdep acquire functions accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
On -rt there is no softirq context any more and rwlock is sleepable,
disable softirq context test and rwlock+irq test.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Yong Zhang <yong.zhang@windriver.com>
Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The crypto notifier deadlocks on RT. Though this can be a real deadlock
on mainline as well due to fifo fair rwsems.
The involved parties here are:
[ 82.172678] swapper/0 S 0000000000000001 0 1 0 0x00000000
[ 82.172682] ffff88042f18fcf0 0000000000000046 ffff88042f18fc80 ffffffff81491238
[ 82.172685] 0000000000011cc0 0000000000011cc0 ffff88042f18c040 ffff88042f18ffd8
[ 82.172688] 0000000000011cc0 0000000000011cc0 ffff88042f18ffd8 0000000000011cc0
[ 82.172689] Call Trace:
[ 82.172697] [<ffffffff81491238>] ? _raw_spin_unlock_irqrestore+0x6c/0x7a
[ 82.172701] [<ffffffff8148fd3f>] schedule+0x64/0x66
[ 82.172704] [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0
[ 82.172708] [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c
[ 82.172713] [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141
[ 82.172716] [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f
[ 82.172719] [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182
[ 82.172722] [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e
[ 82.172726] [<ffffffff811debfd>] crypto_wait_for_test+0x49/0x6b
[ 82.172728] [<ffffffff811ded32>] crypto_register_alg+0x53/0x5a
[ 82.172730] [<ffffffff811ded6c>] crypto_register_algs+0x33/0x72
[ 82.172734] [<ffffffff81ad7686>] ? aes_init+0x12/0x12
[ 82.172737] [<ffffffff81ad76ea>] aesni_init+0x64/0x66
[ 82.172741] [<ffffffff81000318>] do_one_initcall+0x7f/0x13b
[ 82.172744] [<ffffffff81ac4d34>] kernel_init+0x199/0x22c
[ 82.172747] [<ffffffff81ac44ef>] ? loglevel+0x31/0x31
[ 82.172752] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[ 82.172755] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[ 82.172759] [<ffffffff81ac4b9b>] ? start_kernel+0x3ca/0x3ca
[ 82.172761] [<ffffffff814987c0>] ? gs_change+0x13/0x13
[ 82.174186] cryptomgr_test S 0000000000000001 0 41 2 0x00000000
[ 82.174189] ffff88042c971980 0000000000000046 ffffffff81d74830 0000000000000292
[ 82.174192] 0000000000011cc0 0000000000011cc0 ffff88042c96eb80 ffff88042c971fd8
[ 82.174195] 0000000000011cc0 0000000000011cc0 ffff88042c971fd8 0000000000011cc0
[ 82.174195] Call Trace:
[ 82.174198] [<ffffffff8148fd3f>] schedule+0x64/0x66
[ 82.174201] [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0
[ 82.174204] [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c
[ 82.174206] [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141
[ 82.174209] [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f
[ 82.174212] [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182
[ 82.174215] [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e
[ 82.174218] [<ffffffff811e4883>] cryptomgr_notify+0x280/0x385
[ 82.174221] [<ffffffff814943de>] notifier_call_chain+0x6b/0x98
[ 82.174224] [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12
[ 82.174227] [<ffffffff810677cd>] __blocking_notifier_call_chain+0x70/0x8d
[ 82.174230] [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16
[ 82.174234] [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50
[ 82.174236] [<ffffffff811dd7a1>] crypto_alg_mod_lookup+0x3e/0x74
[ 82.174238] [<ffffffff811dd949>] crypto_alloc_base+0x36/0x8f
[ 82.174241] [<ffffffff811e9408>] cryptd_alloc_ablkcipher+0x6e/0xb5
[ 82.174243] [<ffffffff811dd591>] ? kzalloc.clone.5+0xe/0x10
[ 82.174246] [<ffffffff8103085d>] ablk_init_common+0x1d/0x38
[ 82.174249] [<ffffffff8103852a>] ablk_ecb_init+0x15/0x17
[ 82.174251] [<ffffffff811dd8c6>] __crypto_alloc_tfm+0xc7/0x114
[ 82.174254] [<ffffffff811e0caa>] ? crypto_lookup_skcipher+0x1f/0xe4
[ 82.174256] [<ffffffff811e0dcf>] crypto_alloc_ablkcipher+0x60/0xa5
[ 82.174258] [<ffffffff811e5bde>] alg_test_skcipher+0x24/0x9b
[ 82.174261] [<ffffffff8106d96d>] ? finish_task_switch+0x3f/0xfa
[ 82.174263] [<ffffffff811e6b8e>] alg_test+0x16f/0x1d7
[ 82.174267] [<ffffffff811e45ac>] ? cryptomgr_probe+0xac/0xac
[ 82.174269] [<ffffffff811e45d8>] cryptomgr_test+0x2c/0x47
[ 82.174272] [<ffffffff81061161>] kthread+0x7e/0x86
[ 82.174275] [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa
[ 82.174278] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[ 82.174281] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[ 82.174284] [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c
[ 82.174287] [<ffffffff814987c0>] ? gs_change+0x13/0x13
[ 82.174329] cryptomgr_probe D 0000000000000002 0 47 2 0x00000000
[ 82.174332] ffff88042c991b70 0000000000000046 ffff88042c991bb0 0000000000000006
[ 82.174335] 0000000000011cc0 0000000000011cc0 ffff88042c98ed00 ffff88042c991fd8
[ 82.174338] 0000000000011cc0 0000000000011cc0 ffff88042c991fd8 0000000000011cc0
[ 82.174338] Call Trace:
[ 82.174342] [<ffffffff8148fd3f>] schedule+0x64/0x66
[ 82.174344] [<ffffffff814901ad>] __rt_mutex_slowlock+0x85/0xbe
[ 82.174347] [<ffffffff814902d2>] rt_mutex_slowlock+0xec/0x159
[ 82.174351] [<ffffffff81089c4d>] rt_mutex_fastlock.clone.8+0x29/0x2f
[ 82.174353] [<ffffffff81490372>] rt_mutex_lock+0x33/0x37
[ 82.174356] [<ffffffff8108a0f2>] __rt_down_read+0x50/0x5a
[ 82.174358] [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12
[ 82.174360] [<ffffffff8108a11c>] rt_down_read+0x10/0x12
[ 82.174363] [<ffffffff810677b5>] __blocking_notifier_call_chain+0x58/0x8d
[ 82.174366] [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16
[ 82.174369] [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50
[ 82.174372] [<ffffffff811debd6>] crypto_wait_for_test+0x22/0x6b
[ 82.174374] [<ffffffff811decd3>] crypto_register_instance+0xb4/0xc0
[ 82.174377] [<ffffffff811e9b76>] cryptd_create+0x378/0x3b6
[ 82.174379] [<ffffffff811de512>] ? __crypto_lookup_template+0x5b/0x63
[ 82.174382] [<ffffffff811e4545>] cryptomgr_probe+0x45/0xac
[ 82.174385] [<ffffffff811e4500>] ? crypto_alloc_pcomp+0x1b/0x1b
[ 82.174388] [<ffffffff81061161>] kthread+0x7e/0x86
[ 82.174391] [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa
[ 82.174394] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[ 82.174398] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[ 82.174401] [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c
[ 82.174403] [<ffffffff814987c0>] ? gs_change+0x13/0x13
cryptomgr_test spawns the cryptomgr_probe thread from the notifier
call. The probe thread fires the same notifier as the test thread and
deadlocks on the rwsem on RT.
Now this is a potential deadlock in mainline as well, because we have
fifo fair rwsems. If another thread blocks with a down_write() on the
notifier chain before the probe thread issues the down_read() it will
block the probe thread and the whole party is dead locked.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
On RT write_seqcount_begin() disables preemption and device_rename()
allocates memory with GFP_KERNEL and grabs later the sysfs_mutex
mutex. Serialize with a mutex and add use the non preemption disabling
__write_seqcount_begin().
To avoid writer starvation, let the reader grab the mutex and release
it when it detects a writer in progress. This keeps the normal case
(no reader on the fly) fast.
[ tglx: Instead of replacing the seqcount by a mutex, add the mutex ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This code triggers the new WARN in __raise_softirq_irqsoff() though it
actually looks at the softirq pending bit and calls into the softirq
code, but that fits not well with the context related softirq model of
RT. It's correct on mainline though, but going through
local_bh_disable/enable here is not going to hurt badly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The netfilter code relies only on the implicit semantics of
local_bh_disable() for serializing wt_write_recseq sections. RT breaks
that and needs explicit serialization here.
Reported-by: Peter LaDow <petela@gocougs.wsu.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
in response to the oops in ip_output.c:ip_send_unicast_reply under high
network load with CONFIG_PREEMPT_RT_FULL=y, reported by Sami Pietikainen
<Sami.Pietikainen@wapice.com>, this patch adds local serialization in
ip_send_unicast_reply.
from ip_output.c:
/*
* Generic function to send a packet as reply to another packet.
* Used to send some TCP resets/acks so far.
*
* Use a fake percpu inet socket to avoid false sharing and contention.
*/
static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = {
...
which was added in commit be9f4a44 in linux-stable. The git log, wich
introduced the PER_CPU unicast_sock, states:
<snip>
commit be9f4a44e7d41cee50ddb5f038fc2391cbbb4046
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jul 19 07:34:03 2012 +0000
ipv4: tcp: remove per net tcp_sock
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.
This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.
To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)
<snip>
The per-cpu here thus is assuming exclusivity serializing per cpu - so
the use of get_cpu_ligh introduced in
net-use-cpu-light-in-ip-send-unicast-reply.patch, which droped the
preempt_disable in favor of a migrate_disable is probably wrong as this
only handles the referencial consistency but not the serialization. To
evade a preempt_disable here a local lock would be needed.
Therapie:
* add local lock:
* and re-introduce local serialization:
Tested on x86 with high network load using the testcase from Sami Pietikainen
while : ; do wget -O - ftp://LOCAL_SERVER/empty_file > /dev/null 2>&1; done
Link: http://www.spinics.net/lists/linux-rt-users/msg11007.html
Cc: stable-rt@vger.kernel.org
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Replace it by a local lock. Though that's pretty inefficient :(
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
1)enqueue_to_backlog() (called from netif_rx) should be
bind to a particluar CPU. This can be achieved by
disabling migration. No need to disable preemption
2)Fixes crash "BUG: scheduling while atomic: ksoftirqd"
in case of RT.
If preemption is disabled, enqueue_to_backog() is called
in atomic context. And if backlog exceeds its count,
kfree_skb() is called. But in RT, kfree_skb() might
gets scheduled out, so it expects non atomic context.
3)When CONFIG_PREEMPT_RT_FULL is not defined,
migrate_enable(), migrate_disable() maps to
preempt_enable() and preempt_disable(), so no
change in functionality in case of non-RT.
-Replace preempt_enable(), preempt_disable() with
migrate_enable(), migrate_disable() respectively
-Replace get_cpu(), put_cpu() with get_cpu_light(),
put_cpu_light() respectively
Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com>
Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com>
Cc: <rostedt@goodmis.orgn>
Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
RT triggers the following:
[ 11.307652] [<ffffffff81077b27>] __might_sleep+0xe7/0x110
[ 11.307663] [<ffffffff8150e524>] rt_spin_lock+0x24/0x60
[ 11.307670] [<ffffffff8150da78>] ? rt_spin_lock_slowunlock+0x78/0x90
[ 11.307703] [<ffffffffa0272d83>] qla24xx_intr_handler+0x63/0x2d0 [qla2xxx]
[ 11.307736] [<ffffffffa0262307>] qla2x00_poll+0x67/0x90 [qla2xxx]
Function qla2x00_poll does local_irq_save() before calling qla24xx_intr_handler
which has a spinlock. Since spinlocks are sleepable on rt, it is not allowed
to call them with interrupts disabled. Therefore we use local_irq_save_nort()
instead which saves flags without disabling interrupts.
This fix needs to be applied to v3.0-rt, v3.2-rt and v3.4-rt
Suggested-by: Thomas Gleixner
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Sommerseth <davids@redhat.com>
Link: http://lkml.kernel.org/r/1335523726-10024-1-git-send-email-jkacur@redhat.com
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intends to
use migrate_enable()/migrate_disable() to replace that combination
of preempt_enable() and preempt_disable(), but actually in
!CONFIG_PREEMPT_RT_FULL case, migrate_enable()/migrate_disable()
are still equal to preempt_enable()/preempt_disable(). So that
followed cpu_hotplug_begin()/cpu_unplug_begin(cpu) would go schedule()
to trigger schedule_debug() like this:
_cpu_down()
|
+ migrate_disable() = preempt_disable()
|
+ cpu_hotplug_begin() or cpu_unplug_begin()
|
+ schedule()
|
+ __schedule()
|
+ preempt_disable();
|
+ __schedule_bug() is true!
So we should move migrate_enable() as the original scheme.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
|
|
If a task which is allowed to run only on CPU X puts CPU Y down then it
will be allowed on all CPUs but the on CPU Y after it comes back from
kernel. This patch ensures that we don't lose the initial setting unless
the CPU the task is running is going down.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
If kthread is pinned to CPUx and CPUx is going down then we get into
trouble:
- first the unplug thread is created
- it will set itself to hp->unplug. As a result, every task that is
going to take a lock, has to leave the CPU.
- the CPU_DOWN_PREPARE notifier are started. The worker thread will
start a new process for the "high priority worker".
Now kthread would like to take a lock but since it can't leave the CPU
it will never complete its task.
We could fire the unplug thread after the notifier but then the cpu is
no longer marked "online" and the unplug thread will run on CPU0 which
was fixed before :)
So instead the unplug thread is started and kept waiting until the
notfier complete their work.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
The patch:
cpu: Make hotplug.lock a "sleeping" spinlock on RT
Tasks can block on hotplug.lock in pin_current_cpu(), but their
state might be != RUNNING. So the mutex wakeup will set the state
unconditionally to RUNNING. That might cause spurious unexpected
wakeups. We could provide a state preserving mutex_lock() function,
but this is semantically backwards. So instead we convert the
hotplug.lock() to a spinlock for RT, which has the state preserving
semantics already.
Fixed a bug where the hotplug lock on PREEMPT_RT can be called after a
task set its state to TASK_UNINTERRUPTIBLE and before it called
schedule. If the hotplug_lock used a mutex, and there was contention,
the current task's state would be turned to TASK_RUNNABLE and the
schedule call will not sleep. This caused unexpected results.
Although the patch had a description of the change, the code had no
comments about it. This causes confusion to those that review the code,
and as PREEMPT_RT is held in a quilt queue and not git, it's not as easy
to see why a change was made. Even if it was in git, the code should
still have a comment for something as subtle as this.
Document the rational for using a spinlock on PREEMPT_RT in the hotplug
lock code.
Reported-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Bringing a CPU down is a pain with the PREEMPT_RT kernel because
tasks can be preempted in many more places than in non-RT. In
order to handle per_cpu variables, tasks may be pinned to a CPU
for a while, and even sleep. But these tasks need to be off the CPU
if that CPU is going down.
Several synchronization methods have been tried, but when stressed
they failed. This is a new approach.
A sync_tsk thread is still created and tasks may still block on a
lock when the CPU is going down, but how that works is a bit different.
When cpu_down() starts, it will create the sync_tsk and wait on it
to inform that current tasks that are pinned on the CPU are no longer
pinned. But new tasks that are about to be pinned will still be allowed
to do so at this time.
Then the notifiers are called. Several notifiers will bring down tasks
that will enter these locations. Some of these tasks will take locks
of other tasks that are on the CPU. If we don't let those other tasks
continue, but make them block until CPU down is done, the tasks that
the notifiers are waiting on will never complete as they are waiting
for the locks held by the tasks that are blocked.
Thus we still let the task pin the CPU until the notifiers are done.
After the notifiers run, we then make new tasks entering the pinned
CPU sections grab a mutex and wait. This mutex is now a per CPU mutex
in the hotplug_pcp descriptor.
To help things along, a new function in the scheduler code is created
called migrate_me(). This function will try to migrate the current task
off the CPU this is going down if possible. When the sync_tsk is created,
all tasks will then try to migrate off the CPU going down. There are
several cases that this wont work, but it helps in most cases.
After the notifiers are called and if a task can't migrate off but enters
the pin CPU sections, it will be forced to wait on the hotplug_pcp mutex
until the CPU down is complete. Then the scheduler will force the migration
anyway.
Also, I found that THREAD_BOUND need to also be accounted for in the
pinned CPU, and the migrate_disable no longer treats them special.
This helps fix issues with ksoftirqd and workqueue that unbind on CPU down.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Tasks can block on hotplug.lock in pin_current_cpu(), but their state
might be != RUNNING. So the mutex wakeup will set the state
unconditionally to RUNNING. That might cause spurious unexpected
wakeups. We could provide a state preserving mutex_lock() function,
but this is semantically backwards. So instead we convert the
hotplug.lock() to a spinlock for RT, which has the state preserving
semantics already.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <clark.williams@gmail.com>
Cc: stable-rt@vger.kernel.org
Link: http://lkml.kernel.org/r/1330702617.25686.265.camel@gandalf.stny.rr.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
since c2f21ce ("locking: Implement new raw_spinlock")
include/linux/spinlock.h includes spin_unlock_wait() to wait for a concurren
holder of a lock. this patch just moves over to that API. spin_unlock_wait
covers both raw_spinlock_t and spinlock_t so it should be safe here as well.
the added rt-variant of read_seqbegin in include/linux/seqlock.h that is being
modified, was introduced by patch:
seqlock-prevent-rt-starvation.patch
behavior should be unchanged.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
If a low prio writer gets preempted while holding the seqlock write
locked, a high prio reader spins forever on RT.
To prevent this let the reader grab the spinlock, so it blocks and
eventually boosts the writer. This way the writer can proceed and
endless spinning is prevented.
For seqcount writers we disable preemption over the update code
path. Thaanks to Al Viro for distangling some VFS code to make that
possible.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
Delegate the random insertion to the forced threaded interrupt
handler. Store the return IP of the hard interrupt handler in the irq
descriptor and feed it into the random generator as a source of
entropy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
We can't deal with the cpumask allocations which happen in atomic
context (see arch/x86/kernel/apic/io_apic.c) on RT right now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
We hit the following bug with 3.6-rt:
[ 5.898990] BUG: scheduling while atomic: swapper/3/0/0x00000002
[ 5.898991] no locks held by swapper/3/0.
[ 5.898993] Modules linked in:
[ 5.898996] Pid: 0, comm: swapper/3 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1
[ 5.898997] Call Trace:
[ 5.899011] [<ffffffff810804e7>] __schedule_bug+0x67/0x90
[ 5.899028] [<ffffffff81577923>] __schedule+0x793/0x7a0
[ 5.899032] [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200
[ 5.899034] [<ffffffff81577b89>] schedule+0x29/0x70
[ 5.899036] BUG: scheduling while atomic: swapper/7/0/0x00000002
[ 5.899037] no locks held by swapper/7/0.
[ 5.899039] [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0
[ 5.899040] Modules linked in:
[ 5.899041]
[ 5.899045] [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90
[ 5.899046] Pid: 0, comm: swapper/7 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1
[ 5.899047] Call Trace:
[ 5.899049] [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40
[ 5.899052] [<ffffffff810804e7>] __schedule_bug+0x67/0x90
[ 5.899054] [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80
[ 5.899056] [<ffffffff81577923>] __schedule+0x793/0x7a0
[ 5.899059] [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23
[ 5.899062] [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200
[ 5.899068] [<ffffffff8130be64>] acpi_write_bit_register+0x33/0xb0
[ 5.899071] [<ffffffff81577b89>] schedule+0x29/0x70
[ 5.899072] [<ffffffff8130be13>] ? acpi_read_bit_register+0x33/0x51
[ 5.899074] [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0
[ 5.899077] [<ffffffff8131d1fc>] acpi_idle_enter_bm+0x8a/0x28e
[ 5.899079] [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90
[ 5.899081] [<ffffffff8107e5da>] ? this_cpu_load+0x1a/0x30
[ 5.899083] [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40
[ 5.899087] [<ffffffff8144c759>] cpuidle_enter+0x19/0x20
[ 5.899088] [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80
[ 5.899090] [<ffffffff8144c777>] cpuidle_enter_state+0x17/0x50
[ 5.899092] [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23
[ 5.899094] [<ffffffff8144d1a1>] cpuidle899101] [<ffffffff8130be13>] ?
As the acpi code disables interrupts in acpi_idle_enter_bm, and calls
code that grabs the acpi lock, it causes issues as the lock is currently
in RT a sleeping lock.
The lock was converted from a raw to a sleeping lock due to some
previous issues, and tests that showed it didn't seem to matter.
Unfortunately, it did matter for one of our boxes.
This patch converts the lock back to a raw lock. I've run this code on a
few of my own machines, one being my laptop that uses the acpi quite
extensively. I've been able to suspend and resume without issues.
[ tglx: Made the change exclusive for acpi_gbl_hardware_lock ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: John Kacur <jkacur@gmail.com>
Cc: Clark Williams <clark@redhat.com>
Link: http://lkml.kernel.org/r/1360765565.23152.5.camel@gandalf.local.home
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Use the BUG_ON_NORT variant for the irq_disabled() checks. RT has
interrupts legitimately enabled here as we cant deadlock against the
irq thread due to the "sleeping spinlocks" conversion.
Reported-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Don Estabrook reported
| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100()
| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200()
| kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100()
and his backtrace showed some crypto functions which looked fine.
The problem is the following sequence:
glue_xts_crypt_128bit()
{
blkcipher_walk_virt(); /* normal migrate_disable() */
glue_fpu_begin(); /* get atomic */
while (nbytes) {
__glue_xts_crypt_128bit();
blkcipher_walk_done(); /* with nbytes = 0, migrate_enable()
* while we are atomic */
};
glue_fpu_end() /* no longer atomic */
}
and this is why the counter get out of sync and the warning is printed.
The other problem is that we are non-preemptible between
glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this,
I shorten the FPU off region and ensure blkcipher_walk_done() is called
with preemption enabled. This might hurt the performance because we now
enable/disable the FPU state more often but we gain lower latency and
the bug is gone.
Cc: stable-rt@vger.kernel.org
Reported-by: Don Estabrook <don.estabrook@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Restrict the preempt disabled regions to the actual floating point
operations and enable preemption for the administrative actions.
This is necessary on RT to avoid that kfree and other operations are
called with preemption disabled.
Reported-and-tested-by: Carsten Emde <cbe@osadl.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Current sysv sems have a weird ass wakeup scheme that involves keeping
preemption disabled over a potential O(n^2) loop and busy waiting on
that on other CPUs.
Kill this and simply wake the task directly from under the sem_lock.
This was discovered by a migrate_disable() debug feature that
disallows:
spin_lock();
preempt_disable();
spin_unlock()
preempt_enable();
Cc: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Manfred Spraul <manfred@colorfullife.com>
Link: http://lkml.kernel.org/r/1315994224.5040.1.camel@twins
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The tlb should be flushed on unmap and thus make the mapping entry
invalid. This is only done in the non-debug case which does not look
right.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
This is a copy from kmap_atomic_prot().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
In fact, with migrate_disable() existing one could play games with
kmap_atomic. You could save/restore the kmap_atomic slots on context
switch (if there are any in use of course), this should be esp easy now
that we have a kmap_atomic stack.
Something like the below.. it wants replacing all the preempt_disable()
stuff with pagefault_disable() && migrate_disable() of course, but then
you can flip kmaps around like below.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[dvhart@linux.intel.com: build fix]
Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins
[tglx@linutronix.de: Get rid of the per cpu variable and store the idx
and the pte content right away in the task struct.
Shortens the context switch code. ]
|
|
Add a /sys/kernel entry to indicate that the kernel is a
realtime kernel.
Clark says that he needs this for udev rules, udev needs to evaluate
if its a PREEMPT_RT kernel a few thousand times and parsing uname
output is too slow or so.
Are there better solutions? Should it exist and return 0 on !-rt?
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
On 07/27/2011 04:37 PM, Thomas Gleixner wrote:
> - KGDB (not yet disabled) is reportedly unusable on -rt right now due
> to missing hacks in the console locking which I dropped on purpose.
>
To work around this in the short term you can use this patch, in
addition to the clocksource watchdog patch that Thomas brewed up.
Comments are welcome of course. Ultimately the right solution is to
change separation between the console and the HW to have a polled mode
+ work queue so as not to introduce any kind of latency.
Thanks,
Jason.
|
|
There are (probably rare) situations when a system crashed and the system
console becomes unresponsive but the network icmp layer still is alive.
Wouldn't it be wonderful, if we then could submit a sysreq command via ping?
This patch provides this facility. Please consult the updated documentation
Documentation/sysrq.txt for details.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
|
|
qdisc_lock is taken w/o disabling interrupts or bottom halfs. So code
holding a qdisc_lock() can be interrupted and softirqs can run on the
return of interrupt in !RT.
The spin_trylock() in net_tx_action() makes sure, that the softirq
does not deadlock. When the lock can't be acquired q is requeued and
the NET_TX softirq is raised. That causes the softirq to run over and
over.
That works in mainline as do_softirq() has a retry loop limit and
leaves the softirq processing in the interrupt return path and
schedules ksoftirqd. The task which holds qdisc_lock cannot be
preempted, so the lock is released and either ksoftirqd or the next
softirq in the return from interrupt path can proceed. Though it's a
bit strange to actually run MAX_SOFTIRQ_RESTART (10) loops before it
decides to bail out even if it's clear in the first iteration :)
On RT all softirq processing is done in a FIFO thread and we don't
have a loop limit, so ksoftirqd preempts the lock holder forever and
unqueues and requeues until the reset button is hit.
Due to the forced threading of ksoftirqd on RT we actually cannot
deadlock on qdisc_lock because it's a "sleeping lock". So it's safe to
replace the spin_trylock() with a spin_lock(). When contended,
ksoftirqd is scheduled out and the lock holder can proceed.
[ tglx: Massaged changelog and code comments ]
Solved-by: Thomas Gleixner <tglx@linuxtronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Claudio R. Goncalves <lclaudio@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Mostly unwind is done with irqs enabled however SLUB may call it with
irqs disabled while creating a new SLUB cache.
I had system freeze while loading a module which called
kmem_cache_create() on init. That means SLUB's __slab_alloc() disabled
interrupts and then
->new_slab_objects()
->new_slab()
->setup_object()
->setup_object_debug()
->init_tracking()
->set_track()
->save_stack_trace()
->save_stack_trace_tsk()
->walk_stackframe()
->unwind_frame()
->unwind_find_idx()
=>spin_lock_irqsave(&unwind_lock);
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
RT is not too happy about the shared timer interrupt in AT91
devices. Default to tclib timer for RT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The lock is hold with irgs off. The latency drops 500us+ on my arm bugs
with a "full" buffer after executing "dmesg" on the shell.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
irq_work is processed in softirq context on -RT because we want to avoid
long latencies which might arise from processing lots of perf events.
The noHZ-full mode requires its callback to be called from real hardirq
context (commit 76c24fb ("nohz: New APIs to re-evaluate the tick on full
dynticks CPUs")). If it is called from a thread context we might get
wrong results for checks like "is_idle_task(current)".
This patch introduces a second list (hirq_work_list) which will be used
if irq_work_run() has been invoked from hardirq context and process only
work items marked with IRQ_WORK_HARD_IRQ.
This patch also removes arch_irq_work_raise() from sparc & powerpc like
it is already done for x86. Atleast for powerpc it is somehow
superfluous because it is called from the timer interrupt which should
invoke update_process_times().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|