The inactive timer could not be raised, caused by this patch:
https://lkml.org/lkml/2014/2/28/230
This breaks users of deferred timers such as CPUfreq.
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Change-Id: I8e1b6dbcb2845c6e502d9ff50617e8c30d2da7c0
Reviewed-on: http://git.am.freescale.net:8181/34890
Tested-by: Review Code-CDREVIEW <CDREVIEW@freescale.com>
Reviewed-by: Yang Li <LeoLi@freescale.com>
Reviewed-by: Honghua Yin <Hong-Hua.Yin@freescale.com>
|
|
The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled. Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.
To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.
For ARM, this is a bit ugly due to the fact that ARM conditionally
supports seccomp filters. Fixing that would probably only be a
couple of lines of code, but it should be coordinated with the audit
maintainers.
This will be a slight slowdown on some arches. The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.
This is a prerequisite for making two-phase seccomp work cleanly.
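For illustration, the interface change amounts to roughly the following
prototypes (a sketch only; the declarations are approximated and the exact
forms in include/linux/seccomp.h may differ):
  /* Before: every arch passed the syscall number in explicitly. */
  int secure_computing(int this_syscall);

  /* After: the filter path fetches the syscall number itself, while the
   * legacy mode-1 path keeps the parameter and becomes a real function. */
  int secure_computing(void);
  void secure_computing_strict(int this_syscall);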
Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
(backported from commit a4412fc9486ec85686c6c7929e7e829f62ae377e)
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Change-Id: I4109ed2560d19349927c3e3f7648022ae23db318
Reviewed-on: http://git.am.freescale.net:8181/33029
Reviewed-by: Scott Wood <scottwood@freescale.com>
Tested-by: Honghua Yin <Hong-Hua.Yin@freescale.com>
Reviewed-by: Honghua Yin <Hong-Hua.Yin@freescale.com>
|
|
Every caller of syscall_get_arch() uses current for the task and no
implementors of the function need args. So just get rid of both of
those things. Admittedly, since these are inline functions we aren't
wasting stack space, but it just makes the prototypes better.
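Sketched as prototypes, the change is roughly the following (illustrative
only; see asm-generic/syscall.h for the real declaration):
  /* Before: callers always passed current and its registers. */
  int syscall_get_arch(struct task_struct *task, struct pt_regs *regs);

  /* After: the helper simply reports the audit arch of the current task. */
  int syscall_get_arch(void);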
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Cc: linux390@de.ibm.com
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
(backported from commit 5e937a9ae9137899c6641d718bd3820861099a09)
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Change-Id: Ifeefd84eeaa99445fdfc49ef782b01957dd67c00
Reviewed-on: http://git.am.freescale.net:8181/33023
Tested-by: Review Code-CDREVIEW <CDREVIEW@freescale.com>
Reviewed-by: Honghua Yin <Hong-Hua.Yin@freescale.com>
|
|
|
|
Signed-off-by: Scott Wood <scottwood@freescale.com>
Conflicts:
arch/arm/kvm/mmu.c
arch/arm/mm/proc-v7-3level.S
arch/powerpc/kernel/vdso32/getcpu.S
drivers/crypto/caam/error.c
drivers/crypto/caam/sg_sw_sec4.h
drivers/usb/host/ehci-fsl.c
|
|
Austin reported an XFS deadlock/stall on RT where scheduled work never
gets executed and tasks wait for each other forever.
The underlying problem is the modification of the RT code to the
handling of workers which are about to go to sleep. In mainline a
worker thread which goes to sleep wakes an idle worker if there is
more work to do. This happens from the guts of the schedule()
function. On RT this must be outside and the accessed data structures
are not protected against scheduling due to the spinlock to rtmutex
conversion. So the naive solution to this was to move the code outside
of the scheduler and protect the data structures by the pool
lock. That approach turned out to be a little naive as we cannot call
into that code when the thread blocks on a lock, as it is not allowed
to block on two locks in parallel. So we don't call into the worker
wakeup magic when the worker is blocked on a lock, which causes the
deadlock/stall observed by Austin and Mike.
Looking deeper into that worker code it turns out that the only
relevant data structure which needs to be protected is the list of
idle workers which can be woken up.
So the solution is to protect the list manipulation operations with
preempt_enable/disable pairs on RT and call unconditionally into the
worker code even when the worker is blocked on a lock. The preemption
protection is safe as there is nothing which can fiddle with the list
outside of thread context.
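On RT the protection then boils down to something like the following
helpers around the idle-list manipulation (a sketch of the approach; the
helper names are illustrative, not necessarily the exact hunk in
kernel/workqueue.c):
  /* RT: the idle worker list is only touched from thread context, so a
   * preempt-disabled region is sufficient to keep it consistent. */
  static inline void rt_lock_idle_list(struct worker_pool *pool)
  {
          preempt_disable();
  }

  static inline void rt_unlock_idle_list(struct worker_pool *pool)
  {
          preempt_enable();
  }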
Reported-and-tested-by: Austin Schuh <austin@peloton-tech.com>
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
I talked with Peter Zijlstra about this, and he told me that the clearing
of the PF_NO_SETAFFINITY flag was to deal with the optimization of
migrate_disable/enable() that ignores tasks that have that flag set. But
that optimization was removed when I did a rework of the cpu hotplug code.
I found that ignoring tasks that had that flag set would cause those tasks
to not sync with the hotplug code and cause the kernel to crash. Thus the
hotplug code needed to stop treating them specially, and those tasks had to
go through the same work as tasks without that flag set.
Now that those tasks are not treated special, there's no reason to clear the
flag.
May still need to be tested as the migrate_me() code does not ignore those
flags.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140701111444.0cfebaa1@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
switch_hrtimer_base()"
In an rt-kernel with CONFIG_MISSED_TIMER_OFFSETS_HIST enabled,
__hrtimer_start_range_ns() now crashes, as new_base is not assigned
before it is used.
Tested-by: Carsten Emde <C.Emde@osadl.org>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
This avoids a warning coming from the spin-lock debugging code. The
idea of avoiding the lock is from Steven Rostedt.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
This reverts commit ff9c870c3e27d58c9512fad122e91436681fee5a.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
The changes to move the migrate_disable() down in the trylocks()
caused race conditions to appear in the cpu hotplug code. The
migrate disables must be done before any of the rtmutexes are
taken, otherwise a lock may be held that prevents hotplug from
moving forward.
Link: http://lkml.kernel.org/r/20140429201308.63292691@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
|
|
The ntp code for notify_cmos_timer() is called from hard interrupt
context. Under PREEMPT_RT_FULL, schedule_delayed_work() takes spinlocks
that have been converted to mutexes, so calling schedule_delayed_work()
from interrupt context is not safe.
Add a helper thread that does the call to schedule_delayed_work and wake
up that thread instead of calling schedule_delayed_work() directly.
This is only for CONFIG_PREEMPT_RT_FULL, otherwise the code still calls
schedule_delayed_work() directly in irq context.
Note: There are a few places in the kernel that do this. Perhaps the RT
code should have a dedicated thread that does the checks. Just register
a notifier on boot up for your check and wake up the thread when
needed. This remains a todo.
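The pattern is roughly the following, assuming the helper is a plain
kthread living in kernel/time/ntp.c next to the existing sync_cmos_work
delayed work (the thread, flag and function names here are illustrative):
  static struct task_struct *cmos_delay_thread;
  static bool do_cmos_delay;

  static int run_cmos_delay(void *ignore)
  {
          while (!kthread_should_stop()) {
                  set_current_state(TASK_INTERRUPTIBLE);
                  if (do_cmos_delay) {
                          do_cmos_delay = false;
                          /* Thread context: taking sleeping locks is fine. */
                          schedule_delayed_work(&sync_cmos_work, 0);
                  }
                  schedule();
          }
          __set_current_state(TASK_RUNNING);
          return 0;
  }

  void notify_cmos_timer(void)
  {
          do_cmos_delay = true;
          /* Make the flag visible before waking the helper. */
          smp_wmb();
          /* wake_up_process() only takes raw locks, safe from hard irq. */
          wake_up_process(cmos_delay_thread);
  }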
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Completions have no long-lasting callbacks and therefore do not need
the complex waitqueue variant. Use simple waitqueues which reduces the
contention on the waitqueue lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merged Steven's
static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) {
- swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
+ wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
}
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
The function "swait_head_has_waiters()" was internalized into
wait-simple.c but it parallels the waitqueue_active of normal
waitqueue support. Given that there are over 150 waitqueue_active
users in drivers/ fs/ kernel/ and the like, lets make it globally
visible, and rename it to parallel the waitqueue_active accordingly.
We'll need to do this if we expect to expand its usage beyond RT.
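The exported helper is essentially the unlocked emptiness check that
mirrors waitqueue_active(); a sketch, with the new name and the
swait_head layout approximated:
  static inline bool swaitqueue_active(struct swait_head *h)
  {
          /* Unlocked peek, same ordering caveats as waitqueue_active(). */
          return !list_empty(&h->list);
  }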
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
wait_queue is a Swiss army knife, and in most cases the
complexity is not needed. For RT, waitqueues are a constant source of
trouble as we can't convert the head lock to a raw spinlock due to
fancy and long-lasting callbacks.
Provide a slim version, which allows RT to replace wait queues. This
should go mainline as well, as it lowers memory consumption and
runtime overhead.
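Conceptually the slim variant reduces to a raw lock plus a list of
waiters, along these lines (field names approximate the RT wait-simple
API and may differ from this tree):
  struct swait_head {
          raw_spinlock_t          lock;   /* raw: safe with irqs off, no callbacks */
          struct list_head        list;   /* queued waiters */
  };

  #define DEFINE_SWAIT_HEAD(name)                                 \
          struct swait_head name = {                              \
                  .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock),    \
                  .list = LIST_HEAD_INIT(name.list),              \
          }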
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
smp_mb() added by Steven Rostedt to fix a race condition with swait
wakeups vs adding items to the list.
|
|
It has become an obsession to mitigate the determinism vs. throughput
loss of RT. Looking at the mainline semantics of preemption points
gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
tasks. One major issue is the wakeup of tasks which right away
preempt the waking task while the waking task holds a lock on which
the woken task will block right after having preempted the waker. In
mainline this is prevented due to the implicit preemption disable of
spin/rw_lock held regions. On RT this is not possible due to the fully
preemptible nature of sleeping spinlocks.
However, for a SCHED_OTHER task preempting another SCHED_OTHER task this
is really not a correctness issue. RT folks are concerned about
SCHED_FIFO/RR task preemption and not about the purely fairness-driven
SCHED_OTHER preemption latencies.
So I introduced a lazy preemption mechanism which only applies to
SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
existing preempt_count, each task now sports a preempt_lazy_count
which is manipulated on lock acquisition and release. This is slightly
incorrect, as for laziness reasons I coupled this to
migrate_disable/enable, so some other mechanisms get the same treatment
(e.g. get_cpu_light).
Now on the scheduler side, instead of setting NEED_RESCHED this sets
NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
therefore allows the waking task to exit the lock-held region before
the woken task preempts it. That also works better for cross-CPU wakeups
as the other side can stay in the adaptive spinning loop.
For RT class preemption there is no change. This simply sets
NEED_RESCHED and forgoes the lazy preemption counter.
Initial tests do not expose any observable latency increase, but
history shows that I've been proven wrong before :)
The lazy preemption mode is on by default, but with
CONFIG_SCHED_DEBUG enabled it can be disabled via:
# echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features
and re-enabled via
# echo PREEMPT_LAZY >/sys/kernel/debug/sched_features
The test results so far are very machine and workload dependent, but
there is a clear trend that it enhances the non RT workload
performance.
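As an illustration of the wakeup-side decision (a sketch with a
hypothetical helper name; only TIF_NEED_RESCHED_LAZY is taken from the
description above):
  static void resched_curr_lazy_or_now(struct rq *rq, struct task_struct *wakee)
  {
          struct task_struct *curr = rq->curr;

          /* SCHED_OTHER preempting SCHED_OTHER: defer via the lazy flag so
           * the waker can leave its lock-held region first. */
          if (curr->policy == SCHED_NORMAL && wakee->policy == SCHED_NORMAL)
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
          else
                  set_tsk_need_resched(curr);     /* RT classes: unchanged */
  }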
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Running RCU out of softirq is a problem for some workloads that would
like to manage RCU core processing independently of other softirq work,
for example, setting kthread priority. This commit therefore moves the
RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread
named rcuc. The SCHED_OTHER approach avoids the scalability problems
that appeared with the earlier attempt to move RCU core processing
from softirq to kthreads. That said, kernels built with RCU_BOOST=y
will run the rcuc kthreads at the RCU-boosting priority.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
This patch removes the recursive calls to migrate_disable/enable in
local_bh_disable/enable.
The softirq-local-lock.patch introduces local_bh_disable/enable, which
increment/decrement current->softirq_nestcnt and disable/enable
migration as well. As softirq_nestcnt (include/linux/sched.h, conditioned
on CONFIG_PREEMPT_RT_BASE) already tracks the nesting level of the
recursive calls to local_bh_disable/enable (all in kernel/softirq.c),
there is no need to do it twice.
migrate_disable/enable can thus be conditioned on softirq_nestcnt, making
the 0-1 transition disable migration and the 1-0 transition re-enable it.
No change of functional behavior; this noticeably reduces the observed
nesting level of migrate_disable/enable.
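In other words, the outermost transition owns the migrate disable/enable.
A simplified sketch of what kernel/softirq.c ends up doing on RT (details
and ordering abbreviated):
  void local_bh_disable(void)
  {
          if (++current->softirq_nestcnt == 1)
                  migrate_disable();      /* only on the 0 -> 1 transition */
  }

  void local_bh_enable(void)
  {
          /* ... run softirqs raised in this context, if any ... */
          if (--current->softirq_nestcnt == 0)
                  migrate_enable();       /* only on the 1 -> 0 transition */
  }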
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
We can't rely on ksoftirqd anymore, and we need to check the tasks
which run a particular softirq; if such a task is pi-blocked, ignore
the other pending bits of that task as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Trivial API cleanup - kernel/softirq.c was mimicking local_lock.
No change of functional behavior.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
The 3.x RT series removed the split softirq implementation in favour
of pushing softirq processing into the context of the thread which
raised it. However, this prevents us from handling the various softirqs
at different priorities. Now, instead of reintroducing the split
softirq threads we split the locks which serialize the softirq
processing.
If a softirq is raised in context of a thread, then the softirq is
noted on a per thread field, if the thread is in a bh disabled
region. If the softirq is raised from hard interrupt context, then the
bit is set in the flag field of ksoftirqd and ksoftirqd is invoked.
When a thread leaves a bh disabled region, then it tries to execute
the softirqs which have been raised in its own context. It acquires
the per softirq / per cpu lock for the softirq and then checks
whether the softirq is still pending in the per cpu
local_softirq_pending() field. If yes, it runs the softirq. If no,
then some other task executed it already. This allows for zero config
softirq elevation in the context of user space tasks or interrupt
threads.
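A rough sketch of what leaving a bh disabled region then does for one
softirq (the lock/unlock and handler helper names here are illustrative,
not the exact ones in kernel/softirq.c):
  static void do_single_softirq(int nr)
  {
          lock_softirq(nr);               /* per softirq, per cpu lock */
          if (local_softirq_pending() & (1 << nr))
                  handle_softirq(nr);     /* still pending: run it here */
          /* else: some other task already executed it */
          unlock_softirq(nr);
  }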
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Split out the inner handling function, so RT can reuse it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Avoid the percpu softirq_runner pointer magic by using a task flag.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Otherwise we get a deadlock like below:
[ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003
[ 1044.042752] INFO: lockdep is turned off.
[ 1044.042754] Modules linked in:
[ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G W 3.4.0-rc2-rt3-23676-ga723175-dirty #29
[ 1044.042759] Call Trace:
[ 1044.042761] <IRQ> [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80
[ 1044.042770] [<ffffffff8168978c>] __schedule+0x83c/0xa70
[ 1044.042775] [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0
[ 1044.042779] [<ffffffff81689a5e>] schedule+0x2e/0xa0
[ 1044.042782] [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0
[ 1044.042786] [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40
[ 1044.042790] [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40
[ 1044.042794] [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50
[ 1044.042798] [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40
[ 1044.042802] [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10
[ 1044.042805] [<ffffffff8111c568>] event_sched_out+0x118/0x1d0
[ 1044.042809] [<ffffffff8111c649>] group_sched_out+0x29/0x90
[ 1044.042813] [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200
[ 1044.042817] [<ffffffff8111c343>] remote_function+0x63/0x70
[ 1044.042821] [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120
[ 1044.042826] [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40
[ 1044.042831] [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80
[ 1044.042833] <EOI> [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042840] [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70
[ 1044.042844] [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70
[ 1044.042848] [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200
[ 1044.042853] [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042857] [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0
[ 1044.042862] [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200
[ 1044.042865] [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210
[ 1044.042869] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042873] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042877] [<ffffffff8106b596>] kthread+0xb6/0xc0
[ 1044.042881] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042886] [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10
[ 1044.042889] [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110
[ 1044.042894] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042897] [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe
[ 1044.042900] [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0
[ 1044.042902] [<ffffffff8168d990>] ? gs_change+0xb/0xb
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
rwlocks and rwsems on RT do not allow multiple readers. Annotate the
lockdep acquire functions accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intends to
use migrate_enable()/migrate_disable() to replace that combination
of preempt_enable() and preempt_disable(), but in the
!CONFIG_PREEMPT_RT_FULL case migrate_enable()/migrate_disable()
are still equal to preempt_enable()/preempt_disable(). So the
following cpu_hotplug_begin()/cpu_unplug_begin(cpu) would call
schedule() and trigger schedule_debug() like this:
_cpu_down()
|
+ migrate_disable() = preempt_disable()
|
+ cpu_hotplug_begin() or cpu_unplug_begin()
|
+ schedule()
|
+ __schedule()
|
+ preempt_disable();
|
+ __schedule_bug() is true!
So we should move migrate_enable() back as in the original scheme.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
|
|
If a task which is allowed to run only on CPU X puts CPU Y down, then it
will be allowed on all CPUs but CPU Y after it comes back from the
kernel. This patch ensures that we don't lose the initial setting unless
the CPU the task is running on is going down.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
If kthread is pinned to CPUx and CPUx is going down then we get into
trouble:
- first the unplug thread is created
- it will set itself to hp->unplug. As a result, every task that is
going to take a lock has to leave the CPU.
- the CPU_DOWN_PREPARE notifiers are started. The worker thread will
start a new process for the "high priority worker".
Now the kthread would like to take a lock, but since it can't leave the CPU
it will never complete its task.
We could fire the unplug thread after the notifiers, but then the CPU is
no longer marked "online" and the unplug thread will run on CPU0, which
was fixed before :)
So instead the unplug thread is started and kept waiting until the
notifiers complete their work.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
The patch:
cpu: Make hotplug.lock a "sleeping" spinlock on RT
Tasks can block on hotplug.lock in pin_current_cpu(), but their
state might be != RUNNING. So the mutex wakeup will set the state
unconditionally to RUNNING. That might cause spurious unexpected
wakeups. We could provide a state preserving mutex_lock() function,
but this is semantically backwards. So instead we convert the
hotplug.lock() to a spinlock for RT, which has the state preserving
semantics already.
Fixed a bug where the hotplug lock on PREEMPT_RT can be taken after a
task sets its state to TASK_UNINTERRUPTIBLE and before it calls
schedule(). If the hotplug_lock used a mutex, and there was contention,
the current task's state would be turned to TASK_RUNNING and the
schedule() call would not sleep. This caused unexpected results.
Although the patch had a description of the change, the code had no
comments about it. This causes confusion to those that review the code,
and as PREEMPT_RT is held in a quilt queue and not git, it's not as easy
to see why a change was made. Even if it was in git, the code should
still have a comment for something as subtle as this.
Document the rationale for using a spinlock on PREEMPT_RT in the hotplug
lock code.
Reported-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Bringing a CPU down is a pain with the PREEMPT_RT kernel because
tasks can be preempted in many more places than in non-RT. In
order to handle per_cpu variables, tasks may be pinned to a CPU
for a while, and even sleep. But these tasks need to be off the CPU
if that CPU is going down.
Several synchronization methods have been tried, but when stressed
they failed. This is a new approach.
A sync_tsk thread is still created and tasks may still block on a
lock when the CPU is going down, but how that works is a bit different.
When cpu_down() starts, it will create the sync_tsk and wait on it to
be informed that the tasks currently pinned to the CPU are no longer
pinned. But new tasks that are about to be pinned will still be allowed
to do so at this time.
Then the notifiers are called. Several notifiers will bring down tasks
that will enter these locations. Some of these tasks will take locks
of other tasks that are on the CPU. If we don't let those other tasks
continue, but make them block until CPU down is done, the tasks that
the notifiers are waiting on will never complete as they are waiting
for the locks held by the tasks that are blocked.
Thus we still let the task pin the CPU until the notifiers are done.
After the notifiers run, we then make new tasks entering the pinned
CPU sections grab a mutex and wait. This mutex is now a per CPU mutex
in the hotplug_pcp descriptor.
To help things along, a new function in the scheduler code is created
called migrate_me(). This function will try to migrate the current task
off the CPU that is going down if possible. When the sync_tsk is created,
all tasks will then try to migrate off the CPU going down. There are
several cases where this won't work, but it helps in most cases.
After the notifiers are called, if a task can't migrate off but enters
the pinned CPU sections, it will be forced to wait on the hotplug_pcp mutex
until the CPU down is complete. Then the scheduler will force the migration
anyway.
Also, I found that THREAD_BOUND tasks also need to be accounted for in the
pinned CPU handling, and migrate_disable no longer treats them specially.
This helps fix issues with ksoftirqd and workqueue that unbind on CPU down.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Tasks can block on hotplug.lock in pin_current_cpu(), but their state
might be != RUNNING. So the mutex wakeup will set the state
unconditionally to RUNNING. That might cause spurious unexpected
wakeups. We could provide a state preserving mutex_lock() function,
but this is semantically backwards. So instead we convert the
hotplug.lock() to a spinlock for RT, which has the state preserving
semantics already.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <clark.williams@gmail.com>
Cc: stable-rt@vger.kernel.org
Link: http://lkml.kernel.org/r/1330702617.25686.265.camel@gandalf.stny.rr.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Delegate the random insertion to the forced threaded interrupt
handler. Store the return IP of the hard interrupt handler in the irq
descriptor and feed it into the random generator as a source of
entropy.
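Sketched, the flow above looks roughly like this. The helper names are
made up for illustration, the irq descriptor field is assumed to be
called random_ip, and add_interrupt_randomness() is assumed to gain an
extra ip argument as part of this change:
  /* Hard interrupt path: only note where the handler returned to. */
  static void note_interrupt_ip(struct irq_desc *desc, u64 ip)
  {
          desc->random_ip = ip;
  }

  /* Forced-threaded handler, later, running in schedulable context. */
  static void feed_entropy(struct irq_desc *desc, struct irqaction *action)
  {
          add_interrupt_randomness(action->irq, 0, desc->random_ip);
  }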
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
Add a /sys/kernel entry to indicate that the kernel is a
realtime kernel.
Clark says that he needs this for udev rules: udev needs to evaluate
whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname
output is too slow or so.
Are there better solutions? Should it exist and return 0 on !-rt?
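A minimal sketch of such an entry, assuming the attribute is simply
named "realtime" under /sys/kernel/ (the actual patch may differ):
  #include <linux/init.h>
  #include <linux/kernel.h>
  #include <linux/kobject.h>
  #include <linux/sysfs.h>

  static ssize_t realtime_show(struct kobject *kobj,
                               struct kobj_attribute *attr, char *buf)
  {
          /* Report 1 on an RT kernel, 0 otherwise. */
          return sprintf(buf, "%d\n",
                         IS_ENABLED(CONFIG_PREEMPT_RT_FULL) ? 1 : 0);
  }

  static struct kobj_attribute realtime_attr = __ATTR_RO(realtime);

  static int __init rt_sysfs_init(void)
  {
          return sysfs_create_file(kernel_kobj, &realtime_attr.attr);
  }
  late_initcall(rt_sysfs_init);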
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
On 07/27/2011 04:37 PM, Thomas Gleixner wrote:
> - KGDB (not yet disabled) is reportedly unusable on -rt right now due
> to missing hacks in the console locking which I dropped on purpose.
>
To work around this in the short term you can use this patch, in
addition to the clocksource watchdog patch that Thomas brewed up.
Comments are welcome of course. Ultimately the right solution is to
change the separation between the console and the HW to have a polled mode
+ work queue so as not to introduce any kind of latency.
Thanks,
Jason.
|
|
The lock is held with irqs off. The latency drops by 500us+ on my ARM box
with a "full" buffer after executing "dmesg" on the shell.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
irq_work is processed in softirq context on -RT because we want to avoid
long latencies which might arise from processing lots of perf events.
The noHZ-full mode requires its callback to be called from real hardirq
context (commit 76c24fb ("nohz: New APIs to re-evaluate the tick on full
dynticks CPUs")). If it is called from a thread context we might get
wrong results for checks like "is_idle_task(current)".
This patch introduces a second list (hirq_work_list) which will be used
if irq_work_run() has been invoked from hardirq context, and which processes
only work items marked with IRQ_WORK_HARD_IRQ.
This patch also removes arch_irq_work_raise() from sparc & powerpc, as
is already done for x86. At least for powerpc it is somewhat
superfluous because it is called from the timer interrupt, which should
invoke update_process_times().
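The queueing side then picks the list roughly as follows (a simplified
sketch; claiming the work item and raising the interrupt are omitted, and
only hirq_work_list/IRQ_WORK_HARD_IRQ are taken from the text above):
  static DEFINE_PER_CPU(struct llist_head, irq_work_list);   /* softirq on -RT */
  static DEFINE_PER_CPU(struct llist_head, hirq_work_list);  /* hard irq only  */

  void irq_work_queue(struct irq_work *work)
  {
          /* ... claim the work item, then: */
          if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) &&
              (work->flags & IRQ_WORK_HARD_IRQ))
                  llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list));
          else
                  llist_add(&work->llnode, this_cpu_ptr(&irq_work_list));
          /* ... arrange for the appropriate context to run the list */
  }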
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The worker accounting for cpu bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.
Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. There is also no harm from
updating nr_running when the task returns from scheduling instead of
accounting it in the wakeup code.
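The call-site change is roughly the following, a simplified sketch of
sched_submit_work(), which runs before __schedule() and outside the
runqueue lock; per the description above, wq_worker_sleeping() is assumed
to take just the task:
  static inline void sched_submit_work(struct task_struct *tsk)
  {
          if (!tsk->state)
                  return;                  /* not actually going to sleep */
          if (tsk->flags & PF_WQ_WORKER)
                  wq_worker_sleeping(tsk); /* notify before the core scheduler */
          if (blk_needs_flush_plug(tsk))
                  blk_schedule_flush_plug(tsk);
  }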
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
An Intel i7 system regularly detected rcu_preempt stalls after the kernel
was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no
longer possible, unless the system was restarted.
The kernel message was:
INFO: rcu_preempt self-detected stall on CPU { 6}
[..]
NMI backtrace for cpu 6
CPU 6
Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58
RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30
RSP: 0018:ffff880333303cb0 EFLAGS: 00000002
RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001
RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f
R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002
FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000)
Stack:
ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000
0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8
ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28
Call Trace:
<IRQ>
[<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3
[<ffffffff8124caa9>] __delay+0xf/0x11
[<ffffffff8124cad2>] __const_udelay+0x27/0x29
[<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45
[<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58
[<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d
[<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19
[<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79
[<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568
[<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35
[<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38
[<ffffffff8104f653>] update_process_times+0x44/0x55
[<ffffffff81086866>] tick_sched_handle+0x4a/0x59
[<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b
[<ffffffff81062845>] __run_hrtimer+0x9b/0x158
[<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa
[<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89
[<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80
<EOI>
[<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a
[<ffffffff81059336>] try_to_grab_pending+0x42/0x17e
[<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88
[<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e
[<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39
[<ffffffff81230985>] flush_end_io+0xf1/0x107
[<ffffffff8122e0da>] blk_finish_request+0x21e/0x264
[<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60
[<ffffffff8122e1ba>] blk_end_request+0x10/0x12
[<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492
[<ffffffff81335cec>] ? sd_done+0x298/0x2ef
[<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2
[<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f
[<ffffffff812333d3>] blk_done_softirq+0x77/0x87
[<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1
[<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a
[<ffffffff81048466>] local_bh_enable+0x43/0x72
[<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52
[<ffffffff810ab089>] irq_thread+0x8c/0x17c
[<ffffffff810ab179>] ? irq_thread+0x17c/0x17c
[<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44
[<ffffffff8105eb18>] kthread+0x8d/0x95
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
[<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
The state of softirqd of this CPU at the time of the crash was:
ksoftirqd/6 R running task 0 53 2 0x00000000
ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710
ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500
ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000
Call Trace:
[<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c
[<ffffffff814d178c>] preempt_schedule+0x61/0x76
[<ffffffff8106cccf>] migrate_enable+0xe5/0x1df
[<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c
[<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6
[<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1
[<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45
[<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308
[<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc
[<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc
[<ffffffff8105eb18>] kthread+0x8d/0x95
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
[<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
Apparently, the softirq daemon and the ata_piix IRQ handler were waiting
for each other to finish, ending up in a livelock. After the below patch
was applied, the system no longer crashes.
Reported-by: Carsten Emde <C.Emde@osadl.org>
Proposed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
There is no need for sched_rcu. The undocumented reason why sched_rcu
is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by
abusing the fact that sched_rcu reader side critical sections are also
protected by preempt or irq disabled regions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
We hit another bug that was caused by switching cpu_chill() from
msleep() to hrtimer_nanosleep().
This time it is a livelock. The problem is that hrtimer_nanosleep()
calls schedule() with the state set to TASK_INTERRUPTIBLE. But this means
that if a signal is pending, the scheduler won't schedule, and will
simply change the current task's state back to TASK_RUNNING. This
nullifies the whole point of cpu_chill() in the first place. That is,
if a task is spinning on a try_lock() and it preempted the owner of the
lock, then if it has a signal pending, it will never give up the CPU to
let the owner of the lock run.
I made a static function __hrtimer_nanosleep() that takes a fifth
parameter "state", which determines the task state that the
nanosleep() will be in. The normal hrtimer_nanosleep() will act the
same, but cpu_chill() will call the __hrtimer_nanosleep() directly with
the TASK_UNINTERRUPTIBLE state.
cpu_chill() only cares that the first sleep happens, and does not care
about the state of the restart schedule (in hrtimer_nanosleep_restart).
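The resulting cpu_chill() then looks roughly like this (a sketch; the
exact sleep length and any additional flags follow the tree this is
applied to):
  void cpu_chill(void)
  {
          struct timespec tu = {
                  .tv_nsec = NSEC_PER_MSEC,       /* chill for one millisecond */
          };

          /* Fifth parameter: sleep uninterruptibly so a pending signal
           * cannot keep the spinner hogging the CPU. */
          __hrtimer_nanosleep(&tu, NULL, HRTIMER_MODE_REL, CLOCK_MONOTONIC,
                              TASK_UNINTERRUPTIBLE);
  }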
Cc: stable-rt@vger.kernel.org
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Since we replaced msleep() with an hrtimer, I now and then (rarely) see this:
| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)
For now I see no better way but to disable the freezer during the sleep period.
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
up by the ksoftirqd running the TIMER softirq. But as cpu_chill() is
called from softirq context, it may block ksoftirqd from running, in
which case it may never wake up the msleep(), causing the deadlock.
I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call,
running on CPU 8. The one ksoftirqd that is stuck happens to be the one that
runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that
ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be
blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.
The solution is not to convert the cpu_chill() back to a cpu_relax() as that
will re-create a possible live lock that the cpu_chill() fixed earlier, and may
also leave this bug open on other softirqs. The fix is to remove the
dependency on ksoftirqd from cpu_chill(). That is, instead of calling
msleep() that requires ksoftirqd to wake it up, use the
hrtimer_nanosleep() code that does the wakeup from hard irq context.
|Looks to be the lock of the block softirq. I don't have the core dump
|anymore, but from what I could tell the ksoftirqd was blocked on the
|block softirq lock, where the block softirq handler did a msleep
|(called by the qla2xxx interrupt handler).
|
|Looking at trigger_softirq() in block/blk-softirq.c, it can do a
|smp_callfunction() to another cpu to run the block softirq. If that
|happens to be the cpu where the qla2xx irq handler is doing the block
|softirq and is in a middle of a msleep(), I believe the ksoftirqd will
|try to run the softirq. If it does that, then BOOM, it's deadlocked
|because the ksoftirqd will never run the timer softirq either.
|I should have also stated that it was only one lock that was involved.
|But the lock owner was doing a msleep() that requires a wakeup by
|ksoftirqd to continue. If ksoftirqd happens to be blocked on a lock
|held by the msleep() caller, then you have your deadlock.
|
|It's best not to have any softirqs going to sleep requiring another
|softirq to wake it up. Note, if we ever require a timer softirq to do a
|cpu_chill() it will most definitely hit this deadlock.
Cc: stable-rt@vger.kernel.org
Found-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[bigeasy: add the 4 | chapters from email]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Any callers to the function rcu_preempt_qs() must disable irqs in
order to protect the assignment to ->rcu_read_unlock_special. In the
RT case, rcu_bh_qs(), as the wrapper of rcu_preempt_qs(), is called
in some scenarios where irqs are enabled, like this path:
do_single_softirq()
|
+ local_irq_enable();
+ handle_softirq()
| |
| + rcu_bh_qs()
| |
| + rcu_preempt_qs()
|
+ local_irq_disable()
So here we'd better disable irqs directly inside of rcu_bh_qs() to
fix this, otherwise the kernel may freeze sometimes, as
observed. This way is also safe for any potential rcu_bh_qs()
usage elsewhere in the future.
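The fix then amounts to roughly this (a simplified sketch; the signature
of rcu_bh_qs() is approximated from this tree):
  void rcu_bh_qs(int cpu)
  {
          unsigned long flags;

          /* Callers may run with irqs enabled (see the path above), so
           * protect ->rcu_read_unlock_special here ourselves. */
          local_irq_save(flags);
          rcu_preempt_qs(cpu);
          local_irq_restore(flags);
  }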
Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Bin Jiang <bin.jiang@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|