path: root/kernel
Age  Commit message  Author
2014-12-11  of/irq: Add of_irq_parse_and_map_pci() and related functions  (Minghuan Lian)
This patch adds of_irq_parse_and_map_pci() and related functions which will be used by the Layerscape PCIe driver.

of_irq_parse_raw() and of_irq_parse_one() are copied from Linux 3.17 drivers/of/irq.c. The related commit IDs include:
7dc2e1134a22dc242175d5321c0c9e97d16eb87b
2361613206e66ce59cc0e08efa8d98ec15b84ed1
624cfca534f9b1ffb1326617b4e973a3d5ecff4a
a9ecdc0fdc54aa499604dbd43132988effcac9b4
355e62f5ad12b005c862838156262eb2df2f8dff
a7c194b007ec40a130207e9ace9cecf598fc6ac5
0c02c8007ea5554d028f99fd3e29fc201fdeeab3
530210c7814e83564c7ca7bca8192515042c0b63

of_irq_parse_pci() and of_irq_parse_and_map_pci() are copied from Linux 3.17 drivers/of/of_pci_irq.c. The related commit IDs include:
98d9f30c820d509145757e6ecbc36013aa02f7bc
2361613206e66ce59cc0e08efa8d98ec15b84ed1
16b84e5a505c790538e534ad8dfda9c288691e40
0c02c8007ea5554d028f99fd3e29fc201fdeeab3
530210c7814e83564c7ca7bca8192515042c0b63

irq_create_of_mapping_new() is copied from the Linux 3.17 kernel/irq/irqdomain.c irq_create_of_mapping(). The related commit ID is:
e6d30ab1e7d1281784672c0fc2ffa385cfb7279e

The copyright is owned by the authors of the related commits; you can find the detailed copyright information based on the commit IDs in the upstream repository.

Signed-off-by: Minghuan Lian <Minghuan.Lian@freescale.com>
Change-Id: I24bcaaea4c6cbe43229bccaceb80e74f57a9ef93
Reviewed-on: http://git.am.freescale.net:8181/19677
Tested-by: Review Code-CDREVIEW <CDREVIEW@freescale.com>
Reviewed-by: Yang Li <LeoLi@freescale.com>
Reviewed-by: Zhengxiong Jin <Jason.Jin@freescale.com>
2014-05-14  Merge branch 'rtmerge' into sdk-v1.6.x  (Scott Wood)
2014-05-14  rt: Move migrate_disable up in trylocks  (Steven Rostedt)
The changes to move the migrate_disable() down in the trylocks() caused race conditions to appear in the cpu hotplug code. The migrate disables must be done before any of the rtmutexes are taken, otherwise a lock may be held that prevents hotplug from moving forward. Link: http://lkml.kernel.org/r/20140429201308.63292691@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com>
2014-05-14  rt,ntp: Move call to schedule_delayed_work() to helper thread  (Steven Rostedt)
The ntp code's notify_cmos_timer() is called from hard interrupt context. Under PREEMPT_RT_FULL, schedule_delayed_work() takes spinlocks that have been converted to mutexes, thus calling schedule_delayed_work() from interrupt context is not safe. Add a helper thread which does the call to schedule_delayed_work(), and wake that thread up instead of calling schedule_delayed_work() directly. This is only for CONFIG_PREEMPT_RT_FULL; otherwise the code still calls schedule_delayed_work() directly in irq context.

Note: There are a few places in the kernel that do this. Perhaps the RT code should have a dedicated thread that does the checks. Just register a notifier on boot up for your check and wake up the thread when needed. This will be a todo.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
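A rough sketch of the helper-thread pattern described above (not the literal patch; the thread name "kcmosdelayd" and the do_cmos_delay flag are illustrative, and sync_cmos_work is assumed to be the existing delayed work in kernel/time/ntp.c). The hard irq path only performs a raw task wakeup, which is irq-safe; the kthread does the mutex-taking call from process context:

#ifdef CONFIG_PREEMPT_RT_FULL
#include <linux/kthread.h>
#include <linux/workqueue.h>

static struct task_struct *cmos_delay_thread;
static bool do_cmos_delay;

static int run_cmos_delay(void *ignore)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (do_cmos_delay) {
			do_cmos_delay = false;
			/* Safe here: we are in process context. */
			schedule_delayed_work(&sync_cmos_work, 0);
		}
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

/* Called from hard interrupt context. */
void notify_cmos_timer(void)
{
	do_cmos_delay = true;
	smp_wmb();		/* make the flag visible before the wakeup */
	wake_up_process(cmos_delay_thread);
}

static int __init create_cmos_delay_thread(void)
{
	cmos_delay_thread = kthread_run(run_cmos_delay, NULL, "kcmosdelayd");
	BUG_ON(IS_ERR(cmos_delay_thread));
	return 0;
}
early_initcall(create_cmos_delay_thread);
#endif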
2014-05-14  completion: Use simple wait queues  (Thomas Gleixner)
Completions have no long lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues, which reduces the contention on the waitqueue lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  rcu-more-swait-conversions.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merged Steven's:

 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 {
-	swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
+	wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
 }

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  kernel/treercu: use a simple waitqueue  (Sebastian Andrzej Siewior)
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  simple-wait: rename and export the equivalent of waitqueue_active()  (Paul Gortmaker)
The function "swait_head_has_waiters()" was internalized into wait-simple.c but it parallels the waitqueue_active of normal waitqueue support. Given that there are over 150 waitqueue_active users in drivers/ fs/ kernel/ and the like, lets make it globally visible, and rename it to parallel the waitqueue_active accordingly. We'll need to do this if we expect to expand its usage beyond RT. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  wait-simple: Rework for use with completions  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  wait-simple: Simple waitqueue implementation  (Thomas Gleixner)
wait_queue is a Swiss army knife, and in most of the cases the complexity is not needed. For RT, waitqueues are a constant source of trouble, as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. Provide a slim version which allows RT to replace wait queues. This should go mainline as well, as it lowers memory consumption and runtime overhead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

smp_mb() added by Steven Rostedt to fix a race condition with swait wakeups vs. adding items to the list.
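For orientation, the core of wait-simple is just a raw-spinlock-protected list of waiting tasks with no callback indirection, which is what allows the head lock to stay a raw spinlock on RT. A minimal sketch (names follow the patch loosely; this is not the full implementation):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/sched.h>

struct swaiter {
	struct task_struct	*task;
	struct list_head	node;
};

struct swait_head {
	raw_spinlock_t		lock;
	struct list_head	list;
};

/* Wake the first waiter, if any. No callbacks, just a task wakeup. */
static void swait_wake_one(struct swait_head *h)
{
	struct swaiter *w;
	unsigned long flags;

	raw_spin_lock_irqsave(&h->lock, flags);
	if (!list_empty(&h->list)) {
		w = list_first_entry(&h->list, struct swaiter, node);
		list_del_init(&w->node);
		wake_up_process(w->task);
	}
	raw_spin_unlock_irqrestore(&h->lock, flags);
}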
2014-05-14  sched: Add support for lazy preemption  (Thomas Gleixner)
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which immediately preempt the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented by the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption and not about the purely fairness-driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect, as for laziness reasons I coupled this on migrate_disable/enable, so some other mechanisms get the same treatment (e.g. get_cpu_light).

Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption, and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups, as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via:

# echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and reenabled via:

# echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non-RT workload performance.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
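Purely illustrative sketch of the exit-path decision described above (helper names like preempt_lazy_count() and TIF_NEED_RESCHED_LAZY follow the description; the real patch touches the scheduler core in more places): NEED_RESCHED always schedules, NEED_RESCHED_LAZY only schedules once the lazy count has dropped back to zero, i.e. after the lock-held region was left.

static bool should_resched_now(struct task_struct *t)
{
	/* RT-class preemption: unchanged, immediate. */
	if (test_tsk_need_resched(t))
		return true;

	/* SCHED_OTHER -> SCHED_OTHER: honour the flag only once the
	 * task has left its migrate-disabled / lock-held region. */
	if (test_tsk_thread_flag(t, TIF_NEED_RESCHED_LAZY))
		return preempt_lazy_count() == 0;

	return false;
}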
2014-05-14  rcu: Eliminate softirq processing from rcutree  (Paul E. McKenney)
Running RCU out of softirq is a problem for some workloads that would like to manage RCU core processing independently of other softirq work, for example, setting kthread priority. This commit therefore moves the RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread named rcuc. The SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  softirq: make migrate disable/enable conditioned on softirq_nestcnt transition  (Nicholas Mc Guire)
This patch removes the recursive calls to migrate_disable/enable in local_bh_disable/enable. The softirq-local-lock.patch introduces local_bh_disable/enable which increments/decrements current->softirq_nestcnt and disables/enables migration as well. As softirq_nestcnt (include/linux/sched.h, conditioned on CONFIG_PREEMPT_RT_BASE) already tracks the nesting level of the recursive calls to local_bh_disable/enable (all in kernel/softirq.c), there is no need to do it twice. migrate_disable/enable thus can be conditioned on softirq_nestcnt making a transition from 0 to 1 to disable migration, and from 1 to 0 to re-enable it.

No change of functional behavior; this does noticeably reduce the observed nesting level of migrate_disable/enable.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
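A sketch of the transition logic described above (simplified; the real local_bh_enable() also runs the softirqs raised in the region before dropping the count):

void local_bh_disable(void)
{
	/* Only the outermost disable pins the task to its CPU. */
	if (++current->softirq_nestcnt == 1)
		migrate_disable();
}

void local_bh_enable(void)
{
	/* ... process softirqs raised in this bh-disabled region ... */

	/* Only the outermost enable lifts the migration pin again. */
	if (--current->softirq_nestcnt == 0)
		migrate_enable();
}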
2014-05-14  softirq: Adapt NOHZ softirq pending check to new RT scheme  (Thomas Gleixner)
We can't rely on ksoftirqd anymore, and we need to check the tasks which run a particular softirq; if such a task is PI-blocked, ignore the other pending bits of that task as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  API cleanup - use local_lock not __local_lock for soft  (Nicholas Mc Guire)
Trivial API cleanup - kernel/softirq.c was mimicking local_lock. No change of functional behavior.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  softirq: Split softirq locks  (Thomas Gleixner)
The 3.x RT series removed the split softirq implementation in favour of pushing softirq processing into the context of the thread which raised it. However, this prevents us from handling the various softirqs at different priorities. Now, instead of reintroducing the split softirq threads, we split the locks which serialize the softirq processing.

If a softirq is raised in the context of a thread, then the softirq is noted in a per-thread field if the thread is in a bh-disabled region. If the softirq is raised from hard interrupt context, then the bit is set in the flag field of ksoftirqd and ksoftirqd is invoked. When a thread leaves a bh-disabled region, it tries to execute the softirqs which have been raised in its own context. It acquires the per-softirq/per-cpu lock for the softirq and then checks whether the softirq is still pending in the per-cpu local_softirq_pending() field. If yes, it runs the softirq. If no, then some other task executed it already. This allows for zero-config softirq elevation in the context of user space tasks or interrupt threads.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
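A sketch of the per-vector serialization described above, assuming one local_irq_lock per softirq vector per CPU (names are illustrative, not the exact patch):

static DEFINE_PER_CPU(struct local_irq_lock,
		      local_softirq_locks[NR_SOFTIRQS]);

static void do_single_softirq_sketch(int which)
{
	struct softirq_action *h = softirq_vec + which;

	__local_lock(&__get_cpu_var(local_softirq_locks[which]));
	/*
	 * Re-check under the lock: another task may already have
	 * executed this softirq while we were blocked on the lock.
	 */
	if (local_softirq_pending() & (1U << which)) {
		set_softirq_pending(local_softirq_pending() &
				    ~(1U << which));
		h->action(h);
	}
	__local_unlock(&__get_cpu_var(local_softirq_locks[which]));
}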
2014-05-14  softirq: Split handling function  (Thomas Gleixner)
Split out the inner handling function, so RT can reuse it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  softirq: Make serving softirqs a task flag  (Thomas Gleixner)
Avoid the percpu softirq_runner pointer magic by using a task flag. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  perf: Make swevent hrtimer run in irq instead of softirq  (Yong Zhang)
Otherwise we get a deadlock like below: [ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003 [ 1044.042752] INFO: lockdep is turned off. [ 1044.042754] Modules linked in: [ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G W 3.4.0-rc2-rt3-23676-ga723175-dirty #29 [ 1044.042759] Call Trace: [ 1044.042761] <IRQ> [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80 [ 1044.042770] [<ffffffff8168978c>] __schedule+0x83c/0xa70 [ 1044.042775] [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0 [ 1044.042779] [<ffffffff81689a5e>] schedule+0x2e/0xa0 [ 1044.042782] [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0 [ 1044.042786] [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40 [ 1044.042790] [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40 [ 1044.042794] [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50 [ 1044.042798] [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40 [ 1044.042802] [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10 [ 1044.042805] [<ffffffff8111c568>] event_sched_out+0x118/0x1d0 [ 1044.042809] [<ffffffff8111c649>] group_sched_out+0x29/0x90 [ 1044.042813] [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200 [ 1044.042817] [<ffffffff8111c343>] remote_function+0x63/0x70 [ 1044.042821] [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120 [ 1044.042826] [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40 [ 1044.042831] [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80 [ 1044.042833] <EOI> [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20 [ 1044.042840] [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70 [ 1044.042844] [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70 [ 1044.042848] [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200 [ 1044.042853] [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20 [ 1044.042857] [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0 [ 1044.042862] [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200 [ 1044.042865] [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210 [ 1044.042869] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200 [ 1044.042873] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200 [ 1044.042877] [<ffffffff8106b596>] kthread+0xb6/0xc0 [ 1044.042881] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70 [ 1044.042886] [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10 [ 1044.042889] [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110 [ 1044.042894] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70 [ 1044.042897] [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe [ 1044.042900] [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0 [ 1044.042902] [<ffffffff8168d990>] ? gs_change+0xb/0xb Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14  rt: rwsem/rwlock: lockdep annotations  (Thomas Gleixner)
rwlocks and rwsems on RT do not allow multiple readers. Annotate the lockdep acquire functions accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-14  cpu_down: move migrate_enable() back  (Tiejun Chen)
Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intended to use migrate_enable()/migrate_disable() to replace the combination of preempt_enable() and preempt_disable(). But in the !CONFIG_PREEMPT_RT_FULL case, migrate_enable()/migrate_disable() are still equal to preempt_enable()/preempt_disable(), so the following cpu_hotplug_begin()/cpu_unplug_begin(cpu) would go through schedule() and trigger schedule_debug() like this:

_cpu_down()
 |
 + migrate_disable() = preempt_disable()
 |
 + cpu_hotplug_begin() or cpu_unplug_begin()
    |
    + schedule()
       |
       + __schedule()
          |
          + preempt_disable();
          |
          + __schedule_bug() is true!

So we should move migrate_enable() back as in the original scheme.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
2014-05-14  kernel/hotplug: restore original cpu mask on cpu/down  (Sebastian Andrzej Siewior)
If a task which is allowed to run only on CPU X puts CPU Y down, then it will be allowed on all CPUs but CPU Y after it comes back from the kernel. This patch ensures that we don't lose the initial setting unless the CPU the task is running on is going down.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  kernel/cpu: fix cpu down problem if kthread's cpu is going down  (Sebastian Andrzej Siewior)
If a kthread is pinned to CPUx and CPUx is going down, then we get into trouble:
- first the unplug thread is created
- it will set itself to hp->unplug. As a result, every task that is going to take a lock has to leave the CPU.
- the CPU_DOWN_PREPARE notifiers are started. The worker thread will start a new process for the "high priority worker".
Now the kthread would like to take a lock, but since it can't leave the CPU it will never complete its task.

We could fire the unplug thread after the notifiers, but then the CPU is no longer marked "online" and the unplug thread would run on CPU0, which was fixed before :) So instead the unplug thread is started and kept waiting until the notifiers complete their work.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  cpu hotplug: Document why PREEMPT_RT uses a spinlock  (Steven Rostedt)
The patch "cpu: Make hotplug.lock a 'sleeping' spinlock on RT" states:

  Tasks can block on hotplug.lock in pin_current_cpu(), but their state might be != RUNNING. So the mutex wakeup will set the state unconditionally to RUNNING. That might cause spurious unexpected wakeups. We could provide a state preserving mutex_lock() function, but this is semantically backwards. So instead we convert the hotplug.lock() to a spinlock for RT, which has the state preserving semantics already.

It fixed a bug where the hotplug lock on PREEMPT_RT could be taken after a task set its state to TASK_UNINTERRUPTIBLE and before it called schedule. If the hotplug lock used a mutex and there was contention, the current task's state would be turned to TASK_RUNNING and the schedule call would not sleep. This caused unexpected results.

Although the patch had a description of the change, the code had no comments about it. This causes confusion to those that review the code, and as PREEMPT_RT is held in a quilt queue and not git, it's not as easy to see why a change was made. Even if it was in git, the code should still have a comment for something as subtle as this.

Document the rationale for using a spinlock on PREEMPT_RT in the hotplug lock code.

Reported-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  cpu/rt: Rework cpu down for PREEMPT_RT  (Steven Rostedt)
Bringing a CPU down is a pain with the PREEMPT_RT kernel because tasks can be preempted in many more places than in non-RT. In order to handle per_cpu variables, tasks may be pinned to a CPU for a while, and even sleep. But these tasks need to be off the CPU if that CPU is going down.

Several synchronization methods have been tried, but when stressed they failed. This is a new approach.

A sync_tsk thread is still created, and tasks may still block on a lock when the CPU is going down, but how that works is a bit different. When cpu_down() starts, it will create the sync_tsk and wait on it to inform it that the current tasks that are pinned on the CPU are no longer pinned. But new tasks that are about to be pinned will still be allowed to do so at this time.

Then the notifiers are called. Several notifiers will bring down tasks that will enter these locations. Some of these tasks will take locks of other tasks that are on the CPU. If we don't let those other tasks continue, but make them block until CPU down is done, the tasks that the notifiers are waiting on will never complete, as they are waiting for the locks held by the tasks that are blocked. Thus we still let the tasks pin the CPU until the notifiers are done.

After the notifiers run, we then make new tasks entering the pinned CPU sections grab a mutex and wait. This mutex is now a per-CPU mutex in the hotplug_pcp descriptor.

To help things along, a new function in the scheduler code is created, called migrate_me(). This function will try to migrate the current task off the CPU that is going down if possible. When the sync_tsk is created, all tasks will then try to migrate off the CPU going down. There are several cases where this won't work, but it helps in most cases.

After the notifiers are called, if a task can't migrate off but enters the pinned CPU sections, it will be forced to wait on the hotplug_pcp mutex until the CPU down is complete. Then the scheduler will force the migration anyway.

Also, I found that THREAD_BOUND needs to be accounted for in the pinned CPU as well, and migrate_disable no longer treats such threads as special. This helps fix issues with ksoftirqd and workqueue that unbind on CPU down.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
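A loose sketch of the pinning flow described above (field names in this hypothetical hotplug_pcp layout are illustrative; refcounting and the preemption-disable details of the real patch are omitted):

struct hotplug_pcp {
	struct task_struct	*unplug;	/* sync_tsk, set at down start */
	int			grab_lock;	/* set once notifiers are done */
	int			refcount;	/* tasks currently pinned here */
	struct mutex		mutex;		/* per-CPU block point */
};

/* Caller runs with preemption disabled on this CPU. */
void pin_current_cpu_sketch(struct hotplug_pcp *hp)
{
again:
	if (!hp->unplug || hp->unplug == current) {
		hp->refcount++;			/* no down in progress */
		return;
	}
	if (!hp->grab_lock) {
		hp->refcount++;			/* notifiers still running */
		return;
	}
	migrate_me();				/* try to leave this CPU */
	mutex_lock(&hp->mutex);			/* otherwise wait for down */
	mutex_unlock(&hp->mutex);
	goto again;
}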
2014-05-14  cpu: Make hotplug.lock a "sleeping" spinlock on RT  (Steven Rostedt)
Tasks can block on hotplug.lock in pin_current_cpu(), but their state might be != RUNNING. So the mutex wakeup will set the state unconditionally to RUNNING. That might cause spurious unexpected wakeups. We could provide a state preserving mutex_lock() function, but this is semantically backwards. So instead we convert the hotplug.lock() to a spinlock for RT, which has the state preserving semantics already. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Carsten Emde <C.Emde@osadl.org> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <clark.williams@gmail.com> Cc: stable-rt@vger.kernel.org Link: http://lkml.kernel.org/r/1330702617.25686.265.camel@gandalf.stny.rr.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  random: Make it work on rt  (Thomas Gleixner)
Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-14  add /sys/kernel/realtime entry  (Clark Williams)
Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules: udev needs to evaluate whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname output is too slow or so.

Are there better solutions? Should it exist and return 0 on !-rt?

Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
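The entry is essentially a one-value read-only kobject attribute in kernel/ksysfs.c, along these lines (the attribute also has to be added to the kernel attribute group; that hookup is omitted here):

#ifdef CONFIG_PREEMPT_RT_FULL
static ssize_t realtime_show(struct kobject *kobj,
			     struct kobj_attribute *attr, char *buf)
{
	/* Always 1: the file only exists on an RT kernel. */
	return sprintf(buf, "%d\n", 1);
}
KERNEL_ATTR_RO(realtime);
#endif

A udev rule can then simply test for the file's presence or read the single digit instead of parsing uname output.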
2014-05-14  kgdb/serial: Short term workaround  (Jason Wessel)
On 07/27/2011 04:37 PM, Thomas Gleixner wrote:
> - KGDB (not yet disabled) is reportedly unusable on -rt right now due
>   to missing hacks in the console locking which I dropped on purpose.

To work around this in the short term you can use this patch, in addition to the clocksource watchdog patch that Thomas brewed up. Comments are welcome of course. Ultimately the right solution is to change the separation between the console and the HW to have a polled mode + work queue so as not to introduce any kind of latency.

Thanks, Jason.
2014-05-14  HACK: printk: drop the logbuf_lock more often  (Sebastian Andrzej Siewior)
The lock is held with irqs off. The latency drops 500us+ on my ARM box with a "full" buffer after executing "dmesg" on the shell.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  printk-rt-aware.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  irq_work: allow certain work in hard irq context  (Sebastian Andrzej Siewior)
irq_work is processed in softirq context on -RT because we want to avoid long latencies which might arise from processing lots of perf events. The noHZ-full mode requires its callback to be called from real hardirq context (commit 76c24fb ("nohz: New APIs to re-evaluate the tick on full dynticks CPUs")). If it is called from a thread context we might get wrong results for checks like "is_idle_task(current)".

This patch introduces a second list (hirq_work_list) which will be used if irq_work_run() has been invoked from hardirq context, and processes only work items marked with IRQ_WORK_HARD_IRQ.

This patch also removes arch_irq_work_raise() from sparc & powerpc like it is already done for x86. At least for powerpc it is somewhat superfluous because it is called from the timer interrupt, which should invoke update_process_times().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
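A sketch of the queueing split described above (hirq_work_list is from the patch; irq_work_list stands in for the existing per-CPU list, which on -RT is processed from the timer softirq):

static DEFINE_PER_CPU(struct llist_head, hirq_work_list);

static void irq_work_queue_sketch(struct irq_work *work)
{
	if (work->flags & IRQ_WORK_HARD_IRQ)
		/* Run from real hardirq context, e.g. the nohz-full
		 * tick re-evaluation callback. */
		llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list));
	else
		/* Deferred to softirq context on PREEMPT_RT_FULL to
		 * keep perf-event bursts out of hard irq. */
		llist_add(&work->llnode, this_cpu_ptr(&irq_work_list));
}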
2014-05-14  x86-no-perf-irq-work-rt.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  sched: Disentangle worker accounting from rqlock  (Thomas Gleixner)
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  workqueue vs ata-piix livelock fixup  (Thomas Gleixner)
An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? 
__kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65

Apparently, the softirq daemon and the ata_piix IRQ handler were waiting for each other to finish, ending up in a livelock. After the patch below was applied, the system no longer crashes.

Reported-by: Carsten Emde <C.Emde@osadl.org>
Proposed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  Use local irq lock instead of irq disable regions  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  workqueue: Use normal rcu  (Thomas Gleixner)
There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  cpu_chill: Add a UNINTERRUPTIBLE hrtimer_nanosleep  (Steven Rostedt)
We hit another bug that was caused by switching cpu_chill() from msleep() to hrtimer_nanosleep(). This time it is a livelock. The problem is that hrtimer_nanosleep() calls schedule with the state == TASK_INTERRUPTIBLE. But this means that if a signal is pending, the scheduler won't schedule, and will simply change the current task state back to TASK_RUNNING. This nullifies the whole point of cpu_chill() in the first place. That is, if a task is spinning on a try_lock() and it preempted the owner of the lock, if it has a signal pending, it will never give up the CPU to let the owner of the lock run.

I made a static function __hrtimer_nanosleep() that takes a fifth parameter "state", which determines the task state that the nanosleep() will sleep in. The normal hrtimer_nanosleep() will act the same, but cpu_chill() will call __hrtimer_nanosleep() directly with the TASK_UNINTERRUPTIBLE state.

cpu_chill() only cares that the first sleep happens, and does not care about the state of the restart schedule (in hrtimer_nanosleep_restart).

Cc: stable-rt@vger.kernel.org
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
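A sketch of how the state parameter threads through (simplified; the real patch also plumbs state through __hrtimer_nanosleep() and the restart path):

static int __sched do_nanosleep(struct hrtimer_sleeper *t,
				enum hrtimer_mode mode, unsigned long state)
{
	hrtimer_init_sleeper(t, current);
	do {
		set_current_state(state);	/* was: TASK_INTERRUPTIBLE */
		hrtimer_start_expires(&t->timer, mode);
		if (likely(t->task))
			schedule();
		hrtimer_cancel(&t->timer);
		mode = HRTIMER_MODE_ABS;
	} while (t->task && !signal_pending(current));
	__set_current_state(TASK_RUNNING);
	return t->task == NULL;
}

void cpu_chill(void)
{
	struct timespec tu = {
		.tv_nsec = NSEC_PER_MSEC,
	};

	/* UNINTERRUPTIBLE: a pending signal must not abort the chill. */
	__hrtimer_nanosleep(&tu, NULL, HRTIMER_MODE_REL, CLOCK_MONOTONIC,
			    TASK_UNINTERRUPTIBLE);
}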
2014-05-14  kernel/hrtimer: be non-freezable in cpu_chill()  (Sebastian Andrzej Siewior)
Since we replaced msleep() by hrtimer I see now and then (rarely) this:

| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)

For now I see no better way than to disable the freezer for the sleep period.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  rt: Make cpu_chill() use hrtimer instead of msleep()  (Steven Rostedt)
Ulrich Obergfell pointed out that cpu_chill() calls msleep(), which is woken up by the ksoftirqd running the TIMER softirq. But as cpu_chill() is called from softirq context, it may block ksoftirqd from running, in which case it may never wake up the msleep(), causing the deadlock.

I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call, running on CPU 8. The one ksoftirqd that is stuck happens to be the one that runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.

The solution is not to convert the cpu_chill() back to a cpu_relax(), as that will re-create a possible live lock that the cpu_chill() fixed earlier, and may also leave this bug open on other softirqs. The fix is to remove the dependency on ksoftirqd from cpu_chill(). That is, instead of calling msleep() that requires ksoftirqd to wake it up, use the hrtimer_nanosleep() code that does the wakeup from hard irq context.

|Looks to be the lock of the block softirq. I don't have the core dump
|anymore, but from what I could tell the ksoftirqd was blocked on the
|block softirq lock, where the block softirq handler did a msleep
|(called by the qla2xxx interrupt handler).
|
|Looking at trigger_softirq() in block/blk-softirq.c, it can do a
|smp_callfunction() to another cpu to run the block softirq. If that
|happens to be the cpu where the qla2xx irq handler is doing the block
|softirq and is in a middle of a msleep(), I believe the ksoftirqd will
|try to run the softirq. If it does that, then BOOM, it's deadlocked
|because the ksoftirqd will never run the timer softirq either.
|
|I should have also stated that it was only one lock that was involved.
|But the lock owner was doing a msleep() that requires a wakeup by
|ksoftirqd to continue. If ksoftirqd happens to be blocked on a lock
|held by the msleep() caller, then you have your deadlock.
|
|It's best not to have any softirqs going to sleep requiring another
|softirq to wake it up. Note, if we ever require a timer softirq to do a
|cpu_chill() it will most definitely hit this deadlock.

Cc: stable-rt@vger.kernel.org
Found-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[bigeasy: add the 4 | chapters from email]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  lglocks-rt.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  rcutree/rcu_bh_qs: disable irq while calling rcu_preempt_qs()  (Tiejun Chen)
Any caller of the function rcu_preempt_qs() must disable irqs in order to protect the assignment to ->rcu_read_unlock_special. In the RT case, rcu_bh_qs(), as the wrapper of rcu_preempt_qs(), is called in some scenarios where irqs are enabled, like this path:

do_single_softirq()
 |
 + local_irq_enable();
 + handle_softirq()
 |    |
 |    + rcu_bh_qs()
 |         |
 |         + rcu_preempt_qs()
 |
 + local_irq_disable()

So here we'd better disable irqs directly inside of rcu_bh_qs() to fix this; otherwise the kernel may freeze sometimes, as observed. This way is also safe for potential rcu_bh_qs() usage elsewhere in the future.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Bin Jiang <bin.jiang@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
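The fix is small enough to sketch in full; it amounts to wrapping the call in an irq-disabled section:

void rcu_bh_qs(int cpu)
{
	unsigned long flags;

	/* rcu_preempt_qs() requires irqs off: it writes
	 * ->rcu_read_unlock_special. */
	local_irq_save(flags);
	rcu_preempt_qs(cpu);
	local_irq_restore(flags);
}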
2014-05-14  rcu: Make ksoftirqd do RCU quiescent states  (Paul E. McKenney)
Implementing RCU-bh in terms of RCU-preempt makes the system vulnerable to network-based denial-of-service attacks. This patch therefore makes __do_softirq() invoke rcu_bh_qs(), but only when __do_softirq() is running in ksoftirqd context. A wrapper layer is interposed so that other calls to __do_softirq() avoid invoking rcu_bh_qs(). The underlying function __do_softirq_common() does the actual work.

The reason that rcu_bh_qs() is bad in these non-ksoftirqd contexts is that there might be a local_bh_enable() inside an RCU-preempt read-side critical section. This local_bh_enable() can invoke __do_softirq() directly, so if __do_softirq() were to invoke rcu_bh_qs() (which just calls rcu_preempt_qs() in the PREEMPT_RT_FULL case), there would be an illegal RCU-preempt quiescent state in the middle of an RCU-preempt read-side critical section. Therefore, quiescent states can only happen in cases where __do_softirq() is invoked directly from ksoftirqd.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20111005184518.GA21601@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  rcu-more-fallout.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14  rcu: Merge RCU-bh into RCU-preempt  (Thomas Gleixner)
The Linux kernel has long RCU-bh read-side critical sections that intolerably increase scheduling latency under mainline's RCU-bh rules, which include RCU-bh read-side critical sections being non-preemptible. This patch therefore arranges for RCU-bh to be implemented in terms of RCU-preempt for CONFIG_PREEMPT_RT_FULL=y. This has the downside of defeating the purpose of RCU-bh, namely, handling the case where the system is subjected to a network-based denial-of-service attack that keeps at least one CPU doing full-time softirq processing. This issue will be fixed by a later commit. The current commit will need some work to make it appropriate for mainline use, for example, it needs to be extended to cover Tiny RCU. [ paulmck: Added a useful changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111005185938.GA20403@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
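Conceptually, the flavor merge boils down to aliasing the RCU-bh update-side API onto RCU-preempt when CONFIG_PREEMPT_RT_FULL=y, roughly along these lines (a sketch of the idea, not the exact header diff):

#ifdef CONFIG_PREEMPT_RT_FULL
/* RCU-bh is implemented in terms of RCU-preempt on RT. */
#define call_rcu_bh		call_rcu
#define synchronize_rcu_bh	synchronize_rcu
#define rcu_barrier_bh		rcu_barrier

static inline void rcu_read_lock_bh(void)
{
	local_bh_disable();
	rcu_read_lock();	/* preemptible read-side section */
}

static inline void rcu_read_unlock_bh(void)
{
	rcu_read_unlock();
	local_bh_enable();
}
#endif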
2014-05-14  rcu: Frob softirq test  (Peter Zijlstra)
With RT_FULL we get the below wreckage: [ 126.060484] ======================================================= [ 126.060486] [ INFO: possible circular locking dependency detected ] [ 126.060489] 3.0.1-rt10+ #30 [ 126.060490] ------------------------------------------------------- [ 126.060492] irq/24-eth0/1235 is trying to acquire lock: [ 126.060495] (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060503] [ 126.060504] but task is already holding lock: [ 126.060506] (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060511] [ 126.060511] which lock already depends on the new lock. [ 126.060513] [ 126.060514] [ 126.060514] the existing dependency chain (in reverse order) is: [ 126.060516] [ 126.060516] -> #1 (&p->pi_lock){-...-.}: [ 126.060519] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060524] [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85 [ 126.060527] [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f [ 126.060531] [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a [ 126.060534] [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f [ 126.060537] [<ffffffff810d9020>] rcu_boost+0xad/0xde [ 126.060541] [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b [ 126.060544] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060547] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060551] [ 126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}: [ 126.060555] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060558] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060561] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060564] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060566] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060569] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060573] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060576] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060580] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060583] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060585] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060590] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060593] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060597] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060600] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060603] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060606] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060608] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060611] [ 126.060612] other info that might help us debug this: [ 126.060614] [ 126.060615] Possible unsafe locking scenario: [ 126.060616] [ 126.060617] CPU0 CPU1 [ 126.060619] ---- ---- [ 126.060620] lock(&p->pi_lock); [ 126.060623] lock(&(lock)->wait_lock); [ 126.060625] lock(&p->pi_lock); [ 126.060627] lock(&(lock)->wait_lock); [ 126.060629] [ 126.060629] *** DEADLOCK *** [ 126.060630] [ 126.060632] 1 lock held by irq/24-eth0/1235: [ 126.060633] #0: (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060638] [ 126.060638] stack backtrace: [ 126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30 [ 126.060643] Call Trace: [ 126.060644] <IRQ> [<ffffffff810acbde>] print_circular_bug+0x289/0x29a [ 126.060651] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060655] [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99 [ 126.060658] [<ffffffff81501c81>] ? 
rt_mutex_slowunlock+0x16/0x55 [ 126.060661] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060664] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060668] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060671] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060674] [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c [ 126.060677] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060680] [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4 [ 126.060683] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060687] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060690] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060693] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060696] [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5 [ 126.060701] [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90 [ 126.060704] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060708] [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21 [ 126.060711] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060715] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060718] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060721] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060724] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060726] <EOI> [<ffffffff81072855>] ? migrate_disable+0x75/0x12d [ 126.060733] [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f [ 126.060736] [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f [ 126.060739] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060742] [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59 [ 126.060745] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060748] [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a [ 126.060751] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060754] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060757] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060761] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060764] [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a [ 126.060768] [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe [ 126.060771] [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c [ 126.060774] [<ffffffff81509b10>] ? gs_change+0xb/0xb

Because irq_exit() does:

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();
	...
}

which triggers a wakeup, which uses RCU. Now if the interrupted task has t->rcu_read_unlock_special set, the RCU usage from the wakeup will end up in rcu_read_unlock_special(). rcu_read_unlock_special() will test for in_irq(), which will fail as we just decremented preempt_count with IRQ_EXIT_OFFSET, and in_serving_softirq(), which for PREEMPT_RT_FULL reads:

int in_serving_softirq(void)
{
	int res;

	preempt_disable();
	res = __get_cpu_var(local_softirq_runner) == current;
	preempt_enable();
	return res;
}

which will thus also fail, resulting in the above wreckage.

The 'somewhat' ugly solution is to open-code the preempt_count() test in rcu_read_unlock_special().

Also, we're not at all sure how ->rcu_read_unlock_special gets set here... so this is very likely a bandaid and more thought is required.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2014-05-14  rtmutex: use a trylock for waiter lock in trylock  (Sebastian Andrzej Siewior)
Mike Galbraith captured the following:

| >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock, not the lock itself. This patch takes the waiter_lock with trylock so it should work from interrupt context as well. If the fastpath doesn't work and the waiter_lock itself is taken, then it seems that the lock itself is taken.

This patch also adds rt_spin_unlock_after_trylock_in_irq to keep lockdep happy. If we managed to take the wait_lock in the first place, we should also be able to take it in the unlock path.

Cc: stable-rt@vger.kernel.org
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
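The core of the change can be sketched like this (simplified; try_to_take_rt_mutex() is the existing rtmutex helper):

int rt_mutex_trylock_sketch(struct rt_mutex *lock)
{
	int ret;

	/*
	 * Trylock the waiter lock instead of spinning on it: if it is
	 * held we may be in irq context and the mutex is effectively
	 * contended anyway, so reporting failure is both safe and
	 * correct, and avoids the deadlock shown above.
	 */
	if (!raw_spin_trylock(&lock->wait_lock))
		return 0;

	ret = try_to_take_rt_mutex(lock, current, NULL);
	raw_spin_unlock(&lock->wait_lock);

	return ret;
}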
2014-05-14  timer/rt: Always raise the softirq if there's irq_work to be done  (Steven Rostedt)
It was previously discovered that some systems would hang on boot up with an earlier version of 3.12-rt. This was due to RCU using irq_work, and RT deferring the irq_work to a softirq. But if there are no active timers, the softirq will not be raised, and RCU work will not get done, causing the system to hang. The fix was to check that if there were no active timers but irq_work to be done, then we should raise the softirq. But this fix was not 100% correct. It left out the case that there were active timers that had not expired yet. This would have the softirq not get raised even if there was irq work to be done.

If there is irq_work to be done, then we must raise the timer softirq regardless of whether there are active timers and whether they have expired or not. The softirq can handle those cases. But we can never ignore irq_work.

As it is only PREEMPT_RT_FULL that requires irq_work to be done in the softirq, we can pull the check out of the active_timers condition and make the code a bit cleaner by having the irq_work check separate, putting that code in with the other #ifdef PREEMPT_RT. If there is irq_work to be done, there's no need to check the active timers or whether they have expired. Just raise the timer softirq and be done with it. Otherwise, we can do the timer checks just like we do with non-rt.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
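The resulting structure of the tick-side check looks roughly like this (a sketch; timer_wheel_has_expired_timers() is a hypothetical stand-in for the actual expired-timer test on the per-CPU timer base):

void run_local_timers_sketch(void)
{
	hrtimer_run_queues();

#ifdef CONFIG_PREEMPT_RT_FULL
	/*
	 * irq_work can never be ignored, expired timers or not:
	 * on RT it is run from the timer softirq.
	 */
	if (irq_work_needs_cpu()) {
		raise_softirq(TIMER_SOFTIRQ);
		return;
	}
#endif
	/* Non-RT behaviour: only raise when timers can have expired. */
	if (timer_wheel_has_expired_timers())	/* hypothetical helper */
		raise_softirq(TIMER_SOFTIRQ);
}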
2014-05-14  timer: Raise softirq if there's irq_work  (Steven Rostedt)
[ Talking with Sebastian on IRC, it seems that doing the irq_work_run() from the interrupt in -rt is a bad thing. Here we simply raise the softirq if there's irq work to do. This too boots on my i7 ] After trying hard to figure out why my i7 box was locking up with the new active_timers code, that does not run the timer softirq if there are no active timers, I took an extra look at the softirq handler and noticed that it doesn't just run timer softirqs, it also runs irq work. This was the bug that was locking up the system. It wasn't missing a timer, it was missing irq work. By always doing the irq work callbacks, the system boots fine. The missing irq work callback was the RCU's sp_wakeup() function. No need to check for defined(CONFIG_IRQ_WORK). When that's not set the "irq_work_needs_cpu()" is a static inline that returns false. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14  timers: do not raise softirq unconditionally  (Thomas Gleixner)
Mike,

On Thu, 7 Nov 2013, Mike Galbraith wrote:
> On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote:
> > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote:
> > > I bet you are trying to work around some of the side effects of the
> > > occasional tick which is still necessary despite of full nohz, right?
> >
> > Nope, I wanted to check out cost of nohz_full for rt, and found that it
> > doesn't work at all instead, looked, and found that the sole running
> > task has just awakened ksoftirqd when it wants to shut the tick down, so
> > that shutdown never happens.
>
> Like so in virgin 3.10-rt. Box is x3550 M3 booted nowatchdog
> rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
> cpusets as well.

well, that very same problem is in mainline if you add "threadirqs" to the command line. But we can be smart about this. The untested patch below should address that issue. If that works on mainline we can adapt it for RT (needs a trylock(&base->lock) there).

Though it's not a full solution. It needs some thought versus the softirq code of timers. Assume we have only one timer queued 1000 ticks into the future. So this change will cause the timer softirq not to be called until that timer expires, and then the timer softirq is going to do 1000 loops until it catches up with jiffies. That's anything but pretty ...

What worries me more is this one:

pert-5229 [003] d..h1.. 684.482618: softirq_raise: vec=9 [action=RCU]

The CPU has no callbacks as you shoved them over to cpu 0, so why is the RCU softirq raised?

Thanks, tglx

------------------
Message-id: <alpine.DEB.2.02.1311071158350.23353@ionos.tec.linutronix.de>
|CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>