summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2015-02-13workqueue: Prevent deadlock/stall on RTThomas Gleixner
Austin reported a XFS deadlock/stall on RT where scheduled work gets never exececuted and tasks are waiting for each other for ever. The underlying problem is the modification of the RT code to the handling of workers which are about to go to sleep. In mainline a worker thread which goes to sleep wakes an idle worker if there is more work to do. This happens from the guts of the schedule() function. On RT this must be outside and the accessed data structures are not protected against scheduling due to the spinlock to rtmutex conversion. So the naive solution to this was to move the code outside of the scheduler and protect the data structures by the pool lock. That approach turned out to be a little naive as we cannot call into that code when the thread blocks on a lock, as it is not allowed to block on two locks in parallel. So we dont call into the worker wakeup magic when the worker is blocked on a lock, which causes the deadlock/stall observed by Austin and Mike. Looking deeper into that worker code it turns out that the only relevant data structure which needs to be protected is the list of idle workers which can be woken up. So the solution is to protect the list manipulation operations with preempt_enable/disable pairs on RT and call unconditionally into the worker code even when the worker is blocked on a lock. The preemption protection is safe as there is nothing which can fiddle with the list outside of thread context. Reported-and_tested-by: Austin Schuh <austin@peloton-tech.com> Reported-and_tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-02-13sched: Do not clear PF_NO_SETAFFINITY flag in select_fallback_rq()Steven Rostedt
I talked with Peter Zijlstra about this, and he told me that the clearing of the PF_NO_SETAFFINITY flag was to deal with the optimization of migrate_disable/enable() that ignores tasks that have that flag set. But that optimization was removed when I did a rework of the cpu hotplug code. I found that ignoring tasks that had that flag set would cause those tasks to not sync with the hotplug code and cause the kernel to crash. Thus it needed to not treat them special and those tasks had to go though the same work as tasks without that flag set. Now that those tasks are not treated special, there's no reason to clear the flag. May still need to be tested as the migrate_me() code does not ignore those flags. Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140701111444.0cfebaa1@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13completion: Use simple wait queuesThomas Gleixner
Completions have no long lasting callbacks and therefor do not need the complex waitqueue variant. Use simple waitqueues which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Add support for lazy preemptionThomas Gleixner
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial test do not expose any observable latency increasement, but history shows that I've been proven wrong before :) The lazy preemption mode is per default on, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13cpu/rt: Rework cpu down for PREEMPT_RTSteven Rostedt
Bringing a CPU down is a pain with the PREEMPT_RT kernel because tasks can be preempted in many more places than in non-RT. In order to handle per_cpu variables, tasks may be pinned to a CPU for a while, and even sleep. But these tasks need to be off the CPU if that CPU is going down. Several synchronization methods have been tried, but when stressed they failed. This is a new approach. A sync_tsk thread is still created and tasks may still block on a lock when the CPU is going down, but how that works is a bit different. When cpu_down() starts, it will create the sync_tsk and wait on it to inform that current tasks that are pinned on the CPU are no longer pinned. But new tasks that are about to be pinned will still be allowed to do so at this time. Then the notifiers are called. Several notifiers will bring down tasks that will enter these locations. Some of these tasks will take locks of other tasks that are on the CPU. If we don't let those other tasks continue, but make them block until CPU down is done, the tasks that the notifiers are waiting on will never complete as they are waiting for the locks held by the tasks that are blocked. Thus we still let the task pin the CPU until the notifiers are done. After the notifiers run, we then make new tasks entering the pinned CPU sections grab a mutex and wait. This mutex is now a per CPU mutex in the hotplug_pcp descriptor. To help things along, a new function in the scheduler code is created called migrate_me(). This function will try to migrate the current task off the CPU this is going down if possible. When the sync_tsk is created, all tasks will then try to migrate off the CPU going down. There are several cases that this wont work, but it helps in most cases. After the notifiers are called and if a task can't migrate off but enters the pin CPU sections, it will be forced to wait on the hotplug_pcp mutex until the CPU down is complete. Then the scheduler will force the migration anyway. Also, I found that THREAD_BOUND need to also be accounted for in the pinned CPU, and the migrate_disable no longer treats them special. This helps fix issues with ksoftirqd and workqueue that unbind on CPU down. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Distangle worker accounting from rqlockThomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-clear-pf-thread-bound-on-fallback-rq.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: dont calculate hweight in update_migrate_disable()Nicholas Mc Guire
Proposal for a minor optimization in update_migrate_disable - its only a few instructions saved but those are in the hot path of locks so it might be worth it When being scheduled out while migrate_disable > 0 and migrate_disabled_updated is not yet set we end up here (kernel/sched/core.c): static inline void update_migrate_disable(struct task_struct *p) { ... mask = tsk_cpus_allowed(p); if (p->sched_class->set_cpus_allowed) p->sched_class->set_cpus_allowed(p, mask); p->nr_cpus_allowed = cpumask_weight(mask); as we only can get here if migrate_disable > 0 there is no need to calculate the cpumask_weight(mask) as tsk_cpus_allowed in that case will return cpumask_of(task_cpu(p)) which only can have a hamming weight of 1 anyway. So we can simply do: p->nr_cpus_allowed = 1; without changing the behavior. Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13sched: Have migrate_disable ignore bounded threadsPeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Do not compare cpu masks in schedulerPeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13allow preemption in recursive migrate_disable callNicholas Mc Guire
Minor cleanup in migrate_disable/migrate_enable. The recursive case does not need to disable preemption as it is "pinned" to the current cpu any way so it is safe to preempt it. Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13sched: Postpone actual migration disalbe to scheduleSteven Rostedt
The migrate_disable() can cause a bit of a overhead to the RT kernel, as changing the affinity is expensive to do at every lock encountered. As a running task can not migrate, the actual disabling of migration does not need to occur until the task is about to schedule out. In most cases, a task that disables migration will enable it before it schedules making this change improve performance tremendously. [ Frank Rowand: UP compile fix ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: teach migrate_disable about atomic contextsPeter Zijlstra
<NMI> [<ffffffff812dafd8>] spin_bug+0x94/0xa8 [<ffffffff812db07f>] do_raw_spin_lock+0x43/0xea [<ffffffff814fa9be>] _raw_spin_lock_irqsave+0x6b/0x85 [<ffffffff8106ff9e>] ? migrate_disable+0x75/0x12d [<ffffffff81078aaf>] ? pin_current_cpu+0x36/0xb0 [<ffffffff8106ff9e>] migrate_disable+0x75/0x12d [<ffffffff81115b9d>] pagefault_disable+0xe/0x1f [<ffffffff81047027>] copy_from_user_nmi+0x74/0xe6 [<ffffffff810489d7>] perf_callchain_user+0xf3/0x135 Now clearly we can't go around taking locks from NMI context, cure this by short-circuiting migrate_disable() when we're in an atomic context already. Add some extra debugging to avoid things like: preempt_disable() migrate_disable(); preempt_enable(); migrate_enable(); Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314967297.1301.14.camel@twins Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-wbot4vsmwhi8vmbf83hsclk6@git.kernel.org
2015-02-13sched, rt: Fix migrate_enable() thinkoMike Galbraith
Assigning mask = tsk_cpus_allowed(p) after p->migrate_disable = 0 ensures that we won't see a mask change.. no push/pull, we stack tasks on one CPU. Also add a couple fields to sched_debug for the next guy. [ Build fix from Stratos Psomadakis <psomas@gentoo.org> ] Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1314108763.6689.4.camel@marge.simson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Generic migrate_disablePeter Zijlstra
Make migrate_disable() be a preempt_disable() for !rt kernels. This allows generic code to use it but still enforces that these code sections stay relatively small. A preemptible migrate_disable() accessible for general use would allow people growing arbitrary per-cpu crap instead of clean these things up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-275i87sl8e1jcamtchmehonm@git.kernel.org
2015-02-13sched: Optimize migrate_disablePeter Zijlstra
Change from task_rq_lock() to raw_spin_lock(&rq->lock) to avoid a few atomic ops. See comment on why it should be safe. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-cbz6hkl5r5mvwtx5s3tor2y6@git.kernel.org
2015-02-13sched-migrate-disable.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lockSteven Rostedt
In -rt, most spin_locks() turn into mutexes. One of these spin_lock conversions is performed on the workqueue gcwq->lock. When the idle worker is worken, the first thing it will do is grab that same lock and it too will block, possibly jumping into the same code, but because nr_running would already be decremented it prevents an infinite loop. But this is still a waste of CPU cycles, and it doesn't follow the method of mainline, as new workers should only be woken when a worker thread is truly going to sleep, and not just blocked on a spin_lock(). Check the saved_state too before waking up new workers. Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13sched: ttwu: Return success when only changing the saved_state valueThomas Gleixner
When a task blocks on a rt lock, it saves the current state in p->saved_state, so a lock related wake up will not destroy the original state. When a real wakeup happens, while the task is running due to a lock wakeup already, we update p->saved_state to TASK_RUNNING, but we do not return success, which might cause another wakeup in the waitqueue code and the task remains in the waitqueue list. Return success in that case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2015-02-13sched-disable-ttwu-queue.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13cond-resched-softirq-fix.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-cond-resched.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-might-sleep-do-not-account-rcu-depth.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-rt-mutex-wakeup.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-mmdrop-delayed.patchThomas Gleixner
Needs thread context (pgd_lock) -> ifdeffed. workqueues wont work with RT Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched-limit-nr-migrate.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13hrtimer: fixup hrtimer callback changes for preempt-rtThomas Gleixner
In preempt-rt we can not call the callbacks which take sleeping locks from the timer interrupt context. Bring back the softirq split for now, until we fixed the signal delivery problem for real. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2015-02-13ptrace: fix ptrace vs tasklist_lock raceSebastian Andrzej Siewior
As explained by Alexander Fyodorov <halcy@yandex.ru>: |read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, |and it can remove __TASK_TRACED from task->state (by moving it to |task->saved_state). If parent does wait() on child followed by a sys_ptrace |call, the following race can happen: | |- child sets __TASK_TRACED in ptrace_stop() |- parent does wait() which eventually calls wait_task_stopped() and returns | child's pid |- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves | __TASK_TRACED flag to saved_state |- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13tracing: Account for preempt off in preempt_schedule()Steven Rostedt
The preempt_schedule() uses the preempt_disable_notrace() version because it can cause infinite recursion by the function tracer as the function tracer uses preempt_enable_notrace() which may call back into the preempt_schedule() code as the NEED_RESCHED is still set and the PREEMPT_ACTIVE has not been set yet. See commit: d1f74e20b5b064a130cd0743a256c2d3cfe84010 that made this change. The preemptoff and preemptirqsoff latency tracers require the first and last preempt count modifiers to enable tracing. But this skips the checks. Since we can not convert them back to the non notrace version, we can use the idle() hooks for the latency tracers here. That is, the start/stop_critical_timings() works well to manually start and stop the latency tracer for preempt off timings. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13vtime-split-lock-and-seqcount.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Consider pi boosting in setschedulerThomas Gleixner
If a PI boosted task policy/priority is modified by a setscheduler() call we unconditionally dequeue and requeue the task if it is on the runqueue even if the new priority is lower than the current effective boosted priority. This can result in undesired reordering of the priority bucket list. If the new priority is less or equal than the current effective we just store the new parameters in the task struct and leave the scheduler class and the runqueue untouched. This is handled when the task deboosts itself. Only if the new priority is higher than the effective boosted priority we apply the change immediately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
2015-02-13sched: Queue RT tasks to head when prio dropsThomas Gleixner
The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
2015-02-13sched: Adjust sched_reset_on_fork when nothing else changesThomas Gleixner
If the policy and priority remain unchanged a possible modification of sched_reset_on_fork gets lost in the early exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
2015-02-13sched: Better debug output for might sleepThomas Gleixner
might sleep can tell us where interrupts have been disabled, but we have no idea what disabled preemption. Add some debug infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Check for idle task in might_sleep()Thomas Gleixner
Idle is not allowed to call sleeping functions ever! Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13sched: Init idle->on_rq in init_idle()Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13Reset to 3.12.37Scott Wood
2014-05-14completion: Use simple wait queuesThomas Gleixner
Completions have no long lasting callbacks and therefor do not need the complex waitqueue variant. Use simple waitqueues which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: Add support for lazy preemptionThomas Gleixner
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial test do not expose any observable latency increasement, but history shows that I've been proven wrong before :) The lazy preemption mode is per default on, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14cpu/rt: Rework cpu down for PREEMPT_RTSteven Rostedt
Bringing a CPU down is a pain with the PREEMPT_RT kernel because tasks can be preempted in many more places than in non-RT. In order to handle per_cpu variables, tasks may be pinned to a CPU for a while, and even sleep. But these tasks need to be off the CPU if that CPU is going down. Several synchronization methods have been tried, but when stressed they failed. This is a new approach. A sync_tsk thread is still created and tasks may still block on a lock when the CPU is going down, but how that works is a bit different. When cpu_down() starts, it will create the sync_tsk and wait on it to inform that current tasks that are pinned on the CPU are no longer pinned. But new tasks that are about to be pinned will still be allowed to do so at this time. Then the notifiers are called. Several notifiers will bring down tasks that will enter these locations. Some of these tasks will take locks of other tasks that are on the CPU. If we don't let those other tasks continue, but make them block until CPU down is done, the tasks that the notifiers are waiting on will never complete as they are waiting for the locks held by the tasks that are blocked. Thus we still let the task pin the CPU until the notifiers are done. After the notifiers run, we then make new tasks entering the pinned CPU sections grab a mutex and wait. This mutex is now a per CPU mutex in the hotplug_pcp descriptor. To help things along, a new function in the scheduler code is created called migrate_me(). This function will try to migrate the current task off the CPU this is going down if possible. When the sync_tsk is created, all tasks will then try to migrate off the CPU going down. There are several cases that this wont work, but it helps in most cases. After the notifiers are called and if a task can't migrate off but enters the pin CPU sections, it will be forced to wait on the hotplug_pcp mutex until the CPU down is complete. Then the scheduler will force the migration anyway. Also, I found that THREAD_BOUND need to also be accounted for in the pinned CPU, and the migrate_disable no longer treats them special. This helps fix issues with ksoftirqd and workqueue that unbind on CPU down. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: Distangle worker accounting from rqlockThomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched-clear-pf-thread-bound-on-fallback-rq.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: dont calculate hweight in update_migrate_disable()Nicholas Mc Guire
Proposal for a minor optimization in update_migrate_disable - its only a few instructions saved but those are in the hot path of locks so it might be worth it When being scheduled out while migrate_disable > 0 and migrate_disabled_updated is not yet set we end up here (kernel/sched/core.c): static inline void update_migrate_disable(struct task_struct *p) { ... mask = tsk_cpus_allowed(p); if (p->sched_class->set_cpus_allowed) p->sched_class->set_cpus_allowed(p, mask); p->nr_cpus_allowed = cpumask_weight(mask); as we only can get here if migrate_disable > 0 there is no need to calculate the cpumask_weight(mask) as tsk_cpus_allowed in that case will return cpumask_of(task_cpu(p)) which only can have a hamming weight of 1 anyway. So we can simply do: p->nr_cpus_allowed = 1; without changing the behavior. Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14sched: Have migrate_disable ignore bounded threadsPeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: Do not compare cpu masks in schedulerPeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14allow preemption in recursive migrate_disable callNicholas Mc Guire
Minor cleanup in migrate_disable/migrate_enable. The recursive case does not need to disable preemption as it is "pinned" to the current cpu any way so it is safe to preempt it. Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14sched: Postpone actual migration disalbe to scheduleSteven Rostedt
The migrate_disable() can cause a bit of a overhead to the RT kernel, as changing the affinity is expensive to do at every lock encountered. As a running task can not migrate, the actual disabling of migration does not need to occur until the task is about to schedule out. In most cases, a task that disables migration will enable it before it schedules making this change improve performance tremendously. [ Frank Rowand: UP compile fix ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: teach migrate_disable about atomic contextsPeter Zijlstra
<NMI> [<ffffffff812dafd8>] spin_bug+0x94/0xa8 [<ffffffff812db07f>] do_raw_spin_lock+0x43/0xea [<ffffffff814fa9be>] _raw_spin_lock_irqsave+0x6b/0x85 [<ffffffff8106ff9e>] ? migrate_disable+0x75/0x12d [<ffffffff81078aaf>] ? pin_current_cpu+0x36/0xb0 [<ffffffff8106ff9e>] migrate_disable+0x75/0x12d [<ffffffff81115b9d>] pagefault_disable+0xe/0x1f [<ffffffff81047027>] copy_from_user_nmi+0x74/0xe6 [<ffffffff810489d7>] perf_callchain_user+0xf3/0x135 Now clearly we can't go around taking locks from NMI context, cure this by short-circuiting migrate_disable() when we're in an atomic context already. Add some extra debugging to avoid things like: preempt_disable() migrate_disable(); preempt_enable(); migrate_enable(); Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314967297.1301.14.camel@twins Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-wbot4vsmwhi8vmbf83hsclk6@git.kernel.org
2014-05-14sched, rt: Fix migrate_enable() thinkoMike Galbraith
Assigning mask = tsk_cpus_allowed(p) after p->migrate_disable = 0 ensures that we won't see a mask change.. no push/pull, we stack tasks on one CPU. Also add a couple fields to sched_debug for the next guy. [ Build fix from Stratos Psomadakis <psomas@gentoo.org> ] Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1314108763.6689.4.camel@marge.simson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14sched: Generic migrate_disablePeter Zijlstra
Make migrate_disable() be a preempt_disable() for !rt kernels. This allows generic code to use it but still enforces that these code sections stay relatively small. A preemptible migrate_disable() accessible for general use would allow people growing arbitrary per-cpu crap instead of clean these things up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-275i87sl8e1jcamtchmehonm@git.kernel.org