path: root/kernel/sched
Age | Commit message | Author
2014-04-10 | completion: Use simple wait queues | Thomas Gleixner
Completions have no long-lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues, which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Add support for lazy preemption | Thomas Gleixner
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which right away preempt the waking task while the waking task holds a lock on which the woken task will block right after having preempted the waker. In mainline this is prevented by the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. However, for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption and not about the purely fairness-driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect, as for laziness reasons I coupled this to migrate_disable/enable, so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups, as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via:

  # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and re-enabled via:

  # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine- and workload-dependent, but there is a clear trend that it enhances the non-RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
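The wakeup-side decision this describes could look roughly like the following sketch; the helper and flag names (check_preempt_sketch(), TIF_NEED_RESCHED_LAZY) follow the description above rather than the actual patch.

  /* Sketch: pick lazy vs. immediate rescheduling at wakeup time. */
  static void check_preempt_sketch(struct rq *rq, struct task_struct *p)
  {
          struct task_struct *curr = rq->curr;

          if (!task_has_rt_policy(p) && !task_has_rt_policy(curr)) {
                  /*
                   * SCHED_OTHER preempts SCHED_OTHER: only flag a lazy
                   * reschedule so the waker can leave its lock-held
                   * region before the preemption takes effect.
                   */
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
          } else {
                  /* RT class preemption is unchanged: reschedule now. */
                  set_tsk_need_resched(curr);
          }
  }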
2014-04-10 | cpu/rt: Rework cpu down for PREEMPT_RT | Steven Rostedt
Bringing a CPU down is a pain with the PREEMPT_RT kernel because tasks can be preempted in many more places than in non-RT. In order to handle per_cpu variables, tasks may be pinned to a CPU for a while, and even sleep. But these tasks need to be off the CPU if that CPU is going down. Several synchronization methods have been tried, but when stressed they failed. This is a new approach. A sync_tsk thread is still created, and tasks may still block on a lock when the CPU is going down, but how that works is a bit different. When cpu_down() starts, it will create the sync_tsk and wait on it to report that the tasks currently pinned on the CPU are no longer pinned. But new tasks that are about to be pinned will still be allowed to do so at this time. Then the notifiers are called. Several notifiers will bring down tasks that will enter these locations. Some of these tasks will take locks of other tasks that are on the CPU. If we don't let those other tasks continue, but make them block until CPU down is done, the tasks that the notifiers are waiting on will never complete, as they are waiting for the locks held by the tasks that are blocked. Thus we still let the tasks pin the CPU until the notifiers are done. After the notifiers run, we then make new tasks entering the pinned-CPU sections grab a mutex and wait. This mutex is now a per-CPU mutex in the hotplug_pcp descriptor. To help things along, a new function called migrate_me() is created in the scheduler code. This function will try to migrate the current task off the CPU that is going down, if possible. When the sync_tsk is created, all tasks will then try to migrate off the CPU going down. There are several cases where this won't work, but it helps in most cases. After the notifiers are called, if a task can't migrate off but enters the pinned-CPU sections, it will be forced to wait on the hotplug_pcp mutex until the CPU down is complete. Then the scheduler will force the migration anyway. Also, I found that THREAD_BOUND needs to also be accounted for in the pinned-CPU sections, and migrate_disable no longer treats such tasks specially. This helps fix issues with ksoftirqd and workqueue threads that unbind on CPU down. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Disentangle worker accounting from rqlock | Thomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-clear-pf-thread-bound-on-fallback-rq.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: don't calculate hweight in update_migrate_disable() | Nicholas Mc Guire
Proposal for a minor optimization in update_migrate_disable(): it's only a few instructions saved, but those are in the hot path of locks, so it might be worth it. When being scheduled out while migrate_disable > 0 and migrate_disabled_updated is not yet set, we end up here (kernel/sched/core.c):

  static inline void update_migrate_disable(struct task_struct *p)
  {
      ...
      mask = tsk_cpus_allowed(p);
      if (p->sched_class->set_cpus_allowed)
          p->sched_class->set_cpus_allowed(p, mask);
      p->nr_cpus_allowed = cpumask_weight(mask);

As we can only get here if migrate_disable > 0, there is no need to calculate cpumask_weight(mask): tsk_cpus_allowed() in that case will return cpumask_of(task_cpu(p)), which can only have a Hamming weight of 1 anyway. So we can simply do:

  p->nr_cpus_allowed = 1;

without changing the behavior. Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10 | sched: Have migrate_disable ignore bounded threads | Peter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Do not compare cpu masks in scheduler | Peter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | allow preemption in recursive migrate_disable call | Nicholas Mc Guire
Minor cleanup in migrate_disable()/migrate_enable(). The recursive case does not need to disable preemption, as the task is "pinned" to the current CPU anyway, so it is safe to preempt it. Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10 | sched: Postpone actual migration disable to schedule | Steven Rostedt
migrate_disable() can cause a bit of overhead in the RT kernel, as changing the affinity is expensive to do at every lock encountered. As a running task cannot migrate, the actual disabling of migration does not need to occur until the task is about to schedule out. In most cases, a task that disables migration will enable it before it schedules, making this change improve performance tremendously. [ Frank Rowand: UP compile fix ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: teach migrate_disable about atomic contexts | Peter Zijlstra
<NMI>
  [<ffffffff812dafd8>] spin_bug+0x94/0xa8
  [<ffffffff812db07f>] do_raw_spin_lock+0x43/0xea
  [<ffffffff814fa9be>] _raw_spin_lock_irqsave+0x6b/0x85
  [<ffffffff8106ff9e>] ? migrate_disable+0x75/0x12d
  [<ffffffff81078aaf>] ? pin_current_cpu+0x36/0xb0
  [<ffffffff8106ff9e>] migrate_disable+0x75/0x12d
  [<ffffffff81115b9d>] pagefault_disable+0xe/0x1f
  [<ffffffff81047027>] copy_from_user_nmi+0x74/0xe6
  [<ffffffff810489d7>] perf_callchain_user+0xf3/0x135

Now clearly we can't go around taking locks from NMI context, cure this by short-circuiting migrate_disable() when we're in an atomic context already. Add some extra debugging to avoid things like:

  preempt_disable()
  migrate_disable();

  preempt_enable();
  migrate_enable();

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314967297.1301.14.camel@twins Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-wbot4vsmwhi8vmbf83hsclk6@git.kernel.org
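A rough sketch of the short-circuit; the migrate_disable_atomic counter is named after the behaviour described above, and details differ from the real patch.

  void migrate_disable(void)
  {
          struct task_struct *p = current;

          if (in_atomic() || irqs_disabled()) {
                  /* Never take locks from atomic/NMI context; just
                   * count the nesting for the debug checks. */
                  p->migrate_disable_atomic++;
                  return;
          }
          /* ... normal, possibly lock-taking slow path ... */
  }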
2014-04-10 | sched, rt: Fix migrate_enable() thinko | Mike Galbraith
Assigning mask = tsk_cpus_allowed(p) after p->migrate_disable = 0 ensures that we won't see a mask change; otherwise there is no push/pull and we stack tasks on one CPU. Also add a couple of fields to sched_debug for the next guy. [ Build fix from Stratos Psomadakis <psomas@gentoo.org> ] Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1314108763.6689.4.camel@marge.simson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Generic migrate_disable | Peter Zijlstra
Make migrate_disable() be a preempt_disable() for !rt kernels. This allows generic code to use it but still enforces that these code sections stay relatively small. A preemptible migrate_disable() accessible for general use would allow people to grow arbitrary per-cpu crap instead of cleaning these things up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-275i87sl8e1jcamtchmehonm@git.kernel.org
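For !RT kernels the mapping is essentially a macro substitution; a sketch of the intended #ifdef (config symbol as used by the -rt series):

  #ifdef CONFIG_PREEMPT_RT_FULL
  extern void migrate_disable(void);
  extern void migrate_enable(void);
  #else
  /* On !RT, migrate_disable() degenerates to preempt_disable(), which
   * keeps the protected sections small and non-preemptible. */
  # define migrate_disable()      preempt_disable()
  # define migrate_enable()       preempt_enable()
  #endif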
2014-04-10 | sched: Optimize migrate_disable | Peter Zijlstra
Change from task_rq_lock() to raw_spin_lock(&rq->lock) to avoid a few atomic ops. See comment on why it should be safe. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-cbz6hkl5r5mvwtx5s3tor2y6@git.kernel.org
2014-04-10 | sched-migrate-disable.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lock | Steven Rostedt
In -rt, most spin_locks() turn into mutexes. One of these spin_lock conversions is performed on the workqueue gcwq->lock. When the idle worker is woken, the first thing it will do is grab that same lock, and it too will block, possibly jumping into the same code; but because nr_running has already been decremented, an infinite loop is prevented. Still, this is a waste of CPU cycles, and it doesn't follow the method of mainline, as new workers should only be woken when a worker thread is truly going to sleep, and not just blocked on a spin_lock(). Check the saved_state too before waking up new workers. Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
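In sketch form, the check in the scheduler's sleep path becomes the following (wq_worker_sleeping() is the existing hook; the surrounding code is abridged):

  /*
   * Only notify the workqueue code when the worker really goes to
   * sleep; a worker blocked on a sleeping spinlock keeps its original
   * state in saved_state on -rt, so skip the wakeup in that case.
   */
  if (prev->flags & PF_WQ_WORKER && !prev->saved_state) {
          struct task_struct *to_wakeup;

          to_wakeup = wq_worker_sleeping(prev, cpu);
          if (to_wakeup)
                  try_to_wake_up_local(to_wakeup);
  }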
2014-04-10 | sched: ttwu: Return success when only changing the saved_state value | Thomas Gleixner
When a task blocks on an rt lock, it saves the current state in p->saved_state, so a lock-related wakeup will not destroy the original state. When a real wakeup happens while the task is already running due to a lock wakeup, we update p->saved_state to TASK_RUNNING, but we do not return success. This might cause another wakeup in the waitqueue code while the task remains on the waitqueue list. Return success in that case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
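The gist of the change inside try_to_wake_up()'s state check, sketched (abridged; saved_state handling per the -rt scheme described above):

  if (!(p->state & state)) {
          /*
           * The task is currently running because of a lock wakeup,
           * but the real wakeup targets the saved state: update it and
           * report success so the waiter is removed from the waitqueue.
           */
          if (p->saved_state & state) {
                  p->saved_state = TASK_RUNNING;
                  success = 1;
          }
          goto out;
  }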
2014-04-10 | sched-disable-ttwu-queue.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | cond-resched-softirq-fix.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-cond-resched.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-might-sleep-do-not-account-rcu-depth.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-rt-mutex-wakeup.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-mmdrop-delayed.patch | Thomas Gleixner
Needs thread context (pgd_lock) -> ifdeffed; workqueues won't work with RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched-limit-nr-migrate.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | hrtimer: fixup hrtimer callback changes for preempt-rt | Thomas Gleixner
In preempt-rt we cannot call callbacks which take sleeping locks from timer interrupt context. Bring back the softirq split for now, until we have fixed the signal delivery problem for real. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2014-04-10 | ptrace: fix ptrace vs tasklist_lock race | Sebastian Andrzej Siewior
As explained by Alexander Fyodorov <halcy@yandex.ru>:

|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel,
|and it can remove __TASK_TRACED from task->state (by moving it to
|task->saved_state). If parent does wait() on child followed by a sys_ptrace
|call, the following race can happen:
|
|- child sets __TASK_TRACED in ptrace_stop()
|- parent does wait() which eventually calls wait_task_stopped() and returns
|  child's pid
|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves
|  __TASK_TRACED flag to saved_state
|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive()

The patch is based on his initial patch, where an additional check is added in case the __TASK_TRACED flag has moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10 | tracing: Account for preempt off in preempt_schedule() | Steven Rostedt
preempt_schedule() uses the preempt_disable_notrace() version because the function tracer could otherwise cause infinite recursion: the function tracer uses preempt_enable_notrace(), which may call back into preempt_schedule() since NEED_RESCHED is still set and PREEMPT_ACTIVE has not been set yet. See commit d1f74e20b5b064a130cd0743a256c2d3cfe84010, which made this change. The preemptoff and preemptirqsoff latency tracers require the first and last preempt count modifiers to enable tracing. But this skips the checks. Since we cannot convert them back to the non-notrace version, we can use the idle() hooks for the latency tracers here. That is, start/stop_critical_timings() works well to manually start and stop the latency tracer for preempt-off timings. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | vtime-split-lock-and-seqcount.patch | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Consider pi boosting in setscheduler | Thomas Gleixner
If the policy/priority of a PI-boosted task is modified by a setscheduler() call, we unconditionally dequeue and requeue the task if it is on the runqueue, even if the new priority is lower than the current effective boosted priority. This can result in undesired reordering of the priority bucket list. If the new priority is less than or equal to the current effective priority, we just store the new parameters in the task struct and leave the scheduler class and the runqueue untouched. This is handled when the task deboosts itself. Only if the new priority is higher than the effective boosted priority do we apply the change immediately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
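A condensed sketch of the resulting decision in __sched_setscheduler(); __setscheduler_params() is an illustrative helper for the store-only path, and the control flow is heavily abridged:

  /*
   * p->prio is the effective (possibly boosted) kernel priority;
   * a lower numeric value means a higher priority.
   */
  if (newprio >= p->prio) {
          /* Not above the boosted priority: only record the new
           * parameters; the deboost path applies them later. */
          __setscheduler_params(p, policy, param->sched_priority);
          task_rq_unlock(rq, p, &flags);
          return 0;
  }
  /* The new priority beats the boost: dequeue, change and requeue now. */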
2014-04-10 | sched: Queue RT tasks to head when prio drops | Thomas Gleixner
The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks:

  T1: SCHED_FIFO, prio 80
  T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario):

  sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
  ...
  sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
  ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
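The fix boils down to passing a head flag when the task is put back; a sketch of the enqueue in __sched_setscheduler() (condition per the description above, abridged):

  if (on_rq) {
          /*
           * A task whose priority was lowered (or left equal) must not
           * end up behind peers it was ahead of: enqueue it at the
           * head of its new priority bucket instead of the tail.
           */
          enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);
  }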
2014-04-10 | sched: Adjust sched_reset_on_fork when nothing else changes | Thomas Gleixner
If the policy and priority remain unchanged, a possible modification of sched_reset_on_fork gets lost in the early exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: stable-rt@vger.kernel.org
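The shape of the fix, sketched: update sched_reset_on_fork before the "nothing changed" early exit in __sched_setscheduler() (abridged):

  /* Record reset_on_fork first so it is not lost by the early exit. */
  p->sched_reset_on_fork = reset_on_fork;

  /* If not changing anything else there is no need to proceed further. */
  if (unlikely(policy == p->policy && (!rt_policy(policy) ||
               param->sched_priority == p->rt_priority))) {
          task_rq_unlock(rq, p, &flags);
          return 0;
  }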
2014-04-10 | sched: Better debug output for might sleep | Thomas Gleixner
might_sleep() can tell us where interrupts have been disabled, but we have no idea what disabled preemption. Add some debug infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Check for idle task in might_sleep() | Thomas Gleixner
Idle is not allowed to call sleeping functions ever! Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10 | sched: Init idle->on_rq in init_idle() | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-22 | sched: Fix double normalization of vruntime | George McCollister
commit 791c9e0292671a3bfa95286bb5c08129d8605618 upstream. dequeue_entity() is called when p->on_rq is set and it sets se->on_rq = 0, which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule(), the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify that making this change corrects the problem on that platform and kernel version. Signed-off-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
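The affected check in switched_from_fair(), sketched with the corrected condition (abridged):

  /*
   * Use !p->on_rq, not !se->on_rq: dequeue_entity() has already set
   * se->on_rq = 0, so it cannot tell us whether the vruntime was
   * normalized at dequeue time.
   */
  if (!p->on_rq && p->state != TASK_RUNNING) {
          /* Fix up vruntime so the sleep does not turn into an
           * unlimited bonus. */
          place_entity(cfs_rq, se, 0);
          se->vruntime -= cfs_rq->min_vruntime;
  }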
2014-03-12 | sched/rt: Remove redundant nr_cpus_allowed test | Shawn Bohrer
commit 6bfa687c19b7ab8adee03f0d43c197c2945dd869 upstream. In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles") an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes the latter p->nr_cpus_allowed > 1 check redundant, which can now be removed. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: tomk@rgmadvisors.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12 | sched/rt: Add missing rmb() | Peter Zijlstra
commit 7c3f2ab7b844f1a859afbc3d41925e8a0faba5fa upstream. While discussing the proposed SCHED_DEADLINE patches which in parts mimic the existing FIFO code it was noticed that the wmb in rt_set_overloaded() didn't have a matching barrier. The only site using rt_overloaded() to test the rto_count is pull_rt_task() and we should issue a matching rmb before then assuming there's an rto_mask bit set. Without that smp_rmb() in there we could actually miss seeing the rto_mask bit. Also, change to using smp_[wr]mb(), even though this is SMP only code; memory barriers without smp_ always make me think they're against hardware of some sort. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: luca.abeni@unitn.it Cc: bruce.ashfield@windriver.com Cc: dhaval.giani@gmail.com Cc: rostedt@goodmis.org Cc: hgu1972@gmail.com Cc: oleg@redhat.com Cc: fweisbec@gmail.com Cc: darren@dvhart.com Cc: johan.eker@ericsson.com Cc: p.faure@akatech.ch Cc: paulmck@linux.vnet.ibm.com Cc: raistlin@linux.it Cc: claudio@evidence.eu.com Cc: insop.song@gmail.com Cc: michael@amarulasolutions.com Cc: liming.wang@windriver.com Cc: fchecconi@gmail.com Cc: jkacur@redhat.com Cc: tommaso.cucinotta@sssup.it Cc: Juri Lelli <juri.lelli@gmail.com> Cc: harald.gustafsson@ericsson.com Cc: nicola.manica@disi.unitn.it Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
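The pairing, sketched (function bodies abridged; names approximate the rt.c code described above):

  static inline void rt_set_overload(struct rq *rq)
  {
          cpumask_set_cpu(rq->cpu, rq->rd->rto_mask);
          /* Publish the rto_mask bit before bumping rto_count;
           * pairs with the smp_rmb() in pull_rt_task(). */
          smp_wmb();
          atomic_inc(&rq->rd->rto_count);
  }

  static int pull_rt_task(struct rq *this_rq)
  {
          if (likely(!rt_overloaded(this_rq)))
                  return 0;
          /* If we saw rto_count, we must also see the rto_mask bit. */
          smp_rmb();
          /* ... scan this_rq->rd->rto_mask and pull tasks ... */
          return 0;
  }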
2014-03-12 | sched: Assign correct scheduling domain to 'sd_llc' | Mel Gorman
commit 5d4cf996cf134e8ddb4f906b8197feb9267c2b77 upstream. Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL dereference on sd_busy but the fix also altered what scheduling domain it used for the 'sd_llc' percpu variable. One impact of this is that a task selecting a runqueue may consider idle CPUs that are not cache siblings as candidates for running. Tasks are then running on CPUs that are not cache hot. This was found through bisection where ebizzy threads were not seeing equal performance and it looked like a scheduling fairness issue. This patch mitigates but does not completely fix the problem on all machines tested, implying there may be an additional bug or a common root cause. Here is the average range of performance seen by individual ebizzy threads. It was tested on top of candidate patches related to x86 TLB range flushing.

4-core machine
                    3.13.0-rc3            3.13.0-rc3
                       vanilla            fixsd-v3r3
Mean   1      0.00 (  0.00%)      0.00 (  0.00%)
Mean   2      0.34 (  0.00%)      0.10 ( 70.59%)
Mean   3      1.29 (  0.00%)      0.93 ( 27.91%)
Mean   4      7.08 (  0.00%)      0.77 ( 89.12%)
Mean   5    193.54 (  0.00%)      2.14 ( 98.89%)
Mean   6    151.12 (  0.00%)      2.06 ( 98.64%)
Mean   7    115.38 (  0.00%)      2.04 ( 98.23%)
Mean   8    108.65 (  0.00%)      1.92 ( 98.23%)

8-core machine
Mean   1      0.00 (  0.00%)      0.00 (  0.00%)
Mean   2      0.40 (  0.00%)      0.21 ( 47.50%)
Mean   3     23.73 (  0.00%)      0.89 ( 96.25%)
Mean   4     12.79 (  0.00%)      1.04 ( 91.87%)
Mean   5     13.08 (  0.00%)      2.42 ( 81.50%)
Mean   6     23.21 (  0.00%)     69.46 (-199.27%)
Mean   7     15.85 (  0.00%)    101.72 (-541.77%)
Mean   8    109.37 (  0.00%)     19.13 ( 82.51%)
Mean  12    124.84 (  0.00%)     28.62 ( 77.07%)
Mean  16    113.50 (  0.00%)     24.16 ( 78.71%)

It's eliminated for one machine and reduced for another. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: H Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12 | sched: Initialize power_orig for overlapping groups | Peter Zijlstra
commit 8e8339a3a1069141985daaa2521ba304509ddecd upstream. Yinghai reported that he saw a /0 in sg_capacity on his EX parts. Make sure to always initialize power_orig now that we actually use it. Ideally build_sched_domains() -> init_sched_groups_power() would also initialize this; but for some yet unexplained reason some setups seem to miss updates there. Reported-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12 | sched: Avoid NULL dereference on sd_busy | Peter Zijlstra
commit 42eb088ed246a5a817bb45a8b32fe234cf1c0f8b upstream. Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some conditions leading to a possible NULL deref in set_cpu_sd_state_idle(). Reported-by: Anton Blanchard <anton@samba.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12 | sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus | Preeti U Murthy
commit 37dc6b50cee97954c4e6edcd5b1fa614b76038ee upstream. nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set. Therefore instead of updating nr_busy_cpus at every level of sched domain, since it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain. In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the highest domain which has SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set and trigger nohz_idle_balancing if any of the levels have more than one busy cpu. sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set so that it can be queried directly when required. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain where it is currently used. Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: svaidy@linux.vnet.ibm.com Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Cc: peterz@infradead.org Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12 | sched: Fix asymmetric scheduling for POWER7 | Vaidyanathan Srinivasan
commit 2042abe7977222ef606306faa2dce8fd51e98e65 upstream. Asymmetric scheduling within a core is a scheduler loadbalancing feature that is triggered when SD_ASYM_PACKING flag is set. The goal for the load balancer is to move tasks to lower order idle SMT threads within a core on a POWER7 system. In nohz_kick_needed(), we intend to check if our sched domain (core) is completely busy or we have idle cpu. The following check for SD_ASYM_PACKING: (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu) already covers the case of checking if the domain has an idle cpu, because cpumask_first_and() will not yield any set bits if this domain has no idle cpu. Hence, nr_busy check against group weight can be removed. Reported-by: Michael Neuling <michael.neuling@au1.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-01-15 | sched: Guarantee new group-entities always have weight | Paul Turner
commit 0ac9b1c21874d2490331233b3242085f8151e166 upstream. Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use. ( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
    put_prev_entity(..., t1)
      check_cfs_rq_runtime(group_cfs_rq(a[0]))
        throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time:

  enqueue_task_fair(rq[0], t2)
    enqueue_entity(group_cfs_rq(b[0]), t2)
      enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
      account_entity_enqueue(group_cfs_rq(b[0]), t2)
      update_cfs_shares(group_cfs_rq(b[0]))
        < skipped because b is part of a throttled hierarchy >
    enqueue_entity(group_cfs_rq(a[0]), b[0])
    ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0, which violates invariants in several code paths. Eliminate the possibility of this by initializing group entity weight. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
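The fix itself is a small initialization in the group-entity setup path; a sketch (update_load_set()/NICE_0_LOAD as in mainline, placement approximate):

  /* Give a newly created group entity a sane non-zero weight so the
   * first reweight cannot race against load.weight == 0. */
  update_load_set(&se->load, NICE_0_LOAD);
  se->parent = parent;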
2014-01-15 | sched: Fix hrtimer_cancel()/rq->lock deadlock | Ben Segall
commit 927b54fccbf04207ec92f669dce6806848cbec7d upstream. __start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock, waiting for the hrtimer to finish. However, if sched_cfs_period_timer runs for another loop iteration, the hrtimer can attempt to take rq->lock, resulting in deadlock. Fix this by ensuring that cfs_b->timer_active is cleared only if the _latest_ call to do_sched_cfs_period_timer is returning as idle. Then __start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for that to succeed or timer_active == 1. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
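A sketch of the resulting wait loop in __start_cfs_bandwidth() (abridged; comments paraphrase the description above):

  while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
         hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
          /* Bounce cfs_b->lock so do_sched_cfs_period_timer can run. */
          raw_spin_unlock(&cfs_b->lock);
          cpu_relax();
          raw_spin_lock(&cfs_b->lock);
          /* If someone else restarted the timer we are done. */
          if (cfs_b->timer_active)
                  return;
  }
  cfs_b->timer_active = 1;
  start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);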
2014-01-15 | sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining | Ben Segall
commit db06e78cc13d70f10877e0557becc88ab3ad2be8 upstream. hrtimer_expires_remaining does not take internal hrtimer locks and thus must be guarded against concurrent __hrtimer_start_range_ns (but returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15 | sched: Fix race on toggling cfs_bandwidth_used | Ben Segall
commit 1ee14e6c8cddeeb8a490d7b54cd9016e4bb900b4 upstream. When we transition cfs_bandwidth_used to false, any currently throttled groups will incorrectly return false from cfs_rq_throttled. While tg_set_cfs_bandwidth will unthrottle them eventually, currently running code (including at least dequeue_task_fair and distribute_cfs_runtime) will cause errors. Fix this by turning off cfs_bandwidth_used only after unthrottling all cfs_rqs. Tested: toggle bandwidth back and forth on a loaded cgroup. Caused crashes in minutes without the patch, hasn't crashed with it. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
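The ordering the patch establishes in tg_set_cfs_bandwidth(), sketched (helper names approximate):

  if (runtime_enabled && !runtime_was_enabled)
          cfs_bandwidth_usage_inc();   /* enable before throttling can occur */

  /* ... update quota/period, then unthrottle everything ... */
  for_each_online_cpu(i) {
          struct cfs_rq *cfs_rq = tg->cfs_rq[i];
          struct rq *rq = cpu_rq(i);

          raw_spin_lock_irq(&rq->lock);
          if (cfs_rq->throttled)
                  unthrottle_cfs_rq(cfs_rq);
          raw_spin_unlock_irq(&rq->lock);
  }

  if (runtime_was_enabled && !runtime_enabled)
          cfs_bandwidth_usage_dec();   /* only after all cfs_rqs are unthrottled */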
2014-01-09 | sched: numa: skip inaccessible VMAs | Mel Gorman
commit 3c67f474558748b604e247d92b55dfe89654c81d upstream. Inaccessible VMAs should not be trapping NUMA hint faults. Skip them. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
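The check amounts to one test in the NUMA hinting walk (task_numa_work()); sketched:

  /* Skip VMAs that cannot fault at all: inaccessible mappings should
   * never generate NUMA hinting faults. */
  if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
          continue;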
2014-01-09 | sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities | Kirill Tkhai
commit 757dfcaa41844595964f1220f1d33182dae49976 upstream. This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing the priority on a child level does not guarantee that the priority is the highest all over the rq. So this leak makes RT balancing unusable. A short example: the task having the highest priority among all of the rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> CC: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
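The guard, sketched for inc_rt_prio_smp() (dec_rt_prio_smp() gets the mirror check; abridged):

  static void inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
  {
          struct rq *rq = rq_of_rt_rq(rt_rq);

          /* Only the top-level rt_rq may touch the rq-wide cpupri;
           * a child group's priority says nothing about the whole rq. */
          if (&rq->rt != rt_rq)
                  return;

          if (rq->online && prio < prev_prio)
                  cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
  }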
2013-12-20 | sched: Avoid throttle_cfs_rq() racing with period_timer stopping | Ben Segall
commit f9f9ffc237dd924f048204e8799da74f9ecf40cf upstream. throttle_cfs_rq() doesn't check to make sure that period_timer is running, and while update_curr/assign_cfs_runtime does, a concurrently running period_timer on another cpu could cancel itself between this cpu's update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running in the tg to restart the timer, this causes the cfs_rq to be stranded forever. Fix this by calling __start_cfs_bandwidth() in throttle if the timer is inactive. (Also add some sched_debug lines for cfs_bandwidth.) Tested: make a run/sleep task in a cgroup, loop switching the cgroup between 1ms/100ms quota and unlimited, checking for timer_active=0 and throttled=1 as a failure. With the throttle_cfs_rq() change commented out this fails, with the full patch it passes. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
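The tail of throttle_cfs_rq() with the added check, sketched (abridged):

  raw_spin_lock(&cfs_b->lock);
  list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
  /* If no other cfs_rq will restart the period timer for us, make
   * sure it is running before this cfs_rq is parked. */
  if (!cfs_b->timer_active)
          __start_cfs_bandwidth(cfs_b);
  raw_spin_unlock(&cfs_b->lock);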
2013-09-20 | sched/balancing: Fix cfs_rq->task_h_load calculation | Vladimir Davydov
Patch a003a2 (sched: Consider runnable load average in move_tasks()) sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is always 0. This typo leads to all tasks having weight 0 when load balancing in a cpu-cgroup enabled setup. There should obviously be the sum of the weights of all runnable tasks there instead. Fix it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
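In sketch form, the correction seeds the top of the hierarchy with the cfs_rq's own runnable load in update_cfs_rq_h_load() (abridged):

  if (!se) {
          /* Top-level cfs_rq: h_load is its own runnable load, not the
           * unused (always zero) rq->avg.load_avg_contrib. */
          cfs_rq->h_load = cfs_rq->runnable_load_avg;
          cfs_rq->last_h_load_update = now;
  }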