Age | Commit message | Author |
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This fixes the following build error for the preempt-rt kernel.
make kernel/fork.o
CC kernel/fork.o
kernel/fork.c:90: error: section of tasklist_lock conflicts with previous declaration
make[2]: *** [kernel/fork.o] Error 1
make[1]: *** [kernel/fork.o] Error 2
The rt kernel cache aligns the RWLOCK in DEFINE_RWLOCK by default.
The non-rt kernels explicitly cache align only the tasklist_lock in
kernel/fork.c. That can create a build conflict.
This fixes the build problem by making the
non-rt kernels cache align RWLOCKs by default. The side effect is that
the other RWLOCKs are also cache aligned for non-rt.
This is a short term solution for rt only.
The longer term solution would be to push the cache aligned DEFINE_RWLOCK
to mainline. If there are objections, then we could create a
DEFINE_RWLOCK_CACHE_ALIGNED or something of that nature.
Comments? Objections?
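For illustration, a minimal sketch of the idea - cache align the lock right
in the definition macro so rt and non-rt agree (the exact rt macro may differ
in detail):

/* sketch only: cache align every statically defined rwlock */
#define DEFINE_RWLOCK(name) \
	rwlock_t name __cacheline_aligned_in_smp = __RW_LOCK_UNLOCKED(name)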
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.LFD.2.00.1109191104010.23118@localhost6.localdomain6
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Bad return value in _mutex_lock_check_stamp - this problem would only show
up with 3.12.1 rt4 applied but CONFIG_PREEMPT_RT_FULL not enabled. Currently
it would be returning whatever vprintk_emit ended up with (at least on x86),
which probably is not the intended behavior. Added a return 0; as in the
case with CONFIG_PREEMPT_RT_FULL enabled.
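For illustration only, a rough sketch of the !PREEMPT_RT_FULL stub with the
added return; the exact name and body in kernel/rtmutex.c may differ, the
point is just the explicit return value:

static inline int
_mutex_lock_check_stamp(struct rt_mutex *lock, struct ww_acquire_ctx *ctx)
{
	/* assumed: this stub is never meant to do real work without RT_FULL */
	BUG();
	return 0;	/* the previously missing return */
}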
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
lockdep says:
| --------------------------------------------------------------------------
| | Wound/wait tests |
| ---------------------
| ww api failures: ok | ok | ok |
| ww contexts mixing: ok | ok |
| finishing ww context: ok | ok | ok | ok |
| locking mismatches: ok | ok | ok |
| EDEADLK handling: ok | ok | ok | ok | ok | ok | ok | ok | ok | ok |
| spinlock nest unlocked: ok |
| -----------------------------------------------------
| |block | try |context|
| -----------------------------------------------------
| context: ok | ok | ok |
| try: ok | ok | ok |
| block: ok | ok | ok |
| spinlock: ok | ok | ok |
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
|
|
The shortcut on mainline skips lockdep. No idea why this is a good thing.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
With the migration pushdown a few of the do { } while (0)
loops became obsolete but got left over - this patch
only removes this fallout.
Patch applies on top of 3.12.9-rt13
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Pushdown of migrate_disable/enable from the read_*lock* to the rt_read_*lock*
API level.
general mapping to mutexes:
read_*lock*
  `-> rt_read_*lock*
        `-> __spin_lock (the sleeping spin locks)
              `-> rt_mutex
The real read_lock* mapping:
read_lock_irqsave -.
read_lock_irq      `-> rt_read_lock_irqsave()
`-> read_lock ---------.   \
    read_lock_bh ------+    \
                       `--> rt_read_lock()
                              if (rt_mutex_owner(lock) != current) {
                                `-> __rt_spin_lock()
                                      rt_spin_lock_fastlock()
                                        `-> rt_mutex_cmpxchg()
                                migrate_disable()
                              }
                              rwlock->read_depth++;
read_trylock mapping:
read_trylock
  `-> rt_read_trylock
        if (rt_mutex_owner(lock) != current) {
          `-> rt_mutex_trylock()
                rt_mutex_fasttrylock()
                  rt_mutex_cmpxchg()
          migrate_disable()
        }
        rwlock->read_depth++;
read_unlock* mapping:
read_unlock_bh --------+
read_unlock_irq -------+
read_unlock_irqrestore +
read_unlock -----------+
  `-> rt_read_unlock()
        if (--rwlock->read_depth == 0) {
          `-> __rt_spin_unlock()
                rt_spin_lock_fastunlock()
                  `-> rt_mutex_cmpxchg()
          migrate_enable()
        }
So calls to migrate_disable/enable() are better placed at the rt_read_*
level of lock/trylock/unlock as all of the read_*lock* API has this as a
common path. In the rt_read* API of lock/trylock/unlock the nesting level
is already being recorded in rwlock->read_depth, so we can push down the
migrate disable/enable to that level and condition it on the read_depth
going from 0 to 1 -> migrate_disable and 1 to 0 -> migrate_enable. This
eliminates the recursive calls that were needed when migrate_disable/enable
was done at the read_*lock* level.
This approach to read_*_bh also eliminates the concerns raised with
regard to API imbalances (read_lock_bh -> read_unlock + local_bh_enable).
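Condensed into code, the read side roughly becomes the following (simplified
sketch; lockdep annotations and the irqsave/bh variants are left out):

void __lockfunc rt_read_lock(rwlock_t *rwlock)
{
	struct rt_mutex *lock = &rwlock->lock;

	/* recursive read lock by the owner only bumps the depth */
	if (rt_mutex_owner(lock) != current) {
		migrate_disable();	/* read_depth goes 0 -> 1 */
		__rt_spin_lock(lock);
	}
	rwlock->read_depth++;
}

void __lockfunc rt_read_unlock(rwlock_t *rwlock)
{
	/* drop the lock and re-enable migration only at depth 0 */
	if (--rwlock->read_depth == 0) {
		__rt_spin_unlock(&rwlock->lock);
		migrate_enable();
	}
}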
Tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Pushdown of migrate_disable/enable from write_*lock* to the rt_write_*lock*
API level.
general mapping of write_*lock* to mutexes:
write_*lock*
  `-> rt_write_*lock*
        `-> __spin_lock (the sleeping __spin_lock)
              `-> rt_mutex
write_*lock*s are non-recursive, so we have two lock chains to consider:
  - write_trylock*/write_unlock
  - write_lock*/write_unlock
For both paths the migrate_disable/enable must be balanced.
write_trylock* mapping:
write_trylock_irqsave
  `-> rt_write_trylock_irqsave
write_trylock       \
  `--------> rt_write_trylock
               ret = rt_mutex_trylock
                       rt_mutex_fasttrylock
                         rt_mutex_cmpxchg
               if (ret)
                 migrate_disable
write_lock* mapping:
write_lock_irqsave
  `-> rt_write_lock_irqsave
write_lock_irq -> write_lock ----.   \
write_lock_bh -------------------+    \
                                 `-> rt_write_lock
                                       __rt_spin_lock()
                                         rt_spin_lock_fastlock()
                                           rt_mutex_cmpxchg()
                                       migrate_disable()
write_unlock* mapping:
write_unlock_irqrestore -.
write_unlock_bh ---------+
write_unlock_irq -> write_unlock ----+
                                     `-> rt_write_unlock()
                                           __rt_spin_unlock()
                                             rt_spin_lock_fastunlock()
                                               rt_mutex_cmpxchg()
                                           migrate_enable()
So calls to migrate_disable/enable() are better placed at the rt_write_*
level of lock/trylock/unlock as all of the write_*lock* API has this as a
common path.
This approach to write_*_bh also eliminates the concerns raised with
regard to API imbalances (write_lock_bh -> write_unlock + local_bh_enable).
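As code, the write-side pushdown roughly looks like this (simplified sketch,
lockdep annotations omitted):

int __lockfunc rt_write_trylock(rwlock_t *rwlock)
{
	int ret = rt_mutex_trylock(&rwlock->lock);

	if (ret)
		migrate_disable();	/* only once the lock is actually held */
	return ret;
}

void __lockfunc rt_write_unlock(rwlock_t *rwlock)
{
	__rt_spin_unlock(&rwlock->lock);
	migrate_enable();
}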
Tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
No need to unconditionally migrate_disable (what is it protecting?) and
re-enable on failure to acquire the lock.
This patch moves the migrate_disable to be conditioned on successful lock
acquisition only.
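The resulting pattern, sketched on a generic rt trylock (illustrative only,
not a copy of the function the patch actually touches):

int __lockfunc rt_spin_trylock(spinlock_t *lock)
{
	int ret = rt_mutex_trylock(&lock->lock);

	if (ret)	/* disable migration only on success ... */
		migrate_disable();
	return ret;	/* ... instead of disable-before/enable-on-failure */
}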
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Map spinlocks, rwlocks, rw_semaphores and semaphores to the rt_mutex
based locking functions for preempt-rt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
In exit_pi_state_list() we have the following locking construct:
spin_lock(&hb->lock);
raw_spin_lock_irq(&curr->pi_lock);
...
spin_unlock(&hb->lock);
In !RT this works, but on RT the migrate_enable() function which is
called from spin_unlock() sees atomic context due to the held pi_lock
and just decrements the migrate_disable_atomic counter of the
task. Now the next call to migrate_disable() sees the counter being
negative and issues a warning. That check should be in
migrate_enable() already.
Fix this by dropping pi_lock before unlocking hb->lock and reacquiring
pi_lock after that again. This is safe as the loop code reevaluates
head again under the pi_lock.
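The reordered sequence then looks roughly like this (fragment, simplified):

	spin_lock(&hb->lock);
	raw_spin_lock_irq(&curr->pi_lock);
	...
	raw_spin_unlock_irq(&curr->pi_lock);	/* drop pi_lock first ... */
	spin_unlock(&hb->lock);			/* ... so migrate_enable() runs outside atomic context */
	raw_spin_lock_irq(&curr->pi_lock);	/* retake it; the loop reevaluates head anyway */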
Reported-by: Yong Zhang <yong.zhang@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Requeue with timeout causes a bug with PREEMPT_RT_FULL.
The bug comes from a timed out condition.
TASK 1                                  TASK 2
------                                  ------
futex_wait_requeue_pi()
  futex_wait_queue_me()
  <timed out>

                                        double_lock_hb();

raw_spin_lock(pi_lock);
if (current->pi_blocked_on) {
} else {
  current->pi_blocked_on = PI_WAKE_INPROGRESS;
  raw_spin_unlock(pi_lock);
  spin_lock(hb->lock); <-- blocked!

                                        plist_for_each_entry_safe(this) {
                                          rt_mutex_start_proxy_lock();
                                          task_blocks_on_rt_mutex();
                                          BUG_ON(task->pi_blocked_on)!!!!
The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the
problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to
grab the hb->lock, which it fails to do. As the hb->lock is a mutex,
it will block and set the "pi_blocked_on" to the hb->lock.
When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails
because TASK 1's pi_blocked_on is no longer set to that, but instead
set to the hb->lock.
The fix:
When calling rt_mutex_start_proxy_lock() a check is made to see
if the proxy task's pi_blocked_on is set. If so, exit out early.
Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies
the proxy task that it is being requeued, and will handle things
appropriately.
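Sketched as code, the added check at the top of rt_mutex_start_proxy_lock()
looks something like this (simplified; the exact locking and error handling
in the patch may differ):

	raw_spin_lock(&task->pi_lock);
	if (task->pi_blocked_on) {
		/* task already blocks on something (e.g. hb->lock), bail out */
		raw_spin_unlock(&task->pi_lock);
		raw_spin_unlock(&lock->wait_lock);
		return -EAGAIN;
	}
	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
	raw_spin_unlock(&task->pi_lock);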
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
__raid_run_ops() disables preemption with get_cpu() around the access
to the raid5_percpu variables. That causes scheduling while atomic
spews on RT.
Serialize the access to the percpu data with a lock and keep the code
preemptible.
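A rough sketch of the resulting access pattern (the per-cpu lock and the
_light helpers are how the rt tree usually does this; names may differ in
the actual patch):

	struct raid5_percpu *percpu;
	int cpu;

	cpu = get_cpu_light();			/* stays preemptible, just notes the cpu */
	percpu = per_cpu_ptr(conf->percpu, cpu);
	spin_lock(&percpu->lock);		/* assumed per-cpu lock guarding the scratch data */
	/* ... run the stripe operations on percpu->... here ... */
	spin_unlock(&percpu->lock);
	put_cpu_light();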
Reported-by: Udo van den Heuvel <udovdh@xs4all.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Udo van den Heuvel <udovdh@xs4all.nl>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The processing of softirqs in irq thread context is a performance gain
for the non-rt workloads of a system, but it's counterproductive for
interrupts which are explicitly related to the realtime
workload. Allow such interrupts to prevent softirq processing in their
thread context.
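As far as I understand the patch, this is exposed as an extra irq flag
(IRQF_NO_SOFTIRQ_CALL); a driver for an rt-critical device would then
request its interrupt roughly like this (handler and device names are
made up):

	ret = request_threaded_irq(irq, hard_handler, thread_handler,
				   IRQF_NO_SOFTIRQ_CALL,
				   "rt-critical-dev", priv);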
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
When CONFIG_PREEMPT_RT_FULL is enabled, tasklets run as threads,
and spinlocks turn into mutexes. But this can cause issues with
tasks disabling tasklets. A tasklet runs under ksoftirqd, and
if a tasklet is disabled with tasklet_disable(), the tasklet
count is increased. When a tasklet runs, it checks this counter
and if it is set, it adds itself back on the softirq queue and
returns.
The problem arises in RT because ksoftirqd will see that a softirq
is ready to run (the tasklet softirq just re-armed itself), and will
not sleep, but instead run the softirqs again. The tasklet softirq
will still see that the count is non-zero and will not execute
the tasklet and requeue itself on the softirq again, which will
cause ksoftirqd to run it again and again and again.
It gets worse because ksoftirqd runs as a real-time thread.
If it preempted the task that disabled tasklets, and that task
has migration disabled, or can't run for other reasons, the tasklet
softirq will never run because the count will never be zero, and
ksoftirqd will go into an infinite loop. As an RT task, this
becomes a big problem.
This is a hack solution to have tasklet_disable stop tasklets, and
when a tasklet runs, instead of requeueing the tasklet for softirqd,
it delays it. When tasklet_enable() is called, and tasklets are
waiting, then the tasklet_enable() will kick the tasklets to continue.
This prevents the lockup from ksoftirqd going into an infinite loop.
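The enable-side kick then looks roughly like this (sketch; the pending bit
is the one this hack introduces for deferred tasklets):

void tasklet_enable(struct tasklet_struct *t)
{
	if (!atomic_dec_and_test(&t->count))
		return;
	/*
	 * A tasklet that ran while disabled marked itself pending instead
	 * of re-raising the softirq over and over; kick it now.
	 */
	if (test_and_clear_bit(TASKLET_STATE_PENDING, &t->state))
		tasklet_schedule(t);
}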
[ rostedt@goodmis.org: ported to 3.0-rt ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Proposal for a minor optimization in update_migrate_disable - it's only a few
instructions saved, but those are in the hot path of locks so it might be worth
it.
When being scheduled out while migrate_disable > 0 and migrate_disabled_updated
is not yet set we end up here (kernel/sched/core.c):
static inline void update_migrate_disable(struct task_struct *p)
{
	...
	mask = tsk_cpus_allowed(p);
	if (p->sched_class->set_cpus_allowed)
		p->sched_class->set_cpus_allowed(p, mask);
	p->nr_cpus_allowed = cpumask_weight(mask);
As we can only get here if migrate_disable > 0, there is no need to calculate
cpumask_weight(mask), as tsk_cpus_allowed() in that case will return
cpumask_of(task_cpu(p)), which can only have a Hamming weight of 1 anyway.
So we can simply do:
p->nr_cpus_allowed = 1;
without changing the behavior.
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Minor cleanup in migrate_disable/migrate_enable. The recursive case
does not need to disable preemption as it is "pinned" to the current
cpu anyway, so it is safe to preempt it.
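In code the cleanup amounts to roughly this (heavily simplified; the real
function does more bookkeeping):

void migrate_disable(void)
{
	struct task_struct *p = current;

	if (p->migrate_disable) {
		/* recursive case: already pinned, no need to disable preemption */
		p->migrate_disable++;
		return;
	}
	preempt_disable();
	pin_current_cpu();
	p->migrate_disable = 1;
	preempt_enable();
}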
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
The migrate_disable() can cause a bit of overhead in the RT kernel,
as changing the affinity is expensive to do at every lock encountered.
As a running task cannot migrate, the actual disabling of migration
does not need to occur until the task is about to schedule out.
In most cases, a task that disables migration will enable it before
it schedules, making this change improve performance tremendously.
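The fast path then degenerates to little more than a counter (very rough
sketch; the deferred affinity update happens in the scheduler):

void migrate_disable(void)
{
	/*
	 * Cheap: just count. The expensive affinity update is deferred to
	 * the scheduler (update_migrate_disable()), which only runs if the
	 * task really is switched out while migration is disabled.
	 */
	current->migrate_disable++;
}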
[ Frank Rowand: UP compile fix ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
<NMI> [<ffffffff812dafd8>] spin_bug+0x94/0xa8
[<ffffffff812db07f>] do_raw_spin_lock+0x43/0xea
[<ffffffff814fa9be>] _raw_spin_lock_irqsave+0x6b/0x85
[<ffffffff8106ff9e>] ? migrate_disable+0x75/0x12d
[<ffffffff81078aaf>] ? pin_current_cpu+0x36/0xb0
[<ffffffff8106ff9e>] migrate_disable+0x75/0x12d
[<ffffffff81115b9d>] pagefault_disable+0xe/0x1f
[<ffffffff81047027>] copy_from_user_nmi+0x74/0xe6
[<ffffffff810489d7>] perf_callchain_user+0xf3/0x135
Now clearly we can't go around taking locks from NMI context, cure
this by short-circuiting migrate_disable() when we're in an atomic
context already.
Add some extra debugging to avoid things like:
preempt_disable()
migrate_disable();
preempt_enable();
migrate_enable();
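A simplified sketch of the short-circuit (the real function keeps more state):

void migrate_disable(void)
{
	struct task_struct *p = current;

	if (in_atomic()) {
		/* can't take locks here (e.g. NMI); just keep a debug count */
		p->migrate_disable_atomic++;
		return;
	}
	p->migrate_disable++;	/* normal, preemptible path */
}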
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314967297.1301.14.camel@twins
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-wbot4vsmwhi8vmbf83hsclk6@git.kernel.org
|
|
Assigning mask = tsk_cpus_allowed(p) after p->migrate_disable = 0 ensures
that we won't see a mask change.. no push/pull, we stack tasks on one CPU.
Also add a couple fields to sched_debug for the next guy.
[ Build fix from Stratos Psomadakis <psomas@gentoo.org> ]
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1314108763.6689.4.camel@marge.simson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Make migrate_disable() be a preempt_disable() for !rt kernels. This
allows generic code to use it but still enforces that these code
sections stay relatively small.
A preemptible migrate_disable() accessible for general use would allow
people to grow arbitrary per-cpu crap instead of cleaning these things
up.
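The !rt mapping itself is trivial, roughly (sketch; the header it lives in
is not important here):

#ifdef CONFIG_PREEMPT_RT_FULL
extern void migrate_disable(void);
extern void migrate_enable(void);
#else
# define migrate_disable()	preempt_disable()
# define migrate_enable()	preempt_enable()
#endif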
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-275i87sl8e1jcamtchmehonm@git.kernel.org
|
|
Change from task_rq_lock() to raw_spin_lock(&rq->lock) to avoid a few
atomic ops. See comment on why it should be safe.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-cbz6hkl5r5mvwtx5s3tor2y6@git.kernel.org
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
RT added a two-byte migrate disable counter to the trace events
and used two bytes of the padding to make the change. The structures and
all were updated correctly, but the display in the event formats was
not:
cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 51
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;
	field:unsigned short common_migrate_disable;	offset:8;	size:2;	signed:0;
	field:int common_padding;	offset:10;	size:2;	signed:0;
The field for common_padding has the correct size and offset, but the
use of "int" might confuse some parsers (and people that are reading
it). This needs to be changed to "unsigned short".
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1321467575.4181.36.camel@frodo
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
cpu_unplug_begin() should be called before CPU_DOWN_PREPARE, because
at CPU_DOWN_PREPARE cpu_active is cleared and sched_domain is
rebuilt. Otherwise the 'sync_unplug' thread will be running on the cpu
on which it's created and not bound to the cpu which is about to go
down.
I found this via an incorrect warning on smp_processor_id() called by
sync_unplug/1; the trace below shows it:
(echo 1 > /sys/device/system/cpu/cpu1/online)
bash-1664 [000] 83.136620: _cpu_down: Bind sync_unplug to cpu 1
bash-1664 [000] 83.136623: sched_wait_task: comm=sync_unplug/1 pid=1724 prio=120
bash-1664 [000] 83.136624: _cpu_down: Wake sync_unplug
bash-1664 [000] 83.136629: sched_wakeup: comm=sync_unplug/1 pid=1724 prio=120 success=1 target_cpu=000
Wants to be folded back....
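The reordering in _cpu_down() is then just (fragment; error handling and the
surrounding code omitted, exact notifier call signature of that kernel
assumed):

	cpu_unplug_begin(cpu);		/* bind sync_unplug while the cpu is still active */
	err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);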
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/1318762607-2261-3-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|