|
This patch provides a recording mechanism to store data of potential
sources of system latencies. The histograms separately record the
latency caused by a delayed timer expiration, the latency caused by the
delayed wakeup of the related user space program, and the sum of both.
The histograms can be
enabled and reset individually. The data are accessible via the debug
filesystem. For details please consult Documentation/trace/histograms.txt.
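As an illustration of the kind of per-source accounting such histograms do,
a minimal self-contained sketch (bucket layout and names are made up here,
not the actual implementation):
  #define HIST_BUCKETS 64                 /* hypothetical: one bucket per microsecond */

  struct latency_hist {
          unsigned long bucket[HIST_BUCKETS];
          unsigned long above_range;      /* samples beyond the last bucket */
          unsigned long max_lat;          /* worst latency seen so far */
  };

  /* Account one observed latency (in microseconds) into a histogram. */
  static void latency_hist_account(struct latency_hist *h, unsigned long lat)
  {
          if (lat < HIST_BUCKETS)
                  h->bucket[lat]++;
          else
                  h->above_range++;
          if (lat > h->max_lat)
                  h->max_lat = lat;
  }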
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The waitqueue is protected by the pci_lock, so we can simply avoid
taking the waitqueue lock itself. That prevents the
might_sleep()/scheduling-while-atomic problem on RT.
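A hedged sketch of the resulting pattern (the wait-condition field name is
assumed here, only the locking scheme matters): since pci_lock already
serializes the waiters, the wakeup can use the locked variant and never
touches the waitqueue's internal lock:
  unsigned long flags;

  raw_spin_lock_irqsave(&pci_lock, flags);
  dev->block_cfg_access = 0;              /* assumed name of the wait condition */
  wake_up_locked(&pci_cfg_wait);          /* no nested wq->lock, nothing to sleep on */
  raw_spin_unlock_irqrestore(&pci_lock, flags);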
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
|
|
Preemption must be disabled before enabling interrupts in do_trap()
on x86_64, because the stack in use for int3 and debug is a per-CPU
stack set by the IST. But 32bit does not have an IST, the stack still
belongs to the current task, and there is no problem with scheduling
out the task.
Keep preemption enabled on X86_32 when enabling interrupts for
do_trap().
The functions are renamed from preempt_conditional_sti/cli() to
conditional_sti/cli_ist() to annotate that they are used when the
stack is an IST stack.
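A sketch of the shape such helpers might take (reconstructed for
illustration, not the literal patch):
  static void conditional_sti_ist(struct pt_regs *regs)
  {
  #ifdef CONFIG_X86_64
          /* int3/debug run on a per-CPU IST stack: must not schedule */
          preempt_disable();
  #endif
          if (regs->flags & X86_EFLAGS_IF)
                  local_irq_enable();
  }

  static void conditional_cli_ist(struct pt_regs *regs)
  {
          if (regs->flags & X86_EFLAGS_IF)
                  local_irq_disable();
  #ifdef CONFIG_X86_64
          preempt_enable_no_resched();
  #endif
  }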
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
With threaded interrupts we might see an interrupt in progress on
migration. Do not unmask it when this is the case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
On PREEMPT_RT enabled systems the interrupt handlers run as threads at prio 50
(by default). If a high priority userspace process tries to shut down a busy
network interface, it might spin in a yield loop waiting for the device to
become idle. Since the interrupt thread has a lower priority than the looping
process, it might never be scheduled, which results in a deadlock on UP
systems.
With Magic SysRq the following backtrace can be produced:
> test_app R running 0 174 168 0x00000000
> [<c02c7070>] (__schedule+0x220/0x3fc) from [<c02c7870>] (preempt_schedule_irq+0x48/0x80)
> [<c02c7870>] (preempt_schedule_irq+0x48/0x80) from [<c0008fa8>] (svc_preempt+0x8/0x20)
> [<c0008fa8>] (svc_preempt+0x8/0x20) from [<c001a984>] (local_bh_enable+0x18/0x88)
> [<c001a984>] (local_bh_enable+0x18/0x88) from [<c025316c>] (dev_deactivate_many+0x220/0x264)
> [<c025316c>] (dev_deactivate_many+0x220/0x264) from [<c023be04>] (__dev_close_many+0x64/0xd4)
> [<c023be04>] (__dev_close_many+0x64/0xd4) from [<c023be9c>] (__dev_close+0x28/0x3c)
> [<c023be9c>] (__dev_close+0x28/0x3c) from [<c023f7f0>] (__dev_change_flags+0x88/0x130)
> [<c023f7f0>] (__dev_change_flags+0x88/0x130) from [<c023f904>] (dev_change_flags+0x10/0x48)
> [<c023f904>] (dev_change_flags+0x10/0x48) from [<c024c140>] (do_setlink+0x370/0x7ec)
> [<c024c140>] (do_setlink+0x370/0x7ec) from [<c024d2f0>] (rtnl_newlink+0x2b4/0x450)
> [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) from [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4)
> [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) from [<c0256740>] (netlink_rcv_skb+0xac/0xc0)
> [<c0256740>] (netlink_rcv_skb+0xac/0xc0) from [<c024bbd8>] (rtnetlink_rcv+0x18/0x24)
> [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) from [<c02561b8>] (netlink_unicast+0x13c/0x198)
> [<c02561b8>] (netlink_unicast+0x13c/0x198) from [<c025651c>] (netlink_sendmsg+0x264/0x2e0)
> [<c025651c>] (netlink_sendmsg+0x264/0x2e0) from [<c022af98>] (sock_sendmsg+0x78/0x98)
> [<c022af98>] (sock_sendmsg+0x78/0x98) from [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278)
> [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) from [<c022cf08>] (__sys_sendmsg+0x48/0x78)
> [<c022cf08>] (__sys_sendmsg+0x48/0x78) from [<c0009320>] (ret_fast_syscall+0x0/0x2c)
This patch works around the problem by replacing yield() with msleep(1),
giving the interrupt thread time to finish, similar to other changes contained
in the rt patch set. Using wait_for_completion() instead would probably be a
better solution.
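The change boils down to the following pattern in dev_deactivate_many()
(abridged sketch):
  /* Wait for outstanding qdisc_run calls. Sleep instead of yielding so
   * the threaded interrupt handler can run even if the caller has a
   * higher real-time priority.
   */
  while (some_qdisc_is_busy(dev))
          msleep(1);              /* was: yield(); */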
Cc: stable-rt@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.0-rc3+ #26
-------------------------------------------------------
ip/1104 is trying to acquire lock:
(local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68
but task is already holding lock:
(sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_INET){+.+...}:
[<ffffffff810836e5>] lock_acquire+0x103/0x12e
[<ffffffff813e2781>] lock_sock_nested+0x82/0x92
[<ffffffff81433308>] lock_sock+0x10/0x12
[<ffffffff81433afa>] tcp_close+0x1b/0x355
[<ffffffff81453c99>] inet_release+0xc3/0xcd
[<ffffffff813dff3f>] sock_release+0x1f/0x74
[<ffffffff813dffbb>] sock_close+0x27/0x2b
[<ffffffff81129c63>] fput+0x11d/0x1e3
[<ffffffff81126577>] filp_close+0x70/0x7b
[<ffffffff8112667a>] sys_close+0xf8/0x13d
[<ffffffff814ae882>] system_call_fastpath+0x16/0x1b
-> #0 (local_softirq_lock){+.+...}:
[<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
[<ffffffff810836e5>] lock_acquire+0x103/0x12e
[<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
[<ffffffff81056d12>] __local_lock+0x25/0x68
[<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
[<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
[<ffffffff81433c38>] tcp_close+0x159/0x355
[<ffffffff81453c99>] inet_release+0xc3/0xcd
[<ffffffff813dff3f>] sock_release+0x1f/0x74
[<ffffffff813dffbb>] sock_close+0x27/0x2b
[<ffffffff81129c63>] fput+0x11d/0x1e3
[<ffffffff81126577>] filp_close+0x70/0x7b
[<ffffffff8112667a>] sys_close+0xf8/0x13d
[<ffffffff814ae882>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET);
lock(local_softirq_lock);
lock(sk_lock-AF_INET);
lock(local_softirq_lock);
*** DEADLOCK ***
1 lock held by ip/1104:
#0: (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12
stack backtrace:
Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ #26
Call Trace:
[<ffffffff81081649>] print_circular_bug+0x1f8/0x209
[<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
[<ffffffff81056d12>] ? __local_lock+0x25/0x68
[<ffffffff810836e5>] lock_acquire+0x103/0x12e
[<ffffffff81056d12>] ? __local_lock+0x25/0x68
[<ffffffff81046c75>] ? get_parent_ip+0x11/0x41
[<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
[<ffffffff81056d12>] ? __local_lock+0x25/0x68
[<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41
[<ffffffff81056d12>] __local_lock+0x25/0x68
[<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
[<ffffffff81433308>] ? lock_sock+0x10/0x12
[<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
[<ffffffff81433c38>] tcp_close+0x159/0x355
[<ffffffff81453c99>] inet_release+0xc3/0xcd
[<ffffffff813dff3f>] sock_release+0x1f/0x74
[<ffffffff813dffbb>] sock_close+0x27/0x2b
[<ffffffff81129c63>] fput+0x11d/0x1e3
[<ffffffff81126577>] filp_close+0x70/0x7b
[<ffffffff8112667a>] sys_close+0xf8/0x13d
[<ffffffff814ae882>] system_call_fastpath+0x16/0x1b
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Timekeeping suspend/resume calls read_persistent_clock(), which takes
rtc_lock. That results in might_sleep() warnings because at that point
we run with interrupts disabled.
We cannot convert rtc_lock to a raw spinlock as that would trigger
other might_sleep() warnings.
As a temporary workaround we disable the might_sleep() warnings by
setting system_state to SYSTEM_SUSPEND before calling sysdev_suspend()
and restoring it to SYSTEM_RUNNING after sysdev_resume().
Needs to be revisited.
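A sketch of the workaround's shape (assuming a SYSTEM_SUSPEND value is added
to enum system_states and that the might_sleep() check honours it):
  system_state = SYSTEM_SUSPEND;          /* silence might_sleep() */
  error = sysdev_suspend(PMSG_SUSPEND);
  if (!error) {
          /* ... enter the low power state, then come back ... */
          sysdev_resume();
  }
  system_state = SYSTEM_RUNNING;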
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Now that all users are cleaned up, we can remove the preemption count.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-m6yuzd6ul717hlnl2gj6p3ou@git.kernel.org
|
|
Adding migrate_disable() to pagefault_disable() to preserve the
per-cpu thing for kmap_atomic might not have been the best of choices.
But short of adding preempt_disable/migrate_disable foo all over the
kmap code it still seems the best way.
It does however yield the below borkage and also wrecks !-rt builds,
since !-rt relies on pagefault_disable() not preempting. So fix all
that up by adding raw_pagefault_disable().
<NMI> [<ffffffff81076d5c>] warn_slowpath_common+0x85/0x9d
[<ffffffff81076e17>] warn_slowpath_fmt+0x46/0x48
[<ffffffff814f7fca>] ? _raw_spin_lock+0x6c/0x73
[<ffffffff810cac87>] ? watchdog_overflow_callback+0x9b/0xd0
[<ffffffff810caca3>] watchdog_overflow_callback+0xb7/0xd0
[<ffffffff810f51bb>] __perf_event_overflow+0x11c/0x1fe
[<ffffffff810f298f>] ? perf_event_update_userpage+0x149/0x151
[<ffffffff810f2846>] ? perf_event_task_disable+0x7c/0x7c
[<ffffffff810f5b7c>] perf_event_overflow+0x14/0x16
[<ffffffff81046e02>] x86_pmu_handle_irq+0xcb/0x108
[<ffffffff814f9a6b>] perf_event_nmi_handler+0x46/0x91
[<ffffffff814fb2ba>] notifier_call_chain+0x79/0xa6
[<ffffffff814fb34d>] __atomic_notifier_call_chain+0x66/0x98
[<ffffffff814fb2e7>] ? notifier_call_chain+0xa6/0xa6
[<ffffffff814fb393>] atomic_notifier_call_chain+0x14/0x16
[<ffffffff814fb3c3>] notify_die+0x2e/0x30
[<ffffffff814f8f75>] do_nmi+0x7e/0x22b
[<ffffffff814f8bca>] nmi+0x1a/0x2c
[<ffffffff814fb130>] ? sub_preempt_count+0x4b/0xaa
<<EOE>> <IRQ> [<ffffffff812d44cc>] delay_tsc+0xac/0xd1
[<ffffffff812d4399>] __delay+0xf/0x11
[<ffffffff812d95d9>] do_raw_spin_lock+0xd2/0x13c
[<ffffffff814f813e>] _raw_spin_lock_irqsave+0x6b/0x85
[<ffffffff8106772a>] ? task_rq_lock+0x35/0x8d
[<ffffffff8106772a>] task_rq_lock+0x35/0x8d
[<ffffffff8106fe2f>] migrate_disable+0x65/0x12c
[<ffffffff81114e69>] pagefault_disable+0xe/0x1f
[<ffffffff81039c73>] dump_trace+0x21f/0x2e2
[<ffffffff8103ad79>] show_trace_log_lvl+0x54/0x5d
[<ffffffff8103ad97>] show_trace+0x15/0x17
[<ffffffff814f4f5f>] dump_stack+0x77/0x80
[<ffffffff812d94b0>] spin_bug+0x9c/0xa3
[<ffffffff81067745>] ? task_rq_lock+0x50/0x8d
[<ffffffff812d954e>] do_raw_spin_lock+0x47/0x13c
[<ffffffff814f7fbe>] _raw_spin_lock+0x60/0x73
[<ffffffff81067745>] ? task_rq_lock+0x50/0x8d
[<ffffffff81067745>] task_rq_lock+0x50/0x8d
[<ffffffff8106fe2f>] migrate_disable+0x65/0x12c
[<ffffffff81114e69>] pagefault_disable+0xe/0x1f
[<ffffffff81039c73>] dump_trace+0x21f/0x2e2
[<ffffffff8104369b>] save_stack_trace+0x2f/0x4c
[<ffffffff810a7848>] save_trace+0x3f/0xaf
[<ffffffff810aa2bd>] mark_lock+0x228/0x530
[<ffffffff810aac27>] __lock_acquire+0x662/0x1812
[<ffffffff8103dad4>] ? native_sched_clock+0x37/0x6d
[<ffffffff810a790e>] ? trace_hardirqs_off_caller+0x1f/0x99
[<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218
[<ffffffff810ac403>] lock_acquire+0x145/0x18a
[<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218
[<ffffffff814f7f9e>] _raw_spin_lock+0x40/0x73
[<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218
[<ffffffff810693f6>] sched_rt_period_timer+0xbd/0x218
[<ffffffff8109aa39>] __run_hrtimer+0x1e4/0x347
[<ffffffff81069339>] ? can_migrate_task.clone.82+0x14a/0x14a
[<ffffffff8109b97c>] hrtimer_interrupt+0xee/0x1d6
[<ffffffff814fb23d>] ? add_preempt_count+0xae/0xb2
[<ffffffff814ffb38>] smp_apic_timer_interrupt+0x85/0x98
[<ffffffff814fef13>] apic_timer_interrupt+0x13/0x20
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-31keae8mkjiv8esq4rl76cib@git.kernel.org
|
|
Wrap the test for pagefault_disabled() into a helper; this allows us
to remove the need for current->pagefault_disabled on !-rt kernels.
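A sketch of what such a helper can look like (the -rt side uses the per-task
flag, !-rt keeps relying on the preempt count; details assumed):
  static inline bool pagefault_disabled(void)
  {
  #ifdef CONFIG_PREEMPT_RT_FULL
          return current->pagefault_disabled != 0;
  #else
          return in_atomic();
  #endif
  }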
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3yy517m8zsi9fpsf14xfaqkw@git.kernel.org
|
|
Necessary for decoupling pagefault disable from preempt count.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add a pagefault_disabled variable to task_struct to allow decoupling
the pagefault-disabled logic from the preempt count.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Use disable_irq_nosync() instead of disable_irq() as this might be
called in atomic context with netpoll.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Otherwise the device is not completely shut down.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
By default the TCLIB uses the 32 KiHz base clock rate for clock events.
Add a compile time selection to allow a higher clock resolution.
(fixed up by Sami Pietikäinen <Sami.Pietikainen@wapice.com>)
Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Setup and remove the interrupt handler in clock event mode selection.
This avoids calling the (shared) interrupt handler when the device is
not used.
Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: redo the patch with NR_IRQS_LEGACY, which is probably required since
commit 8fe82a55 ("ARM: at91: sparse irq support"), included since v3.6.
Patch based on what Sami Pietikäinen <Sami.Pietikainen@wapice.com> suggested.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
No need to keep preemption disabled across the whole function.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
On x86_64 we must disable preemption before we enable interrupts
for stack faults, int3 and debugging, because the current task is using
a per-CPU debug stack defined by the IST. If we schedule out, another task
can come in and use the same stack and corrupt it, crashing the kernel on
return.
When CONFIG_PREEMPT_RT_FULL is enabled, spinlocks become mutexes, and
one of these is the spinlock used in signal handling.
Some of the debug code (int3) causes do_trap() to send a signal.
This function takes a spinlock that has been converted to a mutex
and may therefore sleep. If this happens, the stack corruption described
above becomes possible.
Instead of sending the signal right away, for PREEMPT_RT and x86_64
the signal information is stored in the task's task_struct and
TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume
code will send the signal when preemption is enabled.
[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT_FULL to
ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]
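A sketch of the deferral (the task_struct field name is assumed for
illustration):
  #ifdef ARCH_RT_DELAYS_SIGNAL_SEND
          if (in_atomic()) {
                  /* Remember the signal and deliver it on return to user
                   * space, once preemption is enabled again.
                   */
                  current->forced_info = *info;   /* assumed field */
                  set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
                  return;
          }
  #endif
          force_sig_info(sig, info, current);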
Cc: stable-rt@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
To avoid allocation, allow RT tasks to cache one sigqueue struct in
the task struct.
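A sketch of the allocation fast path with a one-element per-task cache
(field and helper names are illustrative):
  static struct sigqueue *sigqueue_alloc_cached(struct task_struct *t,
                                                gfp_t gfp_flags)
  {
          struct sigqueue *q = t->sigqueue_cache;        /* assumed field */

          if (q)
                  t->sigqueue_cache = NULL;              /* take the cached entry */
          else
                  q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
          return q;
  }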
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
POSIX timers should send neither broadcast signals nor kernel-only
signals. Prevent it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The arm boot_lock is used by the secondary processor startup code. The locking
task is the idle thread, which has idle->sched_class == &idle_sched_class.
idle_sched_class->enqueue_task == NULL, so if the idle task blocks on the
lock, the attempt to wake it when the lock becomes available will fail:
try_to_wake_up()
...
activate_task()
enqueue_task()
p->sched_class->enqueue_task(rq, p, flags)
Fix by converting boot_lock to a raw spin lock.
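The conversion itself is mechanical (sketch; the surrounding platform SMP
code is abridged):
  /* was: static DEFINE_SPINLOCK(boot_lock); */
  static DEFINE_RAW_SPINLOCK(boot_lock);

          /* in the secondary startup / boot paths: */
          raw_spin_lock(&boot_lock);      /* spins, never blocks the idle task */
          /* ... */
          raw_spin_unlock(&boot_lock);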
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Link: http://lkml.kernel.org/r/4E77B952.3010606@am.sony.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
As explained by Alexander Fyodorov <halcy@yandex.ru>:
|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel,
|and it can remove __TASK_TRACED from task->state (by moving it to
|task->saved_state). If parent does wait() on child followed by a sys_ptrace
|call, the following race can happen:
|
|- child sets __TASK_TRACED in ptrace_stop()
|- parent does wait() which eventually calls wait_task_stopped() and returns
| child's pid
|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves
| __TASK_TRACED flag to saved_state
|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive()
The patch is based on his initial patch, with an additional check added
for the case where __TASK_TRACED has been moved to ->saved_state. The
pi_lock is taken to guard against the caller being interrupted between
looking at ->state and ->saved_state.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
preempt_schedule() uses the preempt_disable_notrace() version because
otherwise the function tracer can cause infinite recursion: the function
tracer uses preempt_enable_notrace(), which may call back into
preempt_schedule() because NEED_RESCHED is still set and PREEMPT_ACTIVE
has not been set yet.
See commit d1f74e20b5b064a130cd0743a256c2d3cfe84010, which made this
change.
The preemptoff and preemptirqsoff latency tracers require the first
and last preempt count modifiers to enable tracing. But the notrace
variants skip those checks. Since we cannot convert them back to the
non-notrace version, we can use the idle hooks for the latency tracers
here. That is, start/stop_critical_timings() work well to manually
start and stop the latency tracer for preempt-off timings.
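The idle-loop usage then looks roughly like this (abridged sketch of a
cpu_idle() style loop; the fallback shown is illustrative):
          /* Do not account the idle time itself as a preempt-off section */
          stop_critical_timings();
          if (cpuidle_idle_call())
                  default_idle();
          start_critical_timings();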
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Moving the blk_sched_flush_plug() call out of the interrupt/preempt
disabled region in the scheduler allows us to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug().
Now instead of doing this we disable interrupts explicitly when we
lock the request_queue and reenable them when we drop the lock. That
allows interrupts to be handled when the plug list contains requests
for more than one queue.
Aside from that, this change makes the scope of the irq disabled region
more obvious. The current code confused the hell out of me when
looking at:
local_irq_save(flags);
spin_lock(q->queue_lock);
...
queue_unplugged(q...);
scsi_request_fn();
spin_unlock(q->queue_lock);
spin_lock(shost->host_lock);
spin_unlock_irq(shost->host_lock);
-------------------^^^ ????
spin_lock_irq(q->queue_lock);
spin_unlock(q->lock);
local_irq_restore(flags);
Also add a comment to __blk_run_queue() documenting that
q->request_fn() can drop q->queue_lock and reenable interrupts, but
must return with q->queue_lock held and interrupts disabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
|
|
If the policy/priority of a PI boosted task is modified by a
setscheduler() call, we unconditionally dequeue and requeue the task if
it is on the runqueue, even if the new priority is lower than the
current effective boosted priority. This can result in undesired
reordering of the priority bucket list.
If the new priority is less than or equal to the current effective
priority, we just store the new parameters in the task struct and leave
the scheduler class and the runqueue untouched. This is handled when
the task deboosts itself. Only if the new priority is higher than the
effective boosted priority do we apply the change immediately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: stable-rt@vger.kernel.org
|
|
The following scenario does not work correctly:
Runqueue of CPUx contains two runnable and pinned tasks:
T1: SCHED_FIFO, prio 80
T2: SCHED_FIFO, prio 80
T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):
sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
...
sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
...
Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!
The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.
This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.
We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.
It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
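The fix is small; its shape is roughly the following (sketch, reusing the
ENQUEUE_HEAD flag introduced for commit 60db48c, with variable names
abridged):
          if (on_rq)
                  /* Priority was lowered: requeue at the head of the bucket */
                  enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);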
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: stable-rt@vger.kernel.org
|
|
If the policy and priority remain unchanged a possible modification of
sched_reset_on_fork gets lost in the early exit path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: stable-rt@vger.kernel.org
|
|
might_sleep() can tell us where interrupts have been disabled, but we
have no idea what disabled preemption. Add some debug infrastructure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Idle is not allowed to call sleeping functions ever!
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This more or less reverts commit 6391f697d4892a6f233501beea553e13f7745a23.
Instead of adding an unneeded 'default', mark the variable to prevent
the false positive 'uninitialized var' warning. The other change (fixing
the printout) needs to be reverted, too. We want to know WHICH critical
irq failed, not which level it had.
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anatolij Gustschin <agust@denx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
There are static initializer macros for three out of the four possible
notifier types:
ATOMIC_NOTIFIER_HEAD()
BLOCKING_NOTIFIER_HEAD()
RAW_NOTIFIER_HEAD()
This patch provides a static initializer for the fourth type (SRCU) to
make the set complete.
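Assuming the new macro follows the existing naming convention as
SRCU_NOTIFIER_HEAD(), an SRCU notifier chain can then be defined with a
compile-time initializer just like the other three types (chain and
callback names below are made up):
  SRCU_NOTIFIER_HEAD(demo_chain);         /* static (compile time) initialization */

  static int demo_callback(struct notifier_block *nb, unsigned long action,
                           void *data)
  {
          return NOTIFY_OK;
  }

  static struct notifier_block demo_nb = {
          .notifier_call = demo_callback,
  };

  /* somewhere in init code: */
  srcu_notifier_chain_register(&demo_chain, &demo_nb);
  srcu_notifier_call_chain(&demo_chain, 0, NULL);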
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Issue debugged by Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
sparc does not have the CONFIG_EARLY_PRINTK option, so
early-printk-consolidate.patch breaks compilation:
arch/sparc/built-in.o: In function `setup_arch':
(.init.text+0x15e4): undefined reference to `early_console'
arch/sparc/built-in.o: In function `setup_arch':
(.init.text+0x15ec): undefined reference to `early_console'
The below addition fixes that.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
if the RCU core needs anything from this CPU. If so, RCU raises
RCU_SOFTIRQ to carry out any needed processing.
This approach has worked well historically, but it is undesirable on
NO_HZ_FULL CPUs. Such CPUs are expected to spend almost all of their time
in userspace, so that scheduling-clock interrupts can be disabled while
there is only one runnable task on the CPU in question. Unfortunately,
raising any softirq has the potential to wake up ksoftirqd, which would
provide the second runnable task on that CPU, preventing disabling of
scheduling-clock interrupts.
What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
relying on the grace-period kthreads' quiescent-state forcing to
do any needed RCU work on behalf of those CPUs.
This commit therefore refrains from raising RCU_SOFTIRQ on any
NO_HZ_FULL CPUs during any grace periods that have been in effect
for less than one second. The one-second limit handles the case
where an inappropriate workload is running on a NO_HZ_FULL CPU
that features lots of scheduling-clock interrupts, but no idle
or userspace time.
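The gating check has roughly this shape (reconstructed sketch, not
necessarily the exact code):
  /*
   * Leave a NO_HZ_FULL CPU alone unless the current grace period has
   * already been in effect for at least a second.
   */
  static int rcu_nohz_full_cpu(struct rcu_state *rsp)
  {
  #ifdef CONFIG_NO_HZ_FULL
          if (tick_nohz_full_cpu(smp_processor_id()) &&
              (!rcu_gp_in_progress(rsp) ||
               ULONG_CMP_LT(jiffies, ACCESS_ONCE(rsp->gp_start) + HZ)))
                  return 1;       /* do not raise RCU_SOFTIRQ here */
  #endif
          return 0;
  }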
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <bitbucket@online.de>
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
neigh_priv_len is defined as u8. With all debug enabled, struct
ipoib_neigh is 200 bytes. The largest part is sk_buff_head with 96
bytes, of which the spinlock accounts for 72 bytes.
This size still fits into the u8, leaving some room for more.
On -RT struct ipoib_neigh has put on weight and is 392 bytes. The main
reason is sk_buff_head with 288 bytes, the fatty here being the spinlock
with 192 bytes. This no longer fits into neigh_priv_len and gcc
complains.
This patch changes neigh_priv_len from 8bit to 16bit. Since the
following element (dev_id) is 16bit and is followed by a spinlock which
is aligned, the struct remains at a total size of 3200 (allmodconfig) /
2048 (with as much debug off as possible) bytes on x86-64.
On x86-32 the struct is 1856 (allmodconfig) / 1216 (with as much debug
off as possible) bytes long. The numbers were obtained with and without
the patch to prove that this change does not increase the size of the
struct.
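The change itself boils down to a type adjustment in struct net_device
(excerpt, surrounding members omitted):
  struct net_device {
          /* ... */
          unsigned short          neigh_priv_len; /* was: unsigned char */
          unsigned short          dev_id;
          /* ... */
  };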
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|