summaryrefslogtreecommitdiff
path: root/kernel/workqueue.c
AgeCommit message (Collapse)Author
2015-02-13workqueue: Prevent deadlock/stall on RTThomas Gleixner
Austin reported a XFS deadlock/stall on RT where scheduled work gets never exececuted and tasks are waiting for each other for ever. The underlying problem is the modification of the RT code to the handling of workers which are about to go to sleep. In mainline a worker thread which goes to sleep wakes an idle worker if there is more work to do. This happens from the guts of the schedule() function. On RT this must be outside and the accessed data structures are not protected against scheduling due to the spinlock to rtmutex conversion. So the naive solution to this was to move the code outside of the scheduler and protect the data structures by the pool lock. That approach turned out to be a little naive as we cannot call into that code when the thread blocks on a lock, as it is not allowed to block on two locks in parallel. So we dont call into the worker wakeup magic when the worker is blocked on a lock, which causes the deadlock/stall observed by Austin and Mike. Looking deeper into that worker code it turns out that the only relevant data structure which needs to be protected is the list of idle workers which can be woken up. So the solution is to protect the list manipulation operations with preempt_enable/disable pairs on RT and call unconditionally into the worker code even when the worker is blocked on a lock. The preemption protection is safe as there is nothing which can fiddle with the list outside of thread context. Reported-and_tested-by: Austin Schuh <austin@peloton-tech.com> Reported-and_tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-02-13sched: Distangle worker accounting from rqlockThomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13workqueue vs ata-piix livelock fixupThomas Gleixner
An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 Apparently, the softirq demon and the ata_piix IRQ handler were waiting for each other to finish ending up in a livelock. After the below patch was applied, the system no longer crashes. Reported-by: Carsten Emde <C.Emde@osadl.org> Proposed-by: Thomas Gleixner <tglx@linutronix.de> Tested by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13Use local irq lock instead of irq disable regionsThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13workqueue: Use normal rcuThomas Gleixner
There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13Reset to 3.12.37Scott Wood
2014-05-14sched: Distangle worker accounting from rqlockThomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14workqueue vs ata-piix livelock fixupThomas Gleixner
An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 Apparently, the softirq demon and the ata_piix IRQ handler were waiting for each other to finish ending up in a livelock. After the below patch was applied, the system no longer crashes. Reported-by: Carsten Emde <C.Emde@osadl.org> Proposed-by: Thomas Gleixner <tglx@linutronix.de> Tested by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14Use local irq lock instead of irq disable regionsThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14workqueue: Use normal rcuThomas Gleixner
There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14Reset to 3.12.19Scott Wood
2014-04-10sched: Distangle worker accounting from rqlockThomas Gleixner
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10workqueue vs ata-piix livelock fixupThomas Gleixner
An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 Apparently, the softirq demon and the ata_piix IRQ handler were waiting for each other to finish ending up in a livelock. After the below patch was applied, the system no longer crashes. Reported-by: Carsten Emde <C.Emde@osadl.org> Proposed-by: Thomas Gleixner <tglx@linutronix.de> Tested by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10Use local irq lock instead of irq disable regionsThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10workqueue: Use normal rcuThomas Gleixner
There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-05workqueue: ensure @task is valid across kthread_stop()Lai Jiangshan
commit 5bdfff96c69a4d5ab9c49e60abf9e070ecd2acbb upstream. When a kworker should die, the kworkre is notified through WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is primarily to keep the test synchronized inside worker_pool lock. WORKER_DIE is first set while holding pool->lock, the lock is dropped and kthread_stop() is called. Unfortunately, this means that there's a slight chance that the target kworker may see WORKER_DIE before kthread_stop() finishes and exits and frees the target task before or during kthread_stop(). Fix it by pinning the target task before setting WORKER_DIE and putting it after kthread_stop() is done. tj: Improved patch description and comment. Moved pinning above WORKER_DIE for better signify what it's protecting. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2013-12-04workqueue: fix ordered workqueues in NUMA setupsTejun Heo
commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream. An ordered workqueue implements execution ordering by using single pool_workqueue with max_active == 1. On a given pool_workqueue, work items are processed in FIFO order and limiting max_active to 1 enforces the queued work items to be processed one by one. Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues") accidentally broke this guarantee by applying NUMA affinity to ordered workqueues too. On NUMA setups, an ordered workqueue would end up with separate pool_workqueues for different nodes. Each pool_workqueue still limits max_active to 1 but multiple work items may be executed concurrently and out of order depending on which node they are queued to. Fix it by using dedicated ordered_wq_attrs[] when creating ordered workqueues. The new attrs match the unbound ones except that no_numa is always set thus forcing all NUMA nodes to share the default pool_workqueue. While at it, add sanity check in workqueue creation path which verifies that an ordered workqueues has only the default pool_workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Libin <huawei.libin@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...
2013-09-04Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Nothing interesting. All are doc / comment updates" * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Correct/Drop references to gcwq in Documentation workqueue: Fix manage_workers() RETURNS description workqueue: Comment correction in file header workqueue: mark WQ_NON_REENTRANT deprecated
2013-09-03Merge tag 'driver-core-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers" * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits) firmware loader: fix pending_fw_head list corruption drivers/base/memory.c: introduce help macro to_memory_block dynamic debug: line queries failing due to uninitialized local variable sysfs: sysfs_create_groups returns a value. debugfs: provide debugfs_create_x64() when disabled rbd: convert bus code to use bus_groups firmware: dcdbas: use binary attribute groups sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled driver core: add #include <linux/sysfs.h> to core files. HID: convert bus code to use dev_groups Input: serio: convert bus code to use drv_groups Input: gameport: convert bus code to use drv_groups driver core: firmware: use __ATTR_RW() driver core: core: use DEVICE_ATTR_RO driver core: bus: use DRIVER_ATTR_WO() driver core: create write-only attribute macros for devices and drivers sysfs: create __ATTR_WO() driver-core: platform: convert bus code to use dev_groups workqueue: convert bus code to use dev_groups MEI: convert bus code to use dev_groups ...
2013-08-29workqueue: cond_resched() after processing each work itemTejun Heo
If !PREEMPT, a kworker running work items back to back can hog CPU. This becomes dangerous when a self-requeueing work item which is waiting for something to happen races against stop_machine. Such self-requeueing work item would requeue itself indefinitely hogging the kworker and CPU it's running on while stop_machine would wait for that CPU to enter stop_machine while preventing anything else from happening on all other CPUs. The two would deadlock. Jamie Liu reports that this deadlock scenario exists around scsi_requeue_run_queue() and libata port multiplier support, where one port may exclude command processing from other ports. With the right timing, scsi_requeue_run_queue() can end up requeueing itself trying to execute an IO which is asked to be retried while another device has an exclusive access, which in turn can't make forward progress due to stop_machine. Fix it by invoking cond_resched() after executing each work item. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jamie Liu <jamieliu@google.com> References: http://thread.gmane.org/gmane.linux.kernel/1552567 Cc: stable@vger.kernel.org -- kernel/workqueue.c | 9 +++++++++ 1 file changed, 9 insertions(+)
2013-08-23workqueue: convert bus code to use dev_groupsGreg Kroah-Hartman
The dev_attrs field of struct bus_type is going away soon, dev_groups should be used instead. This converts the workqueue bus code to use the correct field. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21workqueue: Fix manage_workers() RETURNS descriptionLibin
No functional change. The comment of function manage_workers() RETURNS description is obvious wrong, same as the CONTEXT. Fix it. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-21workqueue: Comment correction in file headerLibin
No functional change. There are two worker pools for each cpu in current implementation (one for normal work items and the other for high priority ones). tj: Whitespace adjustments. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-20workqueue: fix some scripts/kernel-doc warningsYacine Belkadi
When building the htmldocs (in verbose mode), scripts/kernel-doc reports the following type of warnings: Warning(kernel/workqueue.c:653): No description found for return value of 'get_work_pool' Fix them by: - Using "Return:" sections to introduce descriptions of return values - Adding some missing descriptions Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-06Merge branch 'for-3.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull two workqueue fixes from Tejun Heo: "A lockdep notation update so that nested work_on_cpu() invocations don't lead to spurious lockdep warnings and fix for an unbound attr bug which made what's shown in sysfs deviate from the actual ones. Both patches have pretty limited scope" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: copy workqueue_attrs with all fields workqueue: allow work_on_cpu() to be called recursively
2013-08-01workqueue: copy workqueue_attrs with all fieldsShaohua Li
$echo '0' > /sys/bus/workqueue/devices/xxx/numa $cat /sys/bus/workqueue/devices/xxx/numa I got 1. It should be 0, the reason is copy_workqueue_attrs() called in apply_workqueue_attrs() doesn't copy no_numa field. Fix it by making copy_workqueue_attrs() copy ->no_numa too. This would also make get_unbound_pool() set a pool's ->no_numa attribute according to the workqueue attributes used when the pool was created. While harmelss, as ->no_numa isn't a pool attribute, this is a bit confusing. Clear it explicitly. tj: Updated description and comments a bit. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2013-07-24workqueue: allow work_on_cpu() to be called recursivelyLai Jiangshan
If the @fn call work_on_cpu() again, the lockdep will complain: > [ INFO: possible recursive locking detected ] > 3.11.0-rc1-lockdep-fix-a #6 Not tainted > --------------------------------------------- > kworker/0:1/142 is trying to acquire lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0 > > but task is already holding lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock((&wfc.work)); > lock((&wfc.work)); > > *** DEADLOCK *** It is false-positive lockdep report. In this sutiation, the two "wfc"s of the two work_on_cpu() are different, they are both on stack. flush_work() can't be deadlock. To fix this, we need to avoid the lockdep checking in this case, thus we instroduce a internal __flush_work() which skip the lockdep. tj: Minor comment adjustment. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-14kernel: delete __cpuinit usage from all core kernel filesPaul Gortmaker
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-03Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "Surprisingly, Lai and I didn't break too many things implementing custom pools and stuff last time around and there aren't any follow-up changes necessary at this point. The only change in this pull request is Viresh's patches to make some per-cpu workqueues to behave as unbound workqueues dependent on a boot param whose default can be configured via a config option. This leads to higher processing overhead / lower bandwidth as more work items are bounced across CPUs; however, it can lead to noticeable powersave in certain configurations - ~10% w/ idlish constant workload on a big.LITTLE configuration according to Viresh. This is because per-cpu workqueues interfere with how the scheduler perceives whether or not each CPU is idle by forcing pinned tasks on them, which makes the scheduler's power-aware scheduling decisions less effective. Its effectiveness is likely less pronounced on homogenous configurations and this type of optimization can probably be made automatic; however, the changes are pretty minimal and the affected workqueues are clearly marked, so it's an easy gain for some configurations for the time being with pretty unintrusive changes." * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: fbcon: queue work on power efficient wq block: queue work on power efficient wq PHYLIB: queue work on system_power_efficient_wq workqueue: Add system wide power_efficient workqueues workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
2013-05-15workqueue: don't perform NUMA-aware allocations on offline nodes in ↵Tejun Heo
wq_numa_init() wq_numa_init() builds per-node cpumasks which are later used to make unbound workqueues NUMA-aware. The cpumasks are allocated using alloc_cpumask_var_node() for all possible nodes. Unfortunately, on machines with off-line nodes, this leads to NUMA-aware allocations on existing bug offline nodes, which in turn triggers BUG in the memory allocation code. Fix it by using NUMA_NO_NODE for cpumask allocations for offline nodes. kernel BUG at include/linux/gfp.h:323! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1 Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011 task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000 RIP: 0010:[<ffffffff8117495d>] [<ffffffff8117495d>] new_slab+0x2ad/0x340 RSP: 0000:ffff880234603bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0 RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0 RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001 R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0 FS: 0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0 ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1 ffff880237816140 0000000000000000 ffff88023740e1c0 Call Trace: [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2 [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200 [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90 [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be [<ffffffff81a0bec8>] init_workqueues+0x64/0x341 [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0 [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec [<ffffffff815d50de>] kernel_init+0xe/0xf0 [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0 Code: 45 84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>
2013-05-14workqueue: Make schedule_work() available again to non GPL modulesMarc Dionne
Commit 8425e3d5bdbe ("workqueue: inline trivial wrappers") changed schedule_work() and schedule_delayed_work() to inline wrappers, but these rely on some symbols that are EXPORT_SYMBOL_GPL, while the original functions were EXPORT_SYMBOL. This has the effect of changing the licensing requirement for these functions and making them unavailable to non GPL modules. Make them available again by removing the restriction on the required symbols. Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14workqueue: correct handling of the pool spin_lockJoonsoo Kim
When we fail to mutex_trylock(), we release the pool spin_lock and do mutex_lock(). After that, we should regrab the pool spin_lock, but, regrabbing is missed in current code. So correct it. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14workqueue: Add system wide power_efficient workqueuesViresh Kumar
This patch adds system wide workqueues aligned towards power saving. This is done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to 'true'. tj: updated comments a bit. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueuesViresh Kumar
Workqueues can be performance or power-oriented. Currently, most workqueues are bound to the CPU they were created on. This gives good performance (due to cache effects) at the cost of potentially waking up otherwise idle cores (Idle from scheduler's perspective. Which may or may not be physically idle) just to process some work. To save power, we can allow the work to be rescheduled on a core that is already awake. Workqueues created with the WQ_UNBOUND flag will allow some power savings. However, we don't change the default behaviour of the system. To enable power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also be overridden by the workqueue.power_efficient boot parameter. tj: Updated config description and comments. Renamed CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-10workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into ↵Tejun Heo
node number df2d5ae499 ("workqueue: map an unbound workqueues to multiple per-node pool_workqueues") made unbound workqueues to map to multiple per-node pool_workqueues and accordingly updated workqueue_contested() so that, for unbound workqueues, it maps the specified @cpu to the NUMA node number to obtain the matching pool_workqueue to query the congested state. Before this change, workqueue_congested() ignored @cpu for unbound workqueues as there was only one pool_workqueue and some users (fscache) called it with WORK_CPU_UNBOUND. After the commit, this causes the following oops as WORK_CPU_UNBOUND gets translated to garbage by cpu_to_node(). BUG: unable to handle kernel paging request at ffff8803598d98b8 IP: [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa PGD 2421067 PUD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 2689 Comm: cat Tainted: GF 3.9.0-fsdevel+ #4 task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000 RIP: 0010:[<ffffffff81043b7e>] [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa RSP: 0018:ffff880025807ad8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003 RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0 RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898 R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97 R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0 FS: 00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0 ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8 Call Trace: [<ffffffff810442ce>] workqueue_congested+0xb7/0x12c [<ffffffffa00810b0>] fscache_enqueue_object+0xb2/0xe8 [fscache] [<ffffffffa007facd>] __fscache_acquire_cookie+0x3b9/0x56c [fscache] [<ffffffffa00ad8fe>] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs] [<ffffffffa009e112>] do_open+0x9/0xd [nfs] [<ffffffff810e804a>] do_dentry_open+0x175/0x24b [<ffffffff810e8298>] finish_open+0x41/0x51 Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Howells <dhowells@redhat.com> Tested-and-Acked-by: David Howells <dhowells@redhat.com>
2013-05-01workqueue: include workqueue info when printing debug dump of a worker taskTejun Heo
One of the problems that arise when converting dedicated custom threadpool to workqueue is that the shared worker pool used by workqueue anonimizes each worker making it more difficult to identify what the worker was doing on which target from the output of sysrq-t or debug dump from oops, BUG() and friends. This patch implements set_worker_desc() which can be called from any workqueue work function to set its description. When the worker task is dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion and so on - the description will be printed out together with the workqueue name and the worker function pointer. The printing side is implemented by print_worker_info() which is called from functions in task dump paths - sched_show_task() and dump_stack_print_info(). print_worker_info() can be safely called on any task in any state as long as the task struct itself is accessible. It uses probe_*() functions to access worker fields. It may print garbage if something went very wrong, but it wouldn't cause (another) oops. The description is currently limited to 24bytes including the terminating \0. worker->desc_valid and workder->desc[] are added and the 64 bytes marker which was already incorrect before adding the new fields is moved to the correct position. Here's an example dump with writeback updated to set the bdi name as worker desc. Hardware name: Bochs Modules linked in: Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1 Workqueue: writeback bdi_writeback_workfn (flush-8:0) ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8 ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0 ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08 Call Trace: [<ffffffff81c61845>] dump_stack+0x19/0x1b [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0 ... Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-09workqueue: use kmem_cache_free() instead of kfree()Wei Yongjun
memory allocated by kmem_cache_alloc() should be freed using kmem_cache_free(), not kfree(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-04workqueue: avoid false negative WARN_ON() in destroy_workqueue()Lai Jiangshan
destroy_workqueue() performs several sanity checks before proceeding with destruction of a workqueue. One of the checks verifies that refcnt of each pwq (pool_workqueue) is over 1 as at that point there should be no in-flight work items and the only holder of pwq refs is the workqueue itself. This worked fine as a workqueue used to hold only one reference to its pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues"), a workqueue may hold multiple references to its default pwq triggering this sanity check spuriously. Fix it by not triggering the pwq->refcnt assertion on default pwqs. An example spurious WARN trigger follows. WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e() Hardware name: 4286C12 Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29 Call Trace: [<c04314a7>] warn_slowpath_common+0x7c/0x93 [<c04314e0>] warn_slowpath_null+0x22/0x24 [<c044796a>] destroy_workqueue+0x6a/0x13e [<c056dc01>] ext4_put_super+0x43/0x2c4 [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9 [<c04fb848>] kill_block_super+0x22/0x60 [<c04fb960>] deactivate_locked_super+0x2f/0x56 [<c04fc41b>] deactivate_super+0x2e/0x31 [<c050f1e6>] mntput_no_expire+0x103/0x108 [<c050fdce>] sys_umount+0x2a2/0x2c4 [<c050fe0e>] sys_oldumount+0x1e/0x20 [<c085ba4d>] sysenter_do_call+0x12/0x38 tj: Rewrote description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-04-02Merge tag 'v3.9-rc5' into wq/for-3.10Tejun Heo
Writeback conversion to workqueue will be based on top of wq/for-3.10 branch to take advantage of custom attrs and NUMA support for unbound workqueues. Mainline currently contains two commits which result in non-trivial merge conflicts with wq/for-3.10 and because block/for-3.10/core is based on v3.9-rc3 which contains one of the conflicting commits, we need a pre-merge-window merge anyway. Let's pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer from workqueue merge conflicts. The two conflicts and their resolutions: * e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes worker_pool_assign_id() to use idr_alloc() instead of the old idr interface. worker_pool_assign_id() goes through multiple locking changes in wq/for-3.10 causing the following conflict. static int worker_pool_assign_id(struct worker_pool *pool) { int ret; <<<<<<< HEAD lockdep_assert_held(&wq_pool_mutex); do { if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL)) return -ENOMEM; ret = idr_get_new(&worker_pool_idr, pool, &pool->id); } while (ret == -EAGAIN); ======= mutex_lock(&worker_pool_idr_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) pool->id = ret; mutex_unlock(&worker_pool_idr_mutex); >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 return ret < 0 ? ret : 0; } We want locking from the former and idr_alloc() usage from the latter, which can be combined to the following. static int worker_pool_assign_id(struct worker_pool *pool) { int ret; lockdep_assert_held(&wq_pool_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) { pool->id = ret; return 0; } return ret; } * eb2834285c ("workqueue: fix possible pool stall bug in wq_unbind_fn()") updated wq_unbind_fn() such that it has single larger for_each_std_worker_pool() loop instead of two separate loops with a schedule() call inbetween. wq/for-3.10 renamed pool->assoc_mutex to pool->manager_mutex causing the following conflict (earlier function body and comments omitted for brevity). static void wq_unbind_fn(struct work_struct *work) { ... spin_unlock_irq(&pool->lock); <<<<<<< HEAD mutex_unlock(&pool->manager_mutex); } ======= mutex_unlock(&pool->assoc_mutex); >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 schedule(); <<<<<<< HEAD for_each_cpu_worker_pool(pool, cpu) ======= >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } The resolution is mostly trivial. We want the control flow of the latter with the rename of the former. static void wq_unbind_fn(struct work_struct *work) { ... spin_unlock_irq(&pool->lock); mutex_unlock(&pool->manager_mutex); schedule(); atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01workqueue: update sysfs interface to reflect NUMA awareness and a kernel ↵Tejun Heo
param to disable NUMA affinity Unbound workqueues are now NUMA aware. Let's add some control knobs and update sysfs interface accordingly. * Add kernel param workqueue.numa_disable which disables NUMA affinity globally. * Replace sysfs file "pool_id" with "pool_ids" which contain node:pool_id pairs. This change is userland-visible but "pool_id" hasn't seen a release yet, so this is okay. * Add a new sysf files "numa" which can toggle NUMA affinity on individual workqueues. This is implemented as attrs->no_numa whichn is special in that it isn't part of a pool's attributes. It only affects how apply_workqueue_attrs() picks which pools to use. After "pool_ids" change, first_pwq() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: implement NUMA affinity for unbound workqueuesTejun Heo
Currently, an unbound workqueue has single current, or first, pwq (pool_workqueue) to which all new work items are queued. This often isn't optimal on NUMA machines as workers may jump around across node boundaries and work items get assigned to workers without any regard to NUMA affinity. This patch implements NUMA affinity for unbound workqueues. Instead of mapping all entries of numa_pwq_tbl[] to the same pwq, apply_workqueue_attrs() now creates a separate pwq covering the intersecting CPUs for each NUMA node which has online CPUs in @attrs->cpumask. Nodes which don't have intersecting possible CPUs are mapped to pwqs covering whole @attrs->cpumask. As CPUs come up and go down, the pool association is changed accordingly. Changing pool association may involve allocating new pools which may fail. To avoid failing CPU_DOWN, each workqueue always keeps a default pwq which covers whole attrs->cpumask which is used as fallback if pool creation fails during a CPU hotplug operation. This ensures that all work items issued on a NUMA node is executed on the same node as long as the workqueue allows execution on the CPUs of the node. As this maps a workqueue to multiple pwqs and max_active is per-pwq, this change the behavior of max_active. The limit is now per NUMA node instead of global. While this is an actual change, max_active is already per-cpu for per-cpu workqueues and primarily used as safety mechanism rather than for active concurrency control. Concurrency is usually limited from workqueue users by the number of concurrently active work items and this change shouldn't matter much. v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted by Lai. v3: The previous version incorrectly made a workqueue spanning multiple nodes spread work items over all online CPUs when some of its nodes don't have any desired cpus. Reimplemented so that NUMA affinity is properly updated as CPUs go up and down. This problem was spotted by Lai Jiangshan. v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it; however, wq may be freed at any time after dfl_pwq is put making the clearing use-after-free. Clear wq->dfl_pwq before putting it. v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and @pwq_tbl after success. Fixed. Retry loop in wq_update_unbound_numa_attrs() isn't necessary as application of new attrs is excluded via CPU hotplug. Removed. Documentation on CPU affinity guarantee on CPU_DOWN added. All changes are suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: introduce put_pwq_unlocked()Tejun Heo
Factor out lock pool, put_pwq(), unlock sequence into put_pwq_unlocked(). The two existing places are converted and there will be more with NUMA affinity support. This is to prepare for NUMA affinity support for unbound workqueues and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: introduce numa_pwq_tbl_install()Tejun Heo
Factor out pool_workqueue linking and installation into numa_pwq_tbl[] from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq() is made safe to call multiple times. numa_pwq_tbl_install() links the pwq, installs it into numa_pwq_tbl[] at the specified node and returns the old entry. @last_pwq is removed from link_pwq() as the return value of the new function can be used instead. This is to prepare for NUMA affinity support for unbound workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: use NUMA-aware allocation for pool_workqueuesTejun Heo
Use kmem_cache_alloc_node() with @pool->node instead of kmem_cache_zalloc() when allocating a pool_workqueue so that it's allocated on the same node as the associated worker_pool. As there's no no kmem_cache_zalloc_node(), move zeroing to init_pwq(). This was suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: break init_and_link_pwq() into two functions and introduce ↵Tejun Heo
alloc_unbound_pwq() Break init_and_link_pwq() into init_pwq() and link_pwq() and move unbound-workqueue specific handling into apply_workqueue_attrs(). Also, factor out unbound pool and pool_workqueue allocation into alloc_unbound_pwq(). This reorganization is to prepare for NUMA affinity and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: map an unbound workqueues to multiple per-node pool_workqueuesTejun Heo
Currently, an unbound workqueue has only one "current" pool_workqueue associated with it. It may have multple pool_workqueues but only the first pool_workqueue servies new work items. For NUMA affinity, we want to change this so that there are multiple current pool_workqueues serving different NUMA nodes. Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and points to the pool_workqueue to use for each possible node. This replaces first_pwq() in __queue_work() and workqueue_congested(). numa_pwq_tbl[] is currently initialized to point to the same pool_workqueue as first_pwq() so this patch doesn't make any behavior changes. v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function may be called only with wq->mutex held. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: move hot fields of workqueue_struct to the endTejun Heo
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align them to the cacheline. These two fields are used in the work item issue path and thus hot. The scheduled NUMA affinity support will add dispatch table at the end of workqueue_struct and relocating these two fields will allow us hitting only single cacheline on hot paths. Note that wq->pwqs isn't moved although it currently is being used in the work item issue path for unbound workqueues. The dispatch table mentioned above will replace its use in the issue path, so it will become cold once NUMA support is implemented. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: make workqueue->name[] fixed lenTejun Heo
Currently workqueue->name[] is of flexible length. We want to use the flexible field for something more useful and there isn't much benefit in allowing arbitrary name length anyway. Make it fixed len capping at 24 bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: add workqueue->unbound_attrsTejun Heo
Currently, when exposing attrs of an unbound workqueue via sysfs, the workqueue_attrs of first_pwq() is used as that should equal the current state of the workqueue. The planned NUMA affinity support will make unbound workqueues make use of multiple pool_workqueues for different NUMA nodes and the above assumption will no longer hold. Introduce workqueue->unbound_attrs which records the current attrs in effect and use it for sysfs instead of first_pwq()->attrs. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>