summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-02-13fs: dcache: Use cpu_chill() in trylock loopsThomas Gleixner
Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2015-02-13epoll.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13fs: jbd2: pull your plug when waiting for spaceSebastian Andrzej Siewior
Two cps in parallel managed to stall the the ext4 fs. It seems that journal code is either waiting for locks or sleeping waiting for something to happen. This seems similar to what Mike observed on ext3, here is his description: |With an -rt kernel, and a heavy sync IO load, tasks can jam |up on journal locks without unplugging, which can lead to |terminal IO starvation. Unplug and schedule when waiting |for space. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-02-13fs, jbd: pull your plug when waiting for spaceMike Galbraith
With an -rt kernel, and a heavy sync IO load, tasks can jam up on journal locks without unplugging, which can lead to terminal IO starvation. Unplug and schedule when waiting for space. Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Theodore Tso <tytso@mit.edu> Link: http://lkml.kernel.org/r/1341812414.7370.73.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13fs: ntfs: disable interrupt only on !RTMike Galbraith
On Sat, 2007-10-27 at 11:44 +0200, Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > [10138.175796] [<c0105de3>] show_trace+0x12/0x14 > > > [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18 > > > [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d > > > [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24 > > > [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50 > > > [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20 > > > [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417 > > > [10138.210118] [<c011ec88>] kmap+0x4d/0x4f > > > [10138.214102] [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9 > > > [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f > > > [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d > > > [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac > > > [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa > > > [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a > > > [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac > > > [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0 > > > [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9 > > > > Looks like ntfs is kmap()ing from interrupt context. Should be using > > kmap_atomic instead, I think. > > it's not atomic interrupt context but irq thread context - and -rt > remaps kmap_atomic() to kmap() internally. Hm. Looking at the change to mm/bounce.c, perhaps I should do this instead? Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13mm: Protect activate_mm() by preempt_[disable&enable]_rt()Yong Zhang
User preempt_*_rt instead of local_irq_*_rt or otherwise there will be warning on ARM like below: WARNING: at build/linux/kernel/smp.c:459 smp_call_function_many+0x98/0x264() Modules linked in: [<c0013bb4>] (unwind_backtrace+0x0/0xe4) from [<c001be94>] (warn_slowpath_common+0x4c/0x64) [<c001be94>] (warn_slowpath_common+0x4c/0x64) from [<c001bec4>] (warn_slowpath_null+0x18/0x1c) [<c001bec4>] (warn_slowpath_null+0x18/0x1c) from [<c0053ff8>](smp_call_function_many+0x98/0x264) [<c0053ff8>] (smp_call_function_many+0x98/0x264) from [<c0054364>] (smp_call_function+0x44/0x6c) [<c0054364>] (smp_call_function+0x44/0x6c) from [<c0017d50>] (__new_context+0xbc/0x124) [<c0017d50>] (__new_context+0xbc/0x124) from [<c009e49c>] (flush_old_exec+0x460/0x5e4) [<c009e49c>] (flush_old_exec+0x460/0x5e4) from [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) from [<c009d060>] (search_binary_handler+0x94/0x2a4) [<c009d060>] (search_binary_handler+0x94/0x2a4) from [<c009e8fc>] (do_execve+0x254/0x364) [<c009e8fc>] (do_execve+0x254/0x364) from [<c0010e84>] (sys_execve+0x34/0x54) [<c0010e84>] (sys_execve+0x34/0x54) from [<c000da00>] (ret_fast_syscall+0x0/0x30) ---[ end trace 0000000000000002 ]--- The reason is that ARM need irq enabled when doing activate_mm(). According to mm-protect-activate-switch-mm.patch, actually preempt_[disable|enable]_rt() is sufficient. Inspired-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1337061236-1766-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13fs: namespace preemption fixThomas Gleixner
On RT we cannot loop with preemption disabled here as mnt_make_readonly() might have been preempted. We can safely enable preemption while waiting for MNT_WRITE_HOLD to be cleared. Safe on !RT as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13timer-fd: Prevent live lockThomas Gleixner
If hrtimer_try_to_cancel() requires a retry, then depending on the priority setting te retry loop might prevent timer callback completion on RT. Prevent that by waiting for completion on RT, no change for a non RT kernel. Reported-by: Sankara Muthukrishnan <sankara.m@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2015-02-13buffer_head: Replace bh_uptodate_lock for -rtThomas Gleixner
Wrap the bit_spin_lock calls into a separate inline and add the RT replacements with a real spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13Reset to 3.12.37Scott Wood
2014-05-14fs: dcache: Use cpu_chill() in trylock loopsThomas Gleixner
Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-14epoll.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14fs: jbd2: pull your plug when waiting for spaceSebastian Andrzej Siewior
Two cps in parallel managed to stall the the ext4 fs. It seems that journal code is either waiting for locks or sleeping waiting for something to happen. This seems similar to what Mike observed on ext3, here is his description: |With an -rt kernel, and a heavy sync IO load, tasks can jam |up on journal locks without unplugging, which can lead to |terminal IO starvation. Unplug and schedule when waiting |for space. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-14fs, jbd: pull your plug when waiting for spaceMike Galbraith
With an -rt kernel, and a heavy sync IO load, tasks can jam up on journal locks without unplugging, which can lead to terminal IO starvation. Unplug and schedule when waiting for space. Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Theodore Tso <tytso@mit.edu> Link: http://lkml.kernel.org/r/1341812414.7370.73.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14fs: ntfs: disable interrupt only on !RTMike Galbraith
On Sat, 2007-10-27 at 11:44 +0200, Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > [10138.175796] [<c0105de3>] show_trace+0x12/0x14 > > > [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18 > > > [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d > > > [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24 > > > [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50 > > > [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20 > > > [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417 > > > [10138.210118] [<c011ec88>] kmap+0x4d/0x4f > > > [10138.214102] [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9 > > > [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f > > > [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d > > > [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac > > > [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa > > > [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a > > > [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac > > > [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0 > > > [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9 > > > > Looks like ntfs is kmap()ing from interrupt context. Should be using > > kmap_atomic instead, I think. > > it's not atomic interrupt context but irq thread context - and -rt > remaps kmap_atomic() to kmap() internally. Hm. Looking at the change to mm/bounce.c, perhaps I should do this instead? Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14mm: Protect activate_mm() by preempt_[disable&enable]_rt()Yong Zhang
User preempt_*_rt instead of local_irq_*_rt or otherwise there will be warning on ARM like below: WARNING: at build/linux/kernel/smp.c:459 smp_call_function_many+0x98/0x264() Modules linked in: [<c0013bb4>] (unwind_backtrace+0x0/0xe4) from [<c001be94>] (warn_slowpath_common+0x4c/0x64) [<c001be94>] (warn_slowpath_common+0x4c/0x64) from [<c001bec4>] (warn_slowpath_null+0x18/0x1c) [<c001bec4>] (warn_slowpath_null+0x18/0x1c) from [<c0053ff8>](smp_call_function_many+0x98/0x264) [<c0053ff8>] (smp_call_function_many+0x98/0x264) from [<c0054364>] (smp_call_function+0x44/0x6c) [<c0054364>] (smp_call_function+0x44/0x6c) from [<c0017d50>] (__new_context+0xbc/0x124) [<c0017d50>] (__new_context+0xbc/0x124) from [<c009e49c>] (flush_old_exec+0x460/0x5e4) [<c009e49c>] (flush_old_exec+0x460/0x5e4) from [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) from [<c009d060>] (search_binary_handler+0x94/0x2a4) [<c009d060>] (search_binary_handler+0x94/0x2a4) from [<c009e8fc>] (do_execve+0x254/0x364) [<c009e8fc>] (do_execve+0x254/0x364) from [<c0010e84>] (sys_execve+0x34/0x54) [<c0010e84>] (sys_execve+0x34/0x54) from [<c000da00>] (ret_fast_syscall+0x0/0x30) ---[ end trace 0000000000000002 ]--- The reason is that ARM need irq enabled when doing activate_mm(). According to mm-protect-activate-switch-mm.patch, actually preempt_[disable|enable]_rt() is sufficient. Inspired-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1337061236-1766-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14fs: namespace preemption fixThomas Gleixner
On RT we cannot loop with preemption disabled here as mnt_make_readonly() might have been preempted. We can safely enable preemption while waiting for MNT_WRITE_HOLD to be cleared. Safe on !RT as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14timer-fd: Prevent live lockThomas Gleixner
If hrtimer_try_to_cancel() requires a retry, then depending on the priority setting te retry loop might prevent timer callback completion on RT. Prevent that by waiting for completion on RT, no change for a non RT kernel. Reported-by: Sankara Muthukrishnan <sankara.m@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-14buffer_head: Replace bh_uptodate_lock for -rtThomas Gleixner
Wrap the bit_spin_lock calls into a separate inline and add the RT replacements with a real spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14Reset to 3.12.19Scott Wood
2014-04-10fs: dcache: Use cpu_chill() in trylock loopsThomas Gleixner
Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-04-10epoll.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10fs: jbd2: pull your plug when waiting for spaceSebastian Andrzej Siewior
Two cps in parallel managed to stall the the ext4 fs. It seems that journal code is either waiting for locks or sleeping waiting for something to happen. This seems similar to what Mike observed on ext3, here is his description: |With an -rt kernel, and a heavy sync IO load, tasks can jam |up on journal locks without unplugging, which can lead to |terminal IO starvation. Unplug and schedule when waiting |for space. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-04-10fs, jbd: pull your plug when waiting for spaceMike Galbraith
With an -rt kernel, and a heavy sync IO load, tasks can jam up on journal locks without unplugging, which can lead to terminal IO starvation. Unplug and schedule when waiting for space. Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Theodore Tso <tytso@mit.edu> Link: http://lkml.kernel.org/r/1341812414.7370.73.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10fs: ntfs: disable interrupt only on !RTMike Galbraith
On Sat, 2007-10-27 at 11:44 +0200, Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > [10138.175796] [<c0105de3>] show_trace+0x12/0x14 > > > [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18 > > > [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d > > > [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24 > > > [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50 > > > [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20 > > > [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417 > > > [10138.210118] [<c011ec88>] kmap+0x4d/0x4f > > > [10138.214102] [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9 > > > [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f > > > [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d > > > [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac > > > [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa > > > [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a > > > [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac > > > [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0 > > > [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9 > > > > Looks like ntfs is kmap()ing from interrupt context. Should be using > > kmap_atomic instead, I think. > > it's not atomic interrupt context but irq thread context - and -rt > remaps kmap_atomic() to kmap() internally. Hm. Looking at the change to mm/bounce.c, perhaps I should do this instead? Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10mm: Protect activate_mm() by preempt_[disable&enable]_rt()Yong Zhang
User preempt_*_rt instead of local_irq_*_rt or otherwise there will be warning on ARM like below: WARNING: at build/linux/kernel/smp.c:459 smp_call_function_many+0x98/0x264() Modules linked in: [<c0013bb4>] (unwind_backtrace+0x0/0xe4) from [<c001be94>] (warn_slowpath_common+0x4c/0x64) [<c001be94>] (warn_slowpath_common+0x4c/0x64) from [<c001bec4>] (warn_slowpath_null+0x18/0x1c) [<c001bec4>] (warn_slowpath_null+0x18/0x1c) from [<c0053ff8>](smp_call_function_many+0x98/0x264) [<c0053ff8>] (smp_call_function_many+0x98/0x264) from [<c0054364>] (smp_call_function+0x44/0x6c) [<c0054364>] (smp_call_function+0x44/0x6c) from [<c0017d50>] (__new_context+0xbc/0x124) [<c0017d50>] (__new_context+0xbc/0x124) from [<c009e49c>] (flush_old_exec+0x460/0x5e4) [<c009e49c>] (flush_old_exec+0x460/0x5e4) from [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) from [<c009d060>] (search_binary_handler+0x94/0x2a4) [<c009d060>] (search_binary_handler+0x94/0x2a4) from [<c009e8fc>] (do_execve+0x254/0x364) [<c009e8fc>] (do_execve+0x254/0x364) from [<c0010e84>] (sys_execve+0x34/0x54) [<c0010e84>] (sys_execve+0x34/0x54) from [<c000da00>] (ret_fast_syscall+0x0/0x30) ---[ end trace 0000000000000002 ]--- The reason is that ARM need irq enabled when doing activate_mm(). According to mm-protect-activate-switch-mm.patch, actually preempt_[disable|enable]_rt() is sufficient. Inspired-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1337061236-1766-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10fs: namespace preemption fixThomas Gleixner
On RT we cannot loop with preemption disabled here as mnt_make_readonly() might have been preempted. We can safely enable preemption while waiting for MNT_WRITE_HOLD to be cleared. Safe on !RT as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10timer-fd: Prevent live lockThomas Gleixner
If hrtimer_try_to_cancel() requires a retry, then depending on the priority setting te retry loop might prevent timer callback completion on RT. Prevent that by waiting for completion on RT, no change for a non RT kernel. Reported-by: Sankara Muthukrishnan <sankara.m@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-04-10buffer_head: Replace bh_uptodate_lock for -rtThomas Gleixner
Wrap the bit_spin_lock calls into a separate inline and add the RT replacements with a real spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-24bio-integrity: Fix bio_integrity_verify segment start bugNicholas Bellinger
commit 5837c80e870bc3b12ac6a98cdc9ce7a9522a8fb6 upstream. This patch addresses a bug in bio_integrity_verify() code that has been causing DIF READ verify operations to be silently skipped. The issue is that bio->bi_idx will have been incremented within bio_advance() code in the normal blk_update_request() -> req_bio_endio() completion path, and bio_integrity_verify() is using bio_for_each_segment() which starts the bio segment walk at the current bio->bi_idx. So instead use bio_for_each_segment_all() to always start the bio segment walk from zero, regardless of the current bio->bi_idx value after bio_advance() has been called. (Context change for v3.10.y -> v3.13.y code - nab) Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22fs/proc/base.c: fix GPF in /proc/$PID/map_filesArtem Fetishev
commit 70335abb2689c8cd5df91bf2d95a65649addf50b upstream. The expected logic of proc_map_files_get_link() is either to return 0 and initialize 'path' or return an error and leave 'path' uninitialized. By the time dname_to_vma_addr() returns 0 the corresponding vma may have already be gone. In this case the path is not initialized but the return value is still 0. This results in 'general protection fault' inside d_path(). Steps to reproduce: CONFIG_CHECKPOINT_RESTORE=y fd = open(...); while (1) { mmap(fd, ...); munmap(fd, ...); } ls -la /proc/$PID/map_files Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991 Signed-off-by: Artem Fetishev <artem_fetishev@epam.com> Signed-off-by: Aleksandr Terekhov <aleksandr_terekhov@epam.com> Reported-by: <wiebittewas@gmail.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateidTrond Myklebust
commit e1253be0ece1a95a02c7f5843194877471af8179 upstream. When nfs4_set_rw_stateid() can fails by returning EIO to indicate that the stateid is completely invalid, then it makes no sense to have it trigger a retry of the READ or WRITE operation. Instead, we should just have it fall through and attempt a recovery. This fixes an infinite loop in which the client keeps replaying the same bad stateid back to the server. Reported-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22NFS: Fix a delegation callback raceTrond Myklebust
commit 755a48a7a4eb05b9c8424e3017d947b2961a60e0 upstream. The clean-up in commit 36281caa839f ended up removing a NULL pointer check that is needed in order to prevent an Oops in nfs_async_inode_return_delegation(). Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/5313E9F6.2020405@intel.com Fixes: 36281caa839f (NFSv4: Further clean-ups of delegation stateid validation) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22ocfs2 syncs the wrong range...Al Viro
commit 1b56e98990bcdbb20b9fab163654b9315bf158e8 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22ocfs2: fix quota file corruptionJan Kara
commit 15c34a760630ca2c803848fba90ca0646a9907dd upstream. Global quota files are accessed from different nodes. Thus we cannot cache offset of quota structure in the quota file after we drop our node reference count to it because after that moment quota structure may be freed and reallocated elsewhere by a different node resulting in corruption of quota file. Fix the problem by clearing dq_off when we are releasing dquot structure. We also remove the DB_READ_B handling because it is useless - DQ_ACTIVE_B is set iff DQ_READ_B is set. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-14Fix mountpoint reference leakage in linkatOleg Drokin
commit d22e6338db7f613dd4f6095c190682fcc519e4b7 upstream. Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] [js: the second path_put is not in 3.12 yet, hunk removed] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: do not take the nested ep->mtx on EPOLL_CTL_DELJason Baron
commit 4ff36ee94d93ddb4b7846177f1118d9aa33408e2 upstream. The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock. That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple topologies"). The acquistion of the ep->mtx for the destination 'ep' was added such that a concurrent EPOLL_CTL_ADD operation would see the correct state of the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links') However, by simply not acquiring the lock, we do not serialize behind the ep->mtx from the add path, and thus may perform a full path check when if we had waited a little longer it may not have been necessary. However, this is a transient state, and performing the full loop checking in this case is not harmful. The important point is that we wouldn't miss doing the full loop checking when required, since EPOLL_CTL_ADD always locks any 'ep's that its operating upon. The reason we don't need to do lock ordering in the add path, is that we are already are holding the global 'epmutex' whenever we do the double lock. Further, the original posting of this patch, which was tested for the intended performance gains, did not perform this additional locking. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: do not take global 'epmutex' for simple topologiesJason Baron
commit 67347fe4e6326338ee217d7eb826bedf30b2e155 upstream. When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we do not need to hold the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: "On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck." [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc] Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: optimize EPOLL_CTL_DEL using rcuJason Baron
commit ae10b2b4eb01bedc91d29d5c5bb9e416fd806c40 upstream. Nathan Zimmer found that once we get over 10+ cpus, the scalability of SPECjbb falls over due to the contention on the global 'epmutex', which is taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals. Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies. IE when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested. This patch (of 2): Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. This patch in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> CC: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05quota: Fix race between dqput() and dquot_scan_active()Jan Kara
commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream. Currently last dqput() can race with dquot_scan_active() causing it to call callback for an already deactivated dquot. The race is as follows: CPU1 CPU2 dqput() spin_lock(&dq_list_lock); if (atomic_read(&dquot->dq_count) > 1) { - not taken if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); ->release_dquot(dquot); if (atomic_read(&dquot->dq_count) > 1) - not taken dquot_scan_active() spin_lock(&dq_list_lock); if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) - not taken atomic_inc(&dquot->dq_count); spin_unlock(&dq_list_lock); - proceeds to release dquot ret = fn(dquot, priv); - called for inactive dquot Fix the problem by making sure possible ->release_dquot() is finished by the time we call the callback and new calls to it will notice reference dquot_scan_active() has taken and bail out. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05fs: fix iversion handlingChristoph Hellwig
commit dff6efc326a4d5f305797d4a6bba14f374fdd633 upstream. Currently notify_change directly updates i_version for size updates, which not only is counter to how all other fields are updated through struct iattr, but also breaks XFS, which need inode updates to happen under its own lock, and synchronized to the structure that gets written to the log. Remove the update in the common code, and it to btrfs and ext4, XFS already does a proper updaste internally and currently gets a double update with the existing code. IMHO this is 3.13 and -stable material and should go in through the XFS tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05NFS fix error return in nfs4_select_rw_stateidAndy Adamson
commit 146d70caaa1b87f64597743429d7da4b8073d0c9 upstream. Do not return an error when nfs4_copy_delegation_stateid succeeds. Signed-off-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1392737765-41942-1-git-send-email-andros@netapp.com Fixes: ef1820f9be27b (NFSv4: Don't try to recover NFSv4 locks when...) Cc: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05CIFS: Fix too big maxBuf size for SMB3 mountsPavel Shilovsky
commit 2365c4eaf077c48574ab6f143960048fc0f31518 upstream. SMB3 servers can respond with MaxTransactSize of more than 4M that can cause a memory allocation error returned from kmalloc in a lock codepath. Also the client doesn't support multicredit requests now and allows buffer sizes of 65536 bytes only. Set MaxTransactSize to this maximum supported value. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05cifs: ensure that uncached writes handle unmapped areas correctlyJeff Layton
commit 5d81de8e8667da7135d3a32a964087c0faf5483f upstream. It's possible for userland to pass down an iovec via writev() that has a bogus user pointer in it. If that happens and we're doing an uncached write, then we can end up getting less bytes than we expect from the call to iov_iter_copy_from_user. This is CVE-2014-0069 cifs_iovec_write isn't set up to handle that situation however. It'll blindly keep chugging through the page array and not filling those pages with anything useful. Worse yet, we'll later end up with a negative number in wdata->tailsz, which will confuse the sending routines and cause an oops at the very least. Fix this by having the copy phase of cifs_iovec_write stop copying data in this situation and send the last write as a short one. At the same time, we want to avoid sending a zero-length write to the server, so break out of the loop and set rc to -EFAULT if that happens. This also allows us to handle the case where no address in the iovec is valid. [Note: Marking this for stable on v3.4+ kernels, but kernels as old as v2.6.38 may have a similar problem and may need similar fix] Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05jbd2: fix use after free in jbd2_journal_start_reserved()Dan Carpenter
commit 92e3b40537707001d17bbad800d150ab04e53bf4 upstream. If start_this_handle() fails then it leads to a use after free of "handle". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: don't leave i_crtime.tv_sec uninitializedTheodore Ts'o
commit 19ea80603715d473600cd993b9987bc97d042e02 upstream. If the i_crtime field is not present in the inode, don't leave the field uninitialized. Fixes: ef7f38359 ("ext4: Add nanosecond timestamps") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix online resize with a non-standard blocks per group settingTheodore Ts'o
commit 3d2660d0c9c2f296837078c189b68a47f6b2e3b5 upstream. The set_flexbg_block_bitmap() function assumed that the number of blocks in a blockgroup was sb->blocksize * 8, which is normally true, but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block bitmap corruption after: mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jon Bernard <jbernard@tuxion.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix online resize with very large inode tablesTheodore Ts'o
commit b93c95353413041a8cebad915a8109619f66bcc6 upstream. If a file system has a large number of inodes per block group, all of the metadata blocks in a flex_bg may be larger than what can fit in a single block group. Unfortunately, ext4_alloc_group_tables() in resize.c was never tested to see if it would handle this case correctly, and there were a large number of bugs which caused the following sequence to result in a BUG_ON: kernel bug at fs/ext4/resize.c:409! ... call trace: [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0 [<ffffffff811b9df2>] ? final_putname+0x22/0x50 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0 rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180 This can be reproduced with the following command sequence: mke2fs -t ext4 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G To fix this, we need to make sure the right thing happens when a block group's inode table straddles two block groups, which means the following bugs had to be fixed: 1) Not clearing the BLOCK_UNINIT flag in the second block group in ext4_alloc_group_tables --- the was proximate cause of the BUG_ON. 2) Incorrectly determining how many block groups contained contiguous free blocks in ext4_alloc_group_tables(). 3) Incorrectly setting the start of the next block range to be marked in use after a discontinuity in setup_new_flex_group_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: don't try to modify s_flags if the the file system is read-onlyTheodore Ts'o
commit 23301410972330c0ae9a8afc379ba2005e249cc6 upstream. If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix error paths in swap_inode_boot_loader()Zheng Liu
commit 30d29b119ef01776e0a301444ab24defe8d8bef3 upstream. In swap_inode_boot_loader() we forgot to release ->i_mutex and resume unlocked dio for inode and inode_bl if there is an error starting the journal handle. This commit fixes this issue. Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>