summaryrefslogtreecommitdiff
path: root/mm/backing-dev.c
AgeCommit message (Collapse)Author
2011-07-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits) mm: properly reflect task dirty limits in dirty_exceeded logic writeback: don't busy retry writeback on new/freeing inodes writeback: scale IO chunk size up to half device bandwidth writeback: trace global_dirty_state writeback: introduce max-pause and pass-good dirty limits writeback: introduce smoothed global dirty limit writeback: consolidate variable names in balance_dirty_pages() writeback: show bdi write bandwidth in debugfs writeback: bdi write bandwidth estimation writeback: account per-bdi accumulated written pages writeback: make writeback_control.nr_to_write straight writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr() writeback: trace event writeback_queue_io writeback: trace event writeback_single_inode writeback: remove .nonblocking and .encountered_congestion writeback: remove writeback_control.more_io writeback: skip balance_dirty_pages() for in-memory fs writeback: add bdi_dirty_limit() kernel-doc writeback: avoid extra sync work at enqueue time writeback: elevate queue_io() into wb_writeback() ... Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c
2011-07-26Merge 'akpm' patch seriesLinus Torvalds
* Merge akpm patch series: (122 commits) drivers/connector/cn_proc.c: remove unused local Documentation/SubmitChecklist: add RCU debug config options reiserfs: use hweight_long() reiserfs: use proper little-endian bitops pnpacpi: register disabled resources drivers/rtc/rtc-tegra.c: properly initialize spinlock drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time() drivers/rtc: add support for Qualcomm PMIC8xxx RTC drivers/rtc/rtc-s3c.c: support clock gating drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200 init: skip calibration delay if previously done misc/eeprom: add eeprom access driver for digsy_mtc board misc/eeprom: add driver for microwire 93xx46 EEPROMs checkpatch.pl: update $logFunctions checkpatch: make utf-8 test --strict checkpatch.pl: add ability to ignore various messages checkpatch: add a "prefer __aligned" check checkpatch: validate signature styles and To: and Cc: lines checkpatch: add __rcu as a sparse modifier checkpatch: suggest using min_t or max_t ... Did this as a merge because of (trivial) conflicts in - Documentation/feature-removal-schedule.txt - arch/xtensa/include/asm/uaccess.h that were just easier to fix up in the merge than in the patch series.
2011-07-26mm/backing-dev.c: reset bdi min_ratio in bdi_unregister()Peter Zijlstra
Vito said: : The system has many usb disks coming and going day to day, with their : respective bdi's having min_ratio set to 1 when inserted. It works for : some time until eventually min_ratio can no longer be set, even when the : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to : anywhere near 100. : : This then leads to an unrelated starvation problem caused by write-heavy : fuse mounts being used atop the usb disks, a problem the min_ratio setting : at the underlying devices bdi effectively prevents. Fix this leakage by resetting the bdi min_ratio when unregistering the BDI. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Vito Caputo <lkml@pengaru.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-23backing-dev: use synchronize_rcu_expedited instead of synchronize_rcuMikulas Patocka
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu synchronize_rcu sleeps several timer ticks. synchronize_rcu_expedited is much faster. With 100Hz timer frequency, when we remove 10000 block devices with "dmsetup remove_all" command, it takes 27 minutes. With this patch, removing 10000 block devices takes only 15 seconds. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-10writeback: show bdi write bandwidth in debugfsWu Fengguang
Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats. btw, increase digital field width to 10, for keeping the possibly huge BdiWritten number aligned at least for desktop systems. Impact: this could break user space tools if they are dumb enough to depend on the number of white spaces. CC: Theodore Ts'o <tytso@mit.edu> CC: Jan Kara <jack@suse.cz> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-10writeback: bdi write bandwidth estimationWu Fengguang
The estimation value will start from 100MB/s and adapt to the real bandwidth in seconds. It tries to update the bandwidth only when disk is fully utilized. Any inactive period of more than one second will be skipped. The estimated bandwidth will be reflecting how fast the device can writeout when _fully utilized_, and won't drop to 0 when it goes idle. The value will remain constant at disk idle time. At busy write time, if not considering fluctuations, it will also remain high unless be knocked down by possible concurrent reads that compete for the disk time and bandwidth with async writes. The estimation is not done purely in the flusher because there is no guarantee for write_cache_pages() to return timely to update bandwidth. The bdi->avg_write_bandwidth smoothing is very effective for filtering out sudden spikes, however may be a little biased in long term. The overheads are low because the bdi bandwidth update only occurs at 200ms intervals. The 200ms update interval is suitable, because it's not possible to get the real bandwidth for the instance at all, due to large fluctuations. The NFS commits can be as large as seconds worth of data. One XFS completion may be as large as half second worth of data if we are going to increase the write chunk to half second worth of data. In ext4, fluctuations with time period of around 5 seconds is observed. And there is another pattern of irregular periods of up to 20 seconds on SSD tests. That's why we are not only doing the estimation at 200ms intervals, but also averaging them over a period of 3 seconds and then go further to do another level of smoothing in avg_write_bandwidth. CC: Li Shaohua <shaohua.li@intel.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-10writeback: account per-bdi accumulated written pagesJan Kara
Introduce the BDI_WRITTEN counter. It will be used for estimating the bdi's write bandwidth. Peter Zijlstra <a.p.zijlstra@chello.nl>: Move BDI_WRITTEN accounting into __bdi_writeout_inc(). This will cover and fix fuse, which only calls bdi_writeout_inc(). CC: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-10writeback: make writeback_control.nr_to_write straightWu Fengguang
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abuse it for writing multiple files in writeback_sb_inodes() and its callers. It immediately clean things up, e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever remained quota in wbc->nr_to_write. Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: split inode_wb_list_lock into bdi_writeback.list_lockChristoph Hellwig
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock, as it's currently the most contended lock in the system for metadata heavy workloads. It won't help for single-filesystem workloads for which we'll need the I/O-less balance_dirty_pages, but at least we can dedicate a cpu to spinning on each bdi now for larger systems. Based on earlier patches from Nick Piggin and Dave Chinner. It reduces lock contentions to 1/4 in this test case: 10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram lock_stat version 0.3 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- vanilla 2.6.39-rc3: inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23 ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157 2.6.39-rc3 + patch: &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37 ------------------------ &(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150 &(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f &(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223 ------------------------ &(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b &(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf &(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-05-20backing-dev: Kill set but not used var in bdi_debug_stats_show()Gustavo F. Padovan
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: simplify iget & friends fs: pull inode->i_lock up out of writeback_single_inode fs: rename inode_lock to inode_hash_lock fs: move i_wb_list out from under inode_lock fs: move i_sb_list out from under inode_lock fs: remove inode_lock from iput_final and prune_icache fs: Lock the inode LRU list separately fs: factor inode disposal fs: protect inode->i_state with inode->i_lock autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() autofs4 - remove autofs4_lock autofs4 - fix d_manage() return on rcu-walk autofs4 - fix autofs4_expire_indirect() traversal autofs4 - fix dentry leak in autofs4_expire_direct() autofs4 - reinstate last used update on access vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
2011-03-25fs: move i_wb_list out from under inode_lockDave Chinner
Protect the inode writeback list with a new global lock inode_wb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-17fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going awayJens Axboe
We don't have proper reference counting for this yet, so we run into cases where the device is pulled and we OOPS on flushing the fs data. This happens even though the dirty inodes have already been migrated to the default_backing_dev_info. Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: remove per-queue pluggingJens Axboe
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) split invalidate_inodes() fs: skip I_FREEING inodes in writeback_sb_inodes fs: fold invalidate_list into invalidate_inodes fs: do not drop inode_lock in dispose_list fs: inode split IO and LRU lists fs: switch bdev inode bdi's correctly fs: fix buffer invalidation in invalidate_list fsnotify: use dget_parent smbfs: use dget_parent exportfs: use dget_parent fs: use RCU read side protection in d_validate fs: clean up dentry lru modification fs: split __shrink_dcache_sb fs: improve DCACHE_REFERENCED usage fs: use percpu counter for nr_dentry and nr_dentry_unused fs: simplify __d_free fs: take dcache_lock inside __d_path fs: do not assign default i_ino in new_inode fs: introduce a per-cpu last_ino allocator new helper: ihold() ...
2010-10-26kernel: remove PF_FLUSHERPeter Zijlstra
PF_FLUSHER is only ever set, not tested, remove it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: do not sleep on the congestion queue if there are no congested ↵Mel Gorman
BDIs or if significant congestion is not being encountered in the current zone If congestion_wait() is called with no BDI congested, the caller will sleep for the full timeout and this may be an unnecessary sleep. This patch adds a wait_iff_congested() that checks congestion and only sleeps if a BDI is congested else, it calls cond_resched() to ensure the caller is not hogging the CPU longer than its quota but otherwise will not sleep. This is aimed at reducing some of the major desktop stalls reported during IO. For example, while kswapd is operating, it calls congestion_wait() but it could just have been reclaiming clean page cache pages with no congestion. Without this patch, it would sleep for a full timeout but after this patch, it'll just call schedule() if it has been on the CPU too long. Similar logic applies to direct reclaimers that are not making enough progress. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: account for time spent congestion_waitedMel Gorman
There is strong evidence to indicate a lot of time is being spent in congestion_wait(), some of it unnecessarily. This patch adds a tracepoint for congestion_wait to record when congestion_wait() was called, how long the timeout was for and how long it actually slept. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26fs: inode split IO and LRU listsNick Piggin
The use of the same inode list structure (inode->i_list) for two different list constructs with different lifecycles and purposes makes it impossible to separate the locking of the different operations. Therefore, to enable the separation of the locking of the writeback and reclaim lists, split the inode->i_list into two separate lists dedicated to their specific tracking functions. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-09-22bdi: Initialize noop_backing_dev_info properlyJan Kara
Properly initialize this backing dev info so that writeback code does not barf when getting to it e.g. via sb->s_bdi. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-27writeback: do not lose wakeup events when forking bdi threadsArtem Bityutskiy
This patch fixes the following issue: INFO: task mount.nfs4:1120 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000 ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000 ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8 00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0 Call Trace: [<ffffffff813bc747>] schedule_timeout+0x34/0xf1 [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130 [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130 [<ffffffff8104159c>] ? default_wake_function+0x0/0xf [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e [<ffffffff81105798>] sync_filesystem+0x47/0x4b [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2 [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs] [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41 [<ffffffff810e7fd6>] deactivate_super+0x40/0x45 [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed [<ffffffff810fc73b>] release_mounts+0x9a/0xb0 [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs] [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs] [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs] [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196 [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8 [<ffffffff810fdf68>] do_mount+0x771/0x7e8 [<ffffffff810fe062>] sys_mount+0x83/0xbd [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b The reason of this hang was a race condition: when the flusher thread is forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it visible in 'bdi->wb.task'. The bdi thread runs, does all works, and goes sleep. 'bdi->wb.task' is still NULL. And this is a dangerous time window. If at this time someone queues a work for this bdi, he does not see the bdi thread and wakes up the forker thread instead! But the forker has already forked this bdi thread, but just did not make it visible yet! The result is that we lose the wake up event for this bdi thread and the NFS4 code waits forever. To fix the problem, we should use 'ktrhead_create()' for creating bdi threads, then make them visible in 'bdi->wb.task', and only after this wake them up. This is exactly what this patch does. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-12writeback: avoid unnecessary calculation of bdi dirty thresholdsWu Fengguang
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-07writeback: fix bad _bh spinlock nestingJens Axboe
Fix a bug where a lock is _bh nested within another _bh lock, but forgets to use the _bh variant for unlock. Further more, it's not necessary to test _bh locks, the inner lock can just use spin_lock(). So fix up the bug by making that change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: cleanup bdi_registerArtem Bityutskiy
This patch makes sure we first initialize everything and set the BDI_registered flag, and only after this we add the bdi to 'bdi_list'. Current code adds the bdi to the list too early, and as a result I the WARN(!test_bit(BDI_registered, &bdi->state) in bdi forker is triggered. Also, it is in general good practice to make things visible only when they are fully initialized. Also, this patch does few micro clean-ups: 1. Removes the 'exit' label which does not do anything, just returns. This allows to get rid of few braces and 'ret' variable and make the code smaller. 2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded '-ENOMEM'. Theoretically, some day 'kthread_run()' can return something else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to NULL. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: add new tracepointsArtem Bityutskiy
Add 2 new trace points to the periodic write-back wake up case, just like we do in the 'bdi_queue_work()' function. Namely, introduce: 1. trace_writeback_wake_thread(bdi) 2. trace_writeback_wake_forker_thread(bdi) The first event is triggered every time we wake up a bdi thread to start periodic background write-out. The second event is triggered only when the bdi thread does not exist and should be created by the forker thread. This patch was suggested by Dave Chinner and Christoph Hellwig. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: remove unnecessary init_timer callArtem Bityutskiy
The 'setup_timer()' function also calls 'init_timer()', so the extra 'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically 'init_timer()' plus callback function and data pointers initialization. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: optimize periodic bdi thread wakeupsArtem Bityutskiy
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which should take care of the periodic background write-out. However, the write-out will actually start only 'dirty_writeback_interval' centisecs later, so we can delay the wake-up. This change was requested by Nick Piggin who pointed out that if we delay the wake-up, we weed out 2 unnecessary contex switches, which matters because '__mark_inode_dirty()' is a hot-path function. This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which sets up a timer to wake-up the bdi thread and returns. So the wake-up is delayed. We also delete the timer in bdi threads just before writing-back. And synchronously delete it when unregistering bdi. At the unregister point the bdi does not have any users, so no one can arm it again. Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes this change as well. This patch also moves the 'bdi_wb_init()' function down in the file to avoid forward-declaration of 'bdi_wakeup_thread_delayed()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: prevent unnecessary bdi threads wakeupsArtem Bityutskiy
Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very bad for battery-driven devices. There are two types of activities bdi threads do: 1. process bdi works from the 'bdi->work_list' 2. periodic write-back So there are 2 sources of wake-up events for bdi threads: 1. 'bdi_queue_work()' - submits bdi works 2. '__mark_inode_dirty()' - adds dirty I/O to bdi's The former already has bdi wake-up code. The latter does not, and this patch adds it. '__mark_inode_dirty()' is hot-path function, but this patch adds another 'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when the bdi has no dirty inodes. So adding this spinlock should be fine and should not affect performance. This patch makes sure bdi threads and the forker thread do not wake-up if there is nothing to do. The forker thread will nevertheless wake up at least every 5 min. to check whether it has to kill a bdi thread. This can also be optimized, but is not worth it. This patch also tidies up the warning about unregistered bid, and turns it from an ugly crocodile to a simple 'WARN()' statement. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: move bdi threads exiting logic to the forker threadArtem Bityutskiy
Currently, bdi threads can decide to exit if there were no useful activities for 5 minutes. However, this causes nasty races: we can easily oops in the 'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up. And even if we do not oops, but the bdi tread exits immediately after we wake it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs) in the bdi work processing. This patch makes the forker thread to be the central place which not only creates bdi threads, but also kills them if they were inactive long enough. This better design-wise. Another reason why this change was done is to prepare for the further changes which will prevent the bdi threads from waking up every 5 sec and wasting power. Indeed, when the task does not wake up periodically anymore, it won't be able to exit either. This patch also moves the the 'wake_up_bit()' call from the bdi thread to the forker thread as well. So now the forker thread sets the BDI_pending bit, then forks the task or kills it, then clears the bit and wakes up the waiting process. The only process which may wain on the bit is 'bdi_wb_shutdown()'. This function was changed as well - now it first removes the bdi from the 'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is guaranteed that the forker thread won't race with it, because the bdi is not visible. Note, the forker thread sets the 'BDI_pending' bit under the 'bdi->wb_lock' which is essential for proper serialization. And additionally, when we change 'bdi->wb.task', we now take the 'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise would when raced with, say, 'bdi_queue_work()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: restructure bdi forker loop a littleArtem Bityutskiy
This patch re-structures the bdi forker a little: 1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason for this is that the forker thread can start _before_ the 'BDI_registered' flag is set (see 'bdi_register()'), so the WARN() statement will fire for the default bdi. I observed this warning at boot-up. 2. Introduce an enum 'action' and use "switch" statement in the outer loop. This is a preparation to the further patch which will teach the forker thread killing bdi threads, so we'll have another case in the "switch" statement. This change was suggested by Christoph Hellwig. This patch is just a small step towards the coming change where the forker thread will kill the bdi threads. It should simplify reviewing the following changes, which would otherwise be larger. This patch also amends comments a little. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: do not remove bdi from bdi_listArtem Bityutskiy
The forker thread removes bdis from 'bdi_list' before forking the bdi thread. But this is wrong for at least 2 reasons. Reason #1: if we temporary remove a bdi from the list, we may miss works which would otherwise be given to us. Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when it races with the forker thread, it can shut down the bdi thread at the same time as the forker creates it. This patch makes sure the forker thread never removes bdis from 'bdi_list' (which was suggested by Christoph Hellwig). In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending' flag. NOTE! The error path is interesting. Currently, when we fail to create a bdi thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the bdi from the list, we cannot move it to the tail either, because then we can mess up the RCU readers which walk the list. And also, we'll have the race described above in "Reason #2". But I not think that adding to the tail is any important so I just do not do that. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: simplify bdi code a littleArtem Bityutskiy
This patch simplifies bdi code a little by removing the 'pending_list' which is redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is working like this: 1. In a loop, fetch all bdi's which have works but have no writeback thread and move them to the 'pending_list'. 2. If the list is empty, sleep for 5 sec. 3. Otherwise, take one bdi from the list, fork the writeback thread for this bdi, and repeat the loop. IOW, it first moves everything to the 'pending_list', then process only one element, and so on. This patch simplifies the algorithm, which is now as follows. 1. Find the first bdi which has a work and remove it from the global list of bdi's (bdi_list). 2. If there was not such bdi, sleep 5 sec. 3. Fork the writeback thread for this bdi and repeat the loop. IOW, now we find the first bdi to process, process it, and so on. This is simpler and involves less lists. The bonus now is that we can get rid of a couple of functions, as well as remove complications which involve 'rcu_call()' and 'bdi->rcu_head'. This patch also makes sure we use 'list_add_tail_rcu()', instead of plain 'list_add_tail()', but this piece of code is going to be removed in the next patch anyway. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: do not lose wake-ups in the forker thread - 2Artem Bityutskiy
Currently, if someone submits jobs for the default bdi, we can lose wake-up events. E.g., this can happen if 'bdi_queue_work()' is called when 'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but before 'set_current_state(TASK_INTERRUPTIBLE)'. This situation is unlikely, and the result is not very severe - we'll just delay the execution of the work, but this is still not very nice. This patch fixes the issue by checking whether the default bdi has works before the forker thread goes sleep. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: do not lose wake-ups in the forker thread - 1Artem Bityutskiy
Currently the forker thread can lose wake-ups which may lead to unnecessary delays in processing bdi works. E.g., consider the following scenario. 1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to do, and is about to finish the loop. 2. A bdi thread decides to exit because it was inactive for long time. 3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up the forker thread. 4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)' and goes sleep. We lose a wake-up. Losing the wake-up is not fatal, but this means that the bdi work processing will be delayed by up to 5 sec. This race is theoretical, I never hit it, but it is worth fixing. The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking 'bdi_list', not after. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: fix possible race when creating bdi threadsArtem Bityutskiy
This patch fixes a very unlikely race condition on the bdi forker thread error path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code for a short period of time. If at the same time someone submits a work to this bdi, we can end up with an oops 'bdi_queue_work()' while executing 'wake_up_process(wb->task)'. This patch fixes the issue by introducing a temporary variable 'task' and storing the possible error code there, so that 'wb->task' would never take erroneous values. Note, this race is very unlikely and I never hit it, so it is theoretical, but nevertheless worth fixing. This patch also merges 2 comments which were previously separate. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: harmonize writeback threads namingArtem Bityutskiy
The write-back code mixes words "thread" and "task" for the same things. This is not a big deal, but still an inconsistency. hch: a convention I tend to use and I've seen in various places is to always use _task for the storage of the task_struct pointer, and thread everywhere else. This especially helps with having foo_thread for the actual thread and foo_task for a global variable keeping the task_struct pointer This patch renames: * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()' * 'bdi_forker_task()' -> 'bdi_forker_thread()' because bdi threads are 'bdi_writeback_thread()', so these names are more consistent. This patch also amends commentaries and makes them refer the forker and bdi threads as "thread", not "task". Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use 'static void' instead of 'void static' and make checkpatch.pl happy. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Initial tracing supportDave Chinner
Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. e.g a sync invocation leaves traces like: sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path. The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: merge bdi_writeback_task and bdi_start_fnChristoph Hellwig
Move all code for the writeback thread into fs/fs-writeback.c instead of splitting it over two functions in two files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: remove wb_listChristoph Hellwig
The wb_list member of struct backing_device_info always has exactly one element. Just use the direct bdi->wb pointer instead and simplify some code. Also remove bdi_task_init which is now trivial to prepare for the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: simplify the write back thread queueChristoph Hellwig
First remove items from work_list as soon as we start working on them. This means we don't have to track any pending or visited state and can get rid of all the RCU magic freeing the work items - we can simply free them once the operation has finished. Second use a real completion for tracking synchronous requests - if the caller sets the completion pointer we complete it, otherwise use it as a boolean indicator that we can free the work item directly. Third unify struct wb_writeback_args and struct bdi_work into a single data structure, wb_writeback_work. Previous we set all parameters into a struct wb_writeback_args, copied it into struct bdi_work, copied it again on the stack to use it there. Instead of just allocate one structure dynamically or on the stack and use it all the way through the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: remove writeback_inodes_wbcChristoph Hellwig
This was just an odd wrapper around writeback_inodes_wb. Removing this also allows to get rid of the bdi member of struct writeback_control which was rather out of place there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-05-21writeback: fixups for !dirty_writeback_centisecsJens Axboe
Commit 69b62d01 fixed up most of the places where we would enter busy schedule() spins when disabling the periodic background writeback. This fixes up the sb timer so that it doesn't get hammered on with the delay disabled, and ensures that it gets rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs gets modified. bdi_forker_task() also needs to check for !dirty_writeback_centisecs and use schedule() appropriately, fix that up too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-25Catch filesystems lacking s_bdiJörn Engel
noop_backing_dev_info is used only as a flag to mark filesystems that don't have any backing store, like tmpfs, procfs, spufs, etc. Signed-off-by: Joern Engel <joern@logfs.org> Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes to the noop_backing_dev_info is not legal and will not result in them being flushed, but we already catch this condition in __mark_inode_dirty() when checking for a registered bdi. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-22bdi: add helper function for doing init and register of a bdi for a file systemJens Axboe
Pretty trivial helper, just sets up the bdi and registers it. An atomic sequence count is used to ensure that the registered sysfs names are unique. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-02backing-dev: Handle class_create() failureAnton Blanchard
I hit this when we had a bug in IDR for a few days. Basically sysfs would fail to create new inodes since it uses an IDR and therefore class_create would fail. While we are unlikely to see this fail we may as well handle it instead of oopsing. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-03flusher: Fix PF_FROZEN raceOGAWA Hirofumi
To touch task->flags directly is racy. thaw_process() still has race (changing non_current->flags, but this is another issue) though, I think it's much better off. So, use thaw_process() instead. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-12Thaw refrigerated bdi flusher threads before invoking kthread_stop on themRomit Dasgupta
Unfreezes the bdi flusher task when the said task needs to exit. Steps to reproduce this. 1) Mount a file system from MMC/SD card. 2) Unmount the file system. This creates a flusher task. 3) Attempt suspend to RAM. System is unresponsive. This is because the bdi flusher thread is already in the refrigerator and will remain so until it is thawed. The MMC driver suspend routine call stack will ultimately issue a 'kthread_stop' on the bdi flusher thread and will block until the flusher thread is exited. Since the bdi flusher thread is in the refrigerator it never cleans up until thawed. Signed-off-by: Romit Dasgupta <romit@ti.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-03backing-dev: bdi sb prune should be in the unregister path, not destroyJens Axboe
Commit 592b09a42fc3ae6737a0f3ecf4fee42ecd0296f8 was different from the tested path, in that it moved the bdi super_block prune from unregister to destroy context. This doesn't fully fix the sync hang bug on unexpected device removal, as need to prune the bdi cache pointer before killing flusher thread. Tested-by: Artur Skawina <art.08.09@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-29backing-dev: ensure that a removed bdi no longer has super_block referencing itJens Axboe
When the bdi is being removed, we have to ensure that no super_blocks currently have that cached in sb->s_bdi. Normally this is ensured by the sb having a longer life span than the bdi, but if the device is suddenly yanked, we have to kill this reference. sb->s_bdi is pointed to freed memory at that point. This fixes a problem with sync(1) hanging when a USB stick is pulled without cleanly umounting it first. Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>