summaryrefslogtreecommitdiff
path: root/mm/filemap.c
AgeCommit message (Collapse)Author
2012-01-13memcg: add mem_cgroup_replace_page_cache() to fix LRU issueKAMEZAWA Hiroyuki
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a function replace_page_cache_page(). This function replaces a page in the radix-tree with a new page. WHen doing this, memory cgroup needs to fix up the accounting information. memcg need to check PCG_USED bit etc. In some(many?) cases, 'newpage' is on LRU before calling replace_page_cache(). So, memcg's LRU accounting information should be fixed, too. This patch adds mem_cgroup_replace_page_cache() and removes the old hooks. In that function, old pages will be unaccounted without touching res_counter and new page will be accounted to the memcg (of old page). WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid races with LRU handling. Background: replace_page_cache_page() is called by FUSE code in its splice() handling. Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated page and may be on LRU. LRU mis-accounting will be critical for memory cgroup because rmdir() checks the whole LRU is empty and there is no account leak. If a page is on the other LRU than it should be, rmdir() will fail. This bug was added in March 2011, but no bug report yet. I guess there are not many people who use memcg and FUSE at the same time with upstream kernels. The result of this bug is that admin cannot destroy a memcg because of account leak. So, no panic, no deadlock. And, even if an active cgroup exist, umount can succseed. So no problem at shutdown. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()Johannes Weiner
Tell the page allocator that pages allocated through grab_cache_page_write_begin() are expected to become dirty soon. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-04should_remove_suid(): inode->i_mode is umode_tAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-22vfs: __read_cache_page should use gfp argument rather than GFP_KERNELDave Kleikamp
lockdep reports a deadlock in jfs because a special inode's rw semaphore is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not used when __read_cache_page() calls add_to_page_cache_lru(). Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-02fs: Make write(2) interruptible by a fatal signalJan Kara
Currently write(2) to a file is not interruptible by any signal. Sometimes this is desirable, e.g. when you want to quickly kill a process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make task in balance_dirty_pages() killable"), it's necessary to abort the current write accordingly to avoid it quickly dirtying lots more pages at unthrottled rate. This patch makes write interruptible by SIGKILL. We do not allow write to be interruptible by any other signal because that has larger potential of screwing some badly written applications. Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com> Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31mm: Map most files to use export.h instead of module.hPaul Gortmaker
The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-28vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriatelyJeff Layton
Currently, when you call iov_iter_advance, then the pointer to the iovec array can be incremented, but it does not decrement the nr_segs value in the iov_iter struct. The result is a iov_iter struct with a nr_segs value that goes beyond the end of the array. While I'm not aware of anything that's specifically broken by this, it seems odd and a bit dangerous not to decrement that value. If someone were to trust the nr_segs value to be correct, then they could end up walking off the end of the array. Changing this might also provide some micro-optimization when dealing with the last iovec in an array. Many of the other routines that deal with iov_iter have optimized codepaths when nr_segs == 1. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-09-15mm: account skipped entries to avoid looping in find_get_pagesShaohua Li
The found entries by find_get_pages() could be all swap entries. In this case we skip the entries, but make sure the skipped entries are accounted, so we don't keep looping. Using nr_found > nr_skip to simplify code as suggested by Eric. Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04mm: clarify the radix_tree exceptional casesHugh Dickins
Make the radix_tree exceptional cases, mostly in filemap.c, clearer. It's hard to devise a suitable snappy name that illuminates the use by shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree generality. And akpm points out that /* radix_tree_deref_retry(page) */ comments look like calls that have been commented out for unknown reason. Skirt the naming difficulty by rearranging these blocks to handle the transient radix_tree_deref_retry(page) case first; then just explain the remaining shmem/tmpfs swap case in a comment. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04mm: a few small updates for radix-swapHugh Dickins
Remove PageSwapBacked (!page_is_file_cache) cases from add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now go through shmem_add_to_page_cache(). Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(), and add a comment on swap entries to invalidate_mapping_pages(). And mincore_page() uses find_get_page() on what might be shmem or a tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04mm: let swap use exceptional entriesHugh Dickins
If swap entries are to be stored along with struct page pointers in a radix tree, they need to be distinguished as exceptional entries. Most of the handling of swap entries in radix tree will be contained in shmem.c, but a few functions in filemap.c's common code need to check for their appearance: find_get_page(), find_lock_page(), find_get_pages() and find_get_pages_contig(). So as not to slow their fast paths, tuck those checks inside the existing checks for unlikely radix_tree_deref_slot(); except for find_lock_page(), where it is an added test. And make it a BUG in find_get_pages_tag(), which is not applied to tmpfs files. A part of the reason for eliminating shmem_readpage() earlier, was to minimize the places where common code would need to allow for swap entries. The swp_entry_t known to swapfile.c must be massaged into a slightly different form when stored in the radix tree, just as it gets massaged into a pte_t when stored in page tables. In an i386 kernel this limits its information (type and page offset) to 30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum swapfile size of 128GB. Which is less than the 512GB we previously allowed with X86_PAE (where the swap entry can occupy the entire upper 32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and there's not a new limitation on 64-bit (where swap filesize is already limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is probably still enough swap for a 64GB 32-bit machine. Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and enforce filesize limit in read_swap_header(), just as for ptes. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04radix_tree: exceptional entries and indicesHugh Dickins
A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its peculiar swap vector, instead keeping a file's swap entries in the same radix tree as its struct page pointers: thus saving memory, and simplifying its code and locking. This patch: The radix_tree is used by several subsystems for different purposes. A major use is to store the struct page pointers of a file's pagecache for memory management. But what if mm wanted to store something other than page pointers there too? The low bit of a radix_tree entry is already used to denote an indirect pointer, for internal use, and the unlikely radix_tree_deref_retry() case. Define the next bit as denoting an exceptional entry, and supply inline functions radix_tree_exception() to return non-0 in either unlikely case, and radix_tree_exceptional_entry() to return non-0 in the second case. If a subsystem already uses radix_tree with that bit set, no problem: it does not affect internal workings at all, but is defined for the convenience of those storing well-aligned pointers in the radix_tree. The radix_tree_gang_lookups have an implicit assumption that the caller can deduce the offset of each entry returned e.g. by the page->index of a struct page. But that may not be feasible for some kinds of item to be stored there. radix_tree_gang_lookup_slot() allow for an optional indices argument, output array in which to return those offsets. The same could be added to other radix_tree_gang_lookups, but for now keep it to the only one for which we need it. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits) mm: properly reflect task dirty limits in dirty_exceeded logic writeback: don't busy retry writeback on new/freeing inodes writeback: scale IO chunk size up to half device bandwidth writeback: trace global_dirty_state writeback: introduce max-pause and pass-good dirty limits writeback: introduce smoothed global dirty limit writeback: consolidate variable names in balance_dirty_pages() writeback: show bdi write bandwidth in debugfs writeback: bdi write bandwidth estimation writeback: account per-bdi accumulated written pages writeback: make writeback_control.nr_to_write straight writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr() writeback: trace event writeback_queue_io writeback: trace event writeback_single_inode writeback: remove .nonblocking and .encountered_congestion writeback: remove writeback_control.more_io writeback: skip balance_dirty_pages() for in-memory fs writeback: add bdi_dirty_limit() kernel-doc writeback: avoid extra sync work at enqueue time writeback: elevate queue_io() into wb_writeback() ... Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c
2011-07-26mm: consistent truncate and invalidate loopsHugh Dickins
Make the pagevec_lookup loops in truncate_inode_pages_range(), invalidate_mapping_pages() and invalidate_inode_pages2_range() more consistent with each other. They were relying upon page->index of an unlocked page, but apologizing for it: accept it, embrace it, add comments and WARN_ONs, and simplify the index handling. invalidate_inode_pages2_range() had special handling for a wrapped page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere near there, and a corrupt page->index in the radix_tree could cause more trouble than that would catch. Remove that wrapped handling. invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup when near the end of the range: copy that into the other two, although it's less useful than you might think (it limits the use of the buffer, rather than the indices looked up). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26mm: cleanup descriptions of filler argHugh Dickins
The often-NULL data arg to read_cache_page() and read_mapping_page() functions is misdescribed as "destination for read data": no, it's the first arg to the filler function, often struct file * to ->readpage(). Satisfy checkpatch.pl on those filler prototypes, and tidy up the declarations in linux/pagemap.h. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-21fs: kill i_alloc_semChristoph Hellwig
i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-08writeback: split inode_wb_list_lock into bdi_writeback.list_lockChristoph Hellwig
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock, as it's currently the most contended lock in the system for metadata heavy workloads. It won't help for single-filesystem workloads for which we'll need the I/O-less balance_dirty_pages, but at least we can dedicate a cpu to spinning on each bdi now for larger systems. Based on earlier patches from Nick Piggin and Dave Chinner. It reduces lock contentions to 1/4 in this test case: 10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram lock_stat version 0.3 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- vanilla 2.6.39-rc3: inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23 ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157 2.6.39-rc3 + patch: &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37 ------------------------ &(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150 &(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f &(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223 ------------------------ &(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b &(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf &(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-03more conservative S_NOSEC handlingAl Viro
Caching "we have already removed suid/caps" was overenthusiastic as merged. On network filesystems we might have had suid/caps set on another client, silently picked by this client on revalidate, all of that *without* clearing the S_NOSEC flag. AFAICS, the only reasonably sane way to deal with that is * new superblock flag; unless set, S_NOSEC is not going to be set. * local block filesystems set it in their ->mount() (more accurately, mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than local block ones clear it) * if any network filesystem (or a cluster one) wants to use S_NOSEC, it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when inode attribute changes are picked from other clients. It's not an earth-shattering hole (anybody that can set suid on another client will almost certainly be able to write to the file before doing that anyway), but it's a bug that needs fixing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-28Cache xattr security drop check for write v2Andi Kleen
Some recent benchmarking on btrfs showed that a major scaling bottleneck on large systems on btrfs is currently the xattr lookup on every write. Why xattr lookup on every write I hear you ask? write wants to drop suid and security related xattrs that could set o capabilities for executables. To do that it currently looks up security.capability on EVERY write (even for non executables) to decide whether to drop it or not. In btrfs this causes an additional tree walk, hitting some per file system locks and quite bad scalability. In a simple read workload on a 8S system I saw over 90% CPU time in spinlocks related to that. Chris Mason tells me this is also a problem in ext4, where it hits the global mbcache lock. This patch adds a simple per inode to avoid this problem. We only do the lookup once per file and then if there is no xattr cache the decision. All xattr changes clear the flag. I also used the same flag to avoid the suid check, although that one is pretty cheap. A file system can also set this flag when it creates the inode, if it has a cheap way to do so. This is done for some common file systems in followon patches. With this patch a major part of the lock contention disappears for btrfs. Some testing on smaller systems didn't show significant performance changes, but at least it helps the larger systems and is generally more efficient. v2: Rename is_sgid. add file system helper. Cc: chris.mason@oracle.com Cc: josef@redhat.com Cc: viro@zeniv.linux.org.uk Cc: agruen@linbit.com Cc: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-28mm: Wait for writeback when grabbing pages to begin a writeDarrick J. Wong
When grabbing a page for a buffered IO write, the mm should wait for writeback on the page to complete so that the page does not become writable during the IO operation. This change is needed to provide page stability during writes for all filesystems. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-27memcg: add the pagefault count into memcg statsYing Han
Two new stats in per-memcg memory.stat which tracks the number of page faults and number of major page faults. "pgfault" "pgmajfault" They are different from "pgpgin"/"pgpgout" stat which count number of pages charged/discharged to the cgroup and have no meaning of reading/ writing page to disk. It is valuable to track the two stats for both measuring application's performance as well as the efficiency of the kernel page reclaim path. Counting pagefaults per process is useful, but we also need the aggregated value since processes are monitored and controlled in cgroup basis in memcg. Functional test: check the total number of pgfault/pgmajfault of all memcgs and compare with global vmstat value: $ cat /proc/vmstat | grep fault pgfault 1070751 pgmajfault 553 $ cat /dev/cgroup/memory.stat | grep fault pgfault 1071138 pgmajfault 553 total_pgfault 1071142 total_pgmajfault 553 $ cat /dev/cgroup/A/memory.stat | grep fault pgfault 199 pgmajfault 0 total_pgfault 199 total_pgmajfault 0 Performance test: run page fault test(pft) wit 16 thread on faulting in 15G anon pages in 16G container. There is no regression noticed on the "flt/cpu/s" Sample output from pft: TAG pft:anon-sys-default: Gb Thr CLine User System Wall flt/cpu/s fault/wsec 15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260 +-------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 10 16682.962 17344.027 16913.524 16928.812 166.5362 + 10 16695.568 16923.896 16820.604 16824.652 84.816568 No difference proven at 95.0% confidence [akpm@linux-foundation.org: fix build] [hughd@google.com: shmem fix] Signed-off-by: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem: xen: cleancache shim to Xen Transcendent Memory ocfs2: add cleancache support ext4: add cleancache support btrfs: add cleancache support ext3: add cleancache support mm/fs: add hooks to support cleancache mm: cleancache core ops functions and config fs: add field to superblock to support cleancache mm/fs: cleancache documentation Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26mm/fs: add hooks to support cleancacheDan Magenheimer
This fourth patch of eight in this cleancache series provides the core hooks in VFS for: initializing cleancache per filesystem; capturing clean pages reclaimed by page cache; attempting to get pages from cleancache before filesystem read; and ensuring coherency between pagecache, disk, and cleancache. Note that the placement of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic change was required due to a patchset in 2.6.39. All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become a check of a boolean global if CONFIG_CLEANCACHE is set but no cleancache "backend" has claimed cleancache_ops. Details and a FAQ can be found in Documentation/vm/cleancache.txt [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function] Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik Van Riel <riel@redhat.com> Cc: Jan Beulich <JBeulich@novell.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-25readahead: trigger mmap sequential readahead on PG_readaheadWu Fengguang
Previously the mmap sequential readahead is triggered by updating ra->prev_pos on each page fault and compare it with current page offset. It costs dirtying the cache line on each _minor_ page fault. So remove the ra->prev_pos recording, and instead tag PG_readahead to trigger the possible sequential readahead. It's not only more simple, but also will work more reliably and reduce cache line bouncing on concurrent page faults on shared struct file. In the mosbench exim benchmark which does multi-threaded page faults on shared struct file, the ra->mmap_miss and ra->prev_pos updates are found to cause excessive cache line bouncing on tmpfs, which actually disabled readahead totally (shmem_backing_dev_info.ra_pages == 0). So remove the ra->prev_pos recording, and instead tag PG_readahead to trigger the possible sequential readahead. It's not only more simple, but also will work more reliably on concurrent reads on shared struct file. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Tim Chen <tim.c.chen@intel.com> Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25readahead: reduce unnecessary mmap_miss increasesAndi Kleen
The original INT_MAX is too large, reduce it to - avoid unnecessarily dirtying/bouncing the cache line - restore mmap read-around faster on changed access pattern Background: in the mosbench exim benchmark which does multi-threaded page faults on shared struct file, the ra->mmap_miss updates are found to cause excessive cache line bouncing on tmpfs. The ra state updates are needless for tmpfs because it actually disabled readahead totally (shmem_backing_dev_info.ra_pages == 0). Tested-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25readahead: return early when readahead is disabledWu Fengguang
Reduce readahead overheads by returning early in do_sync_mmap_readahead(). tmpfs has ra_pages=0 and it can page fault really fast (not constraint by IO if not swapping). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Tim Chen <tim.c.chen@intel.com> Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: Convert i_mmap_lock to a mutexPeter Zijlstra
Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25x86,mm: make pagefault killableKOSAKI Motohiro
When an oom killing occurs, almost all processes are getting stuck at the following two points. 1) __alloc_pages_nodemask 2) __lock_page_or_retry 1) is not very problematic because TIF_MEMDIE leads to an allocation failure and getting out from page allocator. 2) is more problematic. In an OOM situation, zones typically don't have page cache at all and memory starvation might lead to greatly reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly, meaning that a fork bomb may create new process quickly rather than the oom-killer killing it. Then, the system may become livelocked. This patch makes the pagefault interruptible by SIGKILL. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: introduce wait_on_page_locked_killable()KOSAKI Motohiro
commit 2687a356 ("Add lock_page_killable") introduced killable lock_page(). Similarly this patch introdues killable wait_on_page_locked(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: simplify iget & friends fs: pull inode->i_lock up out of writeback_single_inode fs: rename inode_lock to inode_hash_lock fs: move i_wb_list out from under inode_lock fs: move i_sb_list out from under inode_lock fs: remove inode_lock from iput_final and prune_icache fs: Lock the inode LRU list separately fs: factor inode disposal fs: protect inode->i_state with inode->i_lock autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() autofs4 - remove autofs4_lock autofs4 - fix d_manage() return on rcu-walk autofs4 - fix autofs4_expire_indirect() traversal autofs4 - fix dentry leak in autofs4_expire_direct() autofs4 - reinstate last used update on access vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
2011-03-25fs: move i_wb_list out from under inode_lockDave Chinner
Protect the inode writeback list with a new global lock inode_wb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-25fs: protect inode->i_state with inode->i_lockDave Chinner
Protect inode state transitions and validity checks with the inode->i_lock. This enables us to make inode state transitions independently of the inode_lock and is the first step to peeling away the inode_lock from the code. This requires that __iget() is done atomically with i_state checks during list traversals so that we don't race with another thread marking the inode I_FREEING between the state check and grabbing the reference. Also remove the unlock_new_inode() memory barrier optimisation required to avoid taking the inode_lock when clearing I_NEW. Simplify the code by simply taking the inode->i_lock around the state change and wakeup. Because the wakeup is no longer tricky, remove the wake_up_inode() function and open code the wakeup where necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-23mm: don't return 0 too early from find_get_pages()Hugh Dickins
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably truncate_inode_pages_range() - stop looking further when it returns 0. But if an interrupt comes just after its radix_tree_gang_lookup_slot(), especially if we have preemptible RCU enabled, isn't it conceivable that all 14 pages returned could be removed from the page cache by shrink_page_list(), before find_get_pages() gets to process them? So causing it to return 0 although there may be plenty more pages beyond. Make find_get_pages() and find_get_pages_tag() check for this unlikely case, and restart should it occur; but callers of find_get_pages_contig() have no such expectation, it's okay for that to return 0 early. I have not seen this in practice, just worried by the possibility. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: remove worrying dead code from find_get_pages()Hugh Dickins
The radix_tree_deref_retry() case in find_get_pages() has a strange little excrescence, not seen in the other gang lookups: it looks like the start of an abandoned attempt to guarantee forward progress in a case that cannot arise. ret should always be 0 here: if it isn't, then going back to restart will leak references to pages already gotten. There used to be a comment saying nr_found is necessarily 1 here: that's not quite true, but the radix_tree_deref_retry() case is peculiar to the entry at index 0, when we race with it being moved out of the radix_tree root or back. Remove the worrisome two lines, add a brief comment here and in find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in find_get_pages() should it ever be seen elsewhere than at 0. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: change __remove_from_page_cache()Minchan Kim
Now we renamed remove_from_page_cache with delete_from_page_cache. As consistency of __remove_from_swap_cache and remove_from_swap_cache, we change internal page cache handling function name, too. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: goodbye remove_from_page_cache()Minchan Kim
Now delete_from_page_cache() replaces remove_from_page_cache(). So we remove remove_from_page_cache so fs or something out of mainline will notice it when compile time and can fix it. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: introduce delete_from_page_cache()Minchan Kim
Presently we increase the page refcount in add_to_page_cache() but don't decrease it in remove_from_page_cache(). Such asymmetry adds confusion, requiring that callers notice it and a comment explaining why they release a page reference. It's not a good API. A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but gave up because reiser4's drop_page() had to unlock the page between removing it from page cache and doing the page_cache_release(). But now the situation is changed. I think at least things in current mainline don't have any obstacles. The problem is for out-of-mainline filesystems - if they have done such things as reiser4, this patch could be a problem but they will discover this at compile time since we remove remove_from_page_cache(). This patch: This function works as just wrapper remove_from_page_cache(). The difference is that it decreases page references in itself. So caller have to make sure it has a page reference before calling. This patch is ready for removing remove_from_page_cache(). Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: add replace_page_cache_page() functionMiklos Szeredi
This function basically does: remove_from_page_cache(old); page_cache_release(old); add_to_page_cache_locked(new); Except it does this atomically, so there's no possibility for the "add" to fail because of a race. If memory cgroups are enabled, then the memory cgroup charge is also moved from the old page to the new. This function is currently used by fuse to move pages into the page cache on read, instead of copying the page contents. [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: allow GUP to fail instead of waiting on a pageGleb Natapov
GUP user may want to try to acquire a reference to a page if it is already in memory, but not if IO, to bring it in, is needed. For example KVM may tell vcpu to schedule another guest process if current one is trying to access swapped out page. Meanwhile, the page will be swapped in and the guest process, that depends on it, will be able to run again. This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY instead. [akpm@linux-foundation.org: improve FOLL_NOWAIT comment] Signed-off-by: Gleb Natapov <gleb@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-10fs: make generic file read/write functions plugJens Axboe
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: remove per-queue pluggingJens Axboe
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14mm: remove likely() from grab_cache_page_write_begin()Steven Rostedt
Running the annotated branch profiler on a box doing average work (firefox, evolution, xchat, distcc farm), the likely() used in grab_cache_page_write_begin() was incorrect most of the time: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206 Adding a trace_printk() and running the function tracer limited to just this function I can see: gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7 gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8 gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9 gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10 gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11 gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin Which shows that a lot of calls from ext3_write_begin will result in the page returned by "find_lock_page" will be NULL. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14mm: clear PageError bit in msync & fsyncRik van Riel
Temporary IO failures, eg. due to loss of both multipath paths, can permanently leave the PageError bit set on a page, resulting in msync or fsync returning -EIO over and over again, even if IO is now getting to the disk correctly. We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the filemap_fdatawait_range function. Also clearing the PageError bit on the page allows subsequent msync or fsync calls on this file to return without an error, if the subsequent IO succeeds. Unfortunately data written out in the msync or fsync call that returned -EIO can still get lost, because the page dirty bit appears to not get restored on IO error. However, the alternative could be potentially all of memory filling up with uncleanable dirty pages, hanging the system, so there is no nice choice here... Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Valerie Aurora <vaurora@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14mm: find_get_pages_contig fixletNick Piggin
Testing ->mapping and ->index without a ref is not stable as the page may have been reused at this point. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-07fs: dcache remove dcache_lockNick Piggin
dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2010-12-02Call the filesystem back whenever a page is removed from the page cacheLinus Torvalds
NFS needs to be able to release objects that are stored in the page cache once the page itself is no longer visible from the page cache. This patch adds a callback to the address space operations that allows filesystems to perform page cleanups once the page has been removed from the page cache. Original patch by: Linus Torvalds <torvalds@linux-foundation.org> [trondmy: cover the cases of invalidate_inode_pages2() and truncate_inode_pages()] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-12radix-tree: fix RCU bugNick Piggin
Salman Qazi describes the following radix-tree bug: In the following case, we get can get a deadlock: 0. The radix tree contains two items, one has the index 0. 1. The reader (in this case find_get_pages) takes the rcu_read_lock. 2. The reader acquires slot(s) for item(s) including the index 0 item. 3. The non-zero index item is deleted, and as a consequence the other item is moved to the root of the tree. The place where it used to be is queued for deletion after the readers finish. 3b. The zero item is deleted, removing it from the direct slot, it remains in the rcu-delayed indirect node. 4. The reader looks at the index 0 slot, and finds that the page has 0 ref count 5. The reader looks at it again, hoping that the item will either be freed or the ref count will increase. This never happens, as the slot it is looking at will never be updated. Also, this slot can never be reclaimed because the reader is holding rcu_read_lock and is in an infinite loop. The fix is to re-use the same "indirect" pointer case that requires a slot lookup retry into a general "retry the lookup" bit. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reported-by: Salman Qazi <sqazi@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12mm/vfs: revalidate page->mapping in do_generic_file_read()Dave Hansen
70 hours into some stress tests of a 2.6.32-based enterprise kernel, we ran into a NULL dereference in here: int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc, unsigned long from) { ----> struct inode *inode = page->mapping->host; It looks like page->mapping was the culprit. (xmon trace is below). After closer examination, I realized that do_generic_file_read() does a find_get_page(), and eventually locks the page before calling block_is_partially_uptodate(). However, it doesn't revalidate the page->mapping after the page is locked. So, there's a small window between the find_get_page() and ->is_partially_uptodate() where the page could get truncated and page->mapping cleared. We _have_ a reference, so it can't get reclaimed, but it certainly can be truncated. I think the correct thing is to check page->mapping after the trylock_page(), and jump out if it got truncated. This patch has been running in the test environment for a month or so now, and we have not seen this bug pop up again. xmon info: 1f:mon> e cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770] pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100 lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770 sp: c0000002ae36f9f0 msr: 8000000000009032 dar: 0 dsisr: 40000000 current = 0xc000000378f99e30 paca = 0xc000000000f66300 pid = 21946, comm = bash 1f:mon> r R00 = 0025c0500000006d R16 = 0000000000000000 R01 = c0000002ae36f9f0 R17 = c000000362cd3af0 R02 = c000000000e8cd80 R18 = ffffffffffffffff R03 = c0000000031d0f88 R19 = 0000000000000001 R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0 R05 = 0000000000000000 R21 = c0000002ae36fa68 R06 = 0000000000000000 R22 = 0000000000000000 R07 = 0000000000000001 R23 = c0000002ae36fbb0 R08 = 0000000000000002 R24 = 0000000000000000 R09 = 0000000000000000 R25 = c000000362cd3a80 R10 = 0000000000000000 R26 = 0000000000000002 R11 = c0000000001e7b60 R27 = 0000000000000000 R12 = 0000000042000484 R28 = 0000000000000001 R13 = c000000000f66300 R29 = c0000003bb97b9b8 R14 = 0000000000000001 R30 = c000000000e28a08 R15 = 000000000000ffff R31 = c0000000031d0f88 pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100 lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770 msr = 8000000000009032 cr = 22000488 ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300 dar = 0000000000000000 dsisr = 40000000 1f:mon> t [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770 [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable) [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160 [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0 [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0 [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 00000080a840bc54 SP (fffca15df30) is in userspace 1f:mon> di c0000000001e7a6c c0000000001e7a6c e9290000 ld r9,0(r9) c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100 c0000000001e7a74 e9440008 ld r10,8(r4) c0000000001e7a78 78a80020 clrldi r8,r5,32 c0000000001e7a7c 3c000001 lis r0,1 c0000000001e7a80 812900a8 lwz r9,168(r9) c0000000001e7a84 39600001 li r11,1 c0000000001e7a88 7c080050 subf r0,r8,r0 c0000000001e7a8c 7f805040 cmplw cr7,r0,r10 c0000000001e7a90 7d6b4830 slw r11,r11,r9 c0000000001e7a94 796b0020 clrldi r11,r11,32 c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100 c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11 c0000000001e7aa0 7d004214 add r8,r0,r8 c0000000001e7aa4 79080020 clrldi r8,r8,32 c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100 Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: <arunabal@in.ibm.com> Cc: <sbest@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-02Release page reference during page fault retryMichel Lespinasse
This slipped by when unifying the filemap and swap versions of lock_page_or_retry()... Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>