summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2011-03-23sys_swapon: call swap_cgroup_swapon() earlierCesar Eduardo Barros
The call to swap_cgroup_swapon is in the middle of loading the swap map and extents. As it only does memory allocation and does not depend on the swapfile layout (map/extents), it can be called earlier (or later). Move it to just after the allocation of swap_map, since it is conceptually similar (allocates a map). Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: simplify error flow in read_swap_header()Cesar Eduardo Barros
Since there is no cleanup to do, there is no reason to jump to a label. Return directly instead. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: separate parsing of swapfile headerCesar Eduardo Barros
Move the code which parses and checks the swapfile header (except for the bad block list) to a separate function. Only code movement, no functional changes. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: move setting of swapfilepages near useCesar Eduardo Barros
There is no reason I can see to read inode->i_size long before it is needed. Move its read to just before it is needed, to reduce the variable lifetime. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: simplify error flow in claim_swapfile()Cesar Eduardo Barros
Since there is no cleanup to do, there is no reason to jump to a label. Return directly instead. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: separate bdev claim and inode lockCesar Eduardo Barros
Move the code which claims the bdev (S_ISBLK) or locks the inode (S_ISREG) to a separate function. Only code movement, no functional changes. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: use a single error labelCesar Eduardo Barros
sys_swapon currently has two error labels, bad_swap and bad_swap_2. bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is jumped to. With a single extra test for inode (matching the one in the S_ISREG case below), all the error paths in the function can go to bad_swap. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: do only cleanup in the cleanup blocksCesar Eduardo Barros
The only way error is 0 in the cleanup blocks is when the function is returning successfully. In this case, the cleanup blocks were setting S_SWAPFILE in the S_ISREG case. But this is not a cleanup. Move the setting of S_SWAPFILE to just before the "goto out;" to make this more clear. At this point, we do not need to test for inode because it will never be NULL. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: remove bdev variableCesar Eduardo Barros
The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ? p->bdev : NULL), as long as it being set is moved to a bit earlier. Use this fact to remove the bdev variable. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: move setting of error nearer useCesar Eduardo Barros
Move the setting of the error variable nearer the goto in a few places. Avoids calling PTR_ERR() if not IS_ERR() in two places, and makes the error condition more explicit in two other places. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: remove did_down variableCesar Eduardo Barros
Since mutex_lock(&inode->i_mutex) is called just after setting inode, did_down is always equivalent to (inode && S_ISREG(inode->i_mode)). Use this fact to remove the did_down variable. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: remove initial value of name variableCesar Eduardo Barros
Now there is nothing which jumps to the cleanup blocks before the name variable is set. There is no need to set it initially to NULL anymore. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: simplify error flow in alloc_swap_info()Cesar Eduardo Barros
Since there is no cleanup to do, there is no reason to jump to a label. Return directly instead. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: simplify error return from swap_info allocationCesar Eduardo Barros
At this point in sys_swapon, there is nothing to free. Return directly instead of jumping to the cleanup block at the end of the function. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: separate swap_info allocationCesar Eduardo Barros
Move the swap_info allocation to its own function. Only code movement, no functional changes. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: do not depend on "type" after allocationCesar Eduardo Barros
Within sys_swapon, after the swap_info entry has been allocated, we always have type == p->type and swap_info[type] == p. Use this fact to reduce the dependency on the "type" local variable within the function, as a preparation to move the allocation of the swap_info entry to a separate function. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: remove changelog from function commentCesar Eduardo Barros
Changelogs belong in the git history instead of in the source code. Also, "The swapon system call" is redundant with "SYSCALL_DEFINE2(swapon, ...)". Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Gaah. That's a _historical_ comment. But the patch-series depends on removal ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sys_swapon: use vzalloc() instead of vmalloc/memsetCesar Eduardo Barros
This patch series refactors the sys_swapon function. sys_swapon is currently a very large function, with 313 lines (more than 12 25-line screens), which can make it a bit hard to read. This patch series reduces this size by half, by extracting large chunks of related code to new helper functions. One of these chunks of code was nearly identical to the part of sys_swapoff which is used in case of a failure return from try_to_unuse(), so this patch series also makes both share the same code. As a side effect of all this refactoring, the compiled code gets a bit smaller (from v1 of this patch series): text data bss dec hex filename 14012 944 276 15232 3b80 mm/swapfile.o.before 13941 944 276 15161 3b39 mm/swapfile.o.after This patch: Use vzalloc() instead of vmalloc/memset. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: use __GFP_OTHER_NODE for transparent huge pagesAndi Kleen
Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the hugepages daemon. This way the low level accounting for local versus remote pages works correctly. Contains improvements from Andrea Arcangeli [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: add __GFP_OTHER_NODE flagAndi Kleen
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in zone_statistics() that an allocation is on behalf of another thread. This way the local and remote counters can be still correct, even when background daemons like khugepaged are changing memory mappings. This only affects the accounting, but I think it's worth doing that right to avoid confusing users. I first tried to just pass down the right node, but this required a lot of changes to pass down this parameter and at least one addition of a 10th argument to a 9 argument function. Using the flag is a lot less intrusive. Open: should be also used for migration? [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writebackAndrea Arcangeli
__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to succeed as they have graceful fallback. Waiting for I/O in those, tends to be overkill in terms of latencies, so we can reduce their latency by disabling sync migrate. Unfortunately, even with async migration it's still possible for the process to be blocked waiting for a request slot (e.g. get_request_wait in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on dirty page cache for asynchronous migration. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142 [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Arthur Marsh <arthur.marsh@internode.on.net> Cc: Clemens Ladisch <cladisch@googlemail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: compaction: minimise the time IRQs are disabled while isolating pages ↵Andrea Arcangeli
for migration compaction_alloc() isolates pages for migration in isolate_migratepages. While it's scanning, IRQs are disabled on the mistaken assumption the scanning should be short. Tests show this to be true for the most part but contention times on the LRU lock can be increased. Before this patch, the IRQ disabled times for a simple test looked like Total sampled time IRQs off (not real total time): 5493 Event shrink_inactive_list..shrink_zone 1596 us count 1 Event shrink_inactive_list..shrink_zone 1530 us count 1 Event shrink_inactive_list..shrink_zone 956 us count 1 Event shrink_inactive_list..shrink_zone 541 us count 1 Event shrink_inactive_list..shrink_zone 531 us count 1 Event split_huge_page..add_to_swap 232 us count 1 Event save_args..call_softirq 36 us count 1 Event save_args..call_softirq 35 us count 2 Event __wake_up..__wake_up 1 us count 1 This patch reduces the worst-case IRQs-disabled latencies by releasing the lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if necessary. The cost of this is that the processing performing compaction will be slower but IRQs being disabled for too long a time has worse consequences as the following report shows; Total sampled time IRQs off (not real total time): 4367 Event shrink_inactive_list..shrink_zone 881 us count 1 Event shrink_inactive_list..shrink_zone 875 us count 1 Event shrink_inactive_list..shrink_zone 868 us count 1 Event shrink_inactive_list..shrink_zone 555 us count 1 Event split_huge_page..add_to_swap 495 us count 1 Event compact_zone..compact_zone_order 269 us count 1 Event split_huge_page..add_to_swap 266 us count 1 Event shrink_inactive_list..shrink_zone 85 us count 1 Event save_args..call_softirq 36 us count 2 Event __wake_up..__wake_up 1 us count 1 [akpm@linux-foundation.org: simplify with s/unlocked/locked/] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Arthur Marsh <arthur.marsh@internode.on.net> Cc: Clemens Ladisch <cladisch@googlemail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: compaction: minimise the time IRQs are disabled while isolating free pagesMel Gorman
compaction_alloc() isolates free pages to be used as migration targets. While its scanning, IRQs are disabled on the mistaken assumption the scanning should be short. Analysis showed that IRQs were in fact being disabled for substantial time. A simple test was run using large anonymous mappings with transparent hugepage support enabled to trigger frequent compactions. A monitor sampled what the worst IRQ-off latencies were and a post-processing tool found the following; Total sampled time IRQs off (not real total time): 22355 Event compaction_alloc..compaction_alloc 8409 us count 1 Event compaction_alloc..compaction_alloc 7341 us count 1 Event compaction_alloc..compaction_alloc 2463 us count 1 Event compaction_alloc..compaction_alloc 2054 us count 1 Event shrink_inactive_list..shrink_zone 1864 us count 1 Event shrink_inactive_list..shrink_zone 88 us count 1 Event save_args..call_softirq 36 us count 1 Event save_args..call_softirq 35 us count 2 Event __make_request..__blk_run_queue 24 us count 1 Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1 i.e. compaction is disabled IRQs for a prolonged period of time - 8ms in one instance. The full report generated by the tool can be found at http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report This patch reduces the time IRQs are disabled by simply disabling IRQs at the last possible minute. An updated IRQs-off summary report then looks like; Total sampled time IRQs off (not real total time): 5493 Event shrink_inactive_list..shrink_zone 1596 us count 1 Event shrink_inactive_list..shrink_zone 1530 us count 1 Event shrink_inactive_list..shrink_zone 956 us count 1 Event shrink_inactive_list..shrink_zone 541 us count 1 Event shrink_inactive_list..shrink_zone 531 us count 1 Event split_huge_page..add_to_swap 232 us count 1 Event save_args..call_softirq 36 us count 1 Event save_args..call_softirq 35 us count 2 Event __wake_up..__wake_up 1 us count 1 A full report is again available at http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report As should be obvious, IRQ disabled latencies due to compaction are almost elimimnated for this particular test. [aarcange@redhat.com: Fix initialisation of isolated] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Arthur Marsh <arthur.marsh@internode.on.net> Cc: Clemens Ladisch <cladisch@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: don't return 0 too early from find_get_pages()Hugh Dickins
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably truncate_inode_pages_range() - stop looking further when it returns 0. But if an interrupt comes just after its radix_tree_gang_lookup_slot(), especially if we have preemptible RCU enabled, isn't it conceivable that all 14 pages returned could be removed from the page cache by shrink_page_list(), before find_get_pages() gets to process them? So causing it to return 0 although there may be plenty more pages beyond. Make find_get_pages() and find_get_pages_tag() check for this unlikely case, and restart should it occur; but callers of find_get_pages_contig() have no such expectation, it's okay for that to return 0 early. I have not seen this in practice, just worried by the possibility. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: remove worrying dead code from find_get_pages()Hugh Dickins
The radix_tree_deref_retry() case in find_get_pages() has a strange little excrescence, not seen in the other gang lookups: it looks like the start of an abandoned attempt to guarantee forward progress in a case that cannot arise. ret should always be 0 here: if it isn't, then going back to restart will leak references to pages already gotten. There used to be a comment saying nr_found is necessarily 1 here: that's not quite true, but the radix_tree_deref_retry() case is peculiar to the entry at index 0, when we race with it being moved out of the radix_tree root or back. Remove the worrisome two lines, add a brief comment here and in find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in find_get_pages() should it ever be seen elsewhere than at 0. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23hugetlbfs: correct handling of negative input to /proc/sys/vm/nr_hugepagesPetr Holasek
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it will cause the kernel to allocate as many hugepages as possible and to then update /proc/meminfo to reflect this. This changes the behavior so that the negative input will result in nr_hugepages value being unchanged. Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Anton Arapov <anton@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: vmscan: kswapd should not free an excessive number of pages when ↵Mel Gorman
balancing small zones When reclaiming for order-0 pages, kswapd requires that all zones be balanced. Each cycle through balance_pgdat() does background ageing on all zones if necessary and applies equal pressure on the inactive zone unless a lot of pages are free already. A "lot of free pages" is defined as a "balance gap" above the high watermark which is currently 7*high_watermark. Historically this was reasonable as min_free_kbytes was small. However, on systems using huge pages, it is recommended that min_free_kbytes is higher and it is tuned with hugeadm --set-recommended-min_free_kbytes. With the introduction of transparent huge page support, this recommended value is also applied. On X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M of memory to be free. The Normal zone is approximately 35000 pages so under even normal memory pressure such as copying a large file, it gets exhausted quickly. As it is getting exhausted, kswapd applies pressure equally to all zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with a high watermark of around 23,000 pages. In this situation, kswapd will reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 * high watermark) pages or 718M of pages before the zone is ignored. What the user sees is that free memory far higher than it should be. To avoid an excessive number of pages being reclaimed from the larger zones, explicitely defines the "balance gap" to be either 1% of the zone or the low watermark for the zone, whichever is smaller. While kswapd will check all zones to apply pressure, it'll ignore zones that meets the (high_wmark + balance_gap) watermark. To test this, 80G were copied from a partition and the amount of memory being used was recorded. A comparison of a patch and unpatched kernel can be seen at http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps and shows that kswapd is not reclaiming as much memory with the patch applied. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mempolicy: remove redundant check in __mpol_equal()Namhyung Kim
The 'flags' field is already checked, no need to do it again. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23smaps: have smaps show transparent huge pagesDave Hansen
Now that the mere act of _looking_ at /proc/$pid/smaps will not destroy transparent huge pages, tell how much of the VMA is actually mapped with them. This way, we can make sure that we're getting THPs where we expect to see them. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23smaps: teach smaps_pte_range() about THP pmdsDave Hansen
This adds code to explicitly detect and handle pmd_trans_huge() pmds. It then passes HPAGE_SIZE units in to the smap_pte_entry() function instead of PAGE_SIZE. This means that using /proc/$pid/smaps now will no longer cause THPs to be broken down in to small pages. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23smaps: pass pte size argument in to smaps_pte_entry()Dave Hansen
Add an argument to the new smaps_pte_entry() function to let it account in things other than PAGE_SIZE units. I changed all of the PAGE_SIZE sites, even though not all of them can be reached for transparent huge pages, just so this will continue to work without changes as THPs are improved. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23smaps: break out smaps_pte_entry() from smaps_pte_range()Dave Hansen
We will use smaps_pte_entry() in a moment to handle both small and transparent large pages. But, we must break it out of smaps_pte_range() first. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23pagewalk: only split huge pages when necessaryDave Hansen
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will unconditionally split any transparent huge pages it runs in to. In practice, that means that anyone doing a cat /proc/$pid/smaps will unconditionally break down every huge page in the process and depend on khugepaged to re-collapse it later. This is fairly suboptimal. This patch changes that behavior. It teaches each ->pmd_entry handler (there are five) that they must break down the THPs themselves. Also, the _generic_ code will never break down a THP unless a ->pte_entry handler is actually set. This means that the ->pmd_entry handlers can now choose to deal with THPs without breaking them down. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: reclaim invalidated page ASAPMinchan Kim
invalidate_mapping_pages is very big hint to reclaimer. It means user doesn't want to use the page any more. So in order to prevent working set page eviction, this patch move the page into tail of inactive list by PG_reclaim. Please, remember that pages in inactive list are working set as well as active list. If we don't move pages into inactive list's tail, pages near by tail of inactive list can be evicted although we have a big clue about useless pages. It's totally bad. Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for preventing fast reclaiming readahead marker page. In this series, PG_reclaim is used by invalidated page, too. If VM find the page is invalidated and it's dirty, it sets PG_reclaim to reclaim asap. Then, when the dirty page will be writeback, clear_page_dirty_for_io will clear PG_reclaim unconditionally. It disturbs this serie's goal. I think it's okay to clear PG_readahead when the page is dirty, not writeback time. So this patch moves ClearPageReadahead. In v4, ClearPageReadahead in set_page_dirty has a problem which is reported by Steven Barrett. It's due to compound page. Some driver(ex, audio) calls set_page_dirty with compound page which isn't on LRU. but my patch does ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it breaks PageTail flag. I think it doesn't affect THP and pass my test with THP enabling but Cced Andrea for double check. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Steven Barrett <damentz@liquorix.net> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23memcg: move memcg reclaimable page into tail of inactive listMinchan Kim
The rotate_reclaimable_page function moves just written out pages, which the VM wanted to reclaim, to the end of the inactive list. That way the VM will find those pages first next time it needs to free memory. This patch applies the rule in memcg. It can help to prevent unnecessary working page eviction of memcg. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: deactivate invalidated pagesMinchan Kim
Recently, there are reported problem about thrashing. (http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup workloads(ex, nightly rsync). That's because the workload makes just use-once pages and touches pages twice. It promotes the page into active list so that it results in working set page eviction. Some app developer want to support POSIX_FADV_NOREUSE. But other OSes don't support it, either. (http://marc.info/?l=linux-mm&m=128928979512086&w=2) By other approach, app developers use POSIX_FADV_DONTNEED. But it has a problem. If kernel meets page is writing during invalidate_mapping_pages, it can't work. It makes for application programmer to use it since they always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could be discardable. At last, they can't use deferred write of kernel so that they could see performance loss. (http://insights.oetiker.ch/linux/fadvise.html) In fact, invalidation is very big hint to reclaimer. It means we don't use the page any more. So let's move the writing page into inactive list's head if we can't truncate it right now. Why I move page to head of lru on this patch, Dirty/Writeback page would be flushed sooner or later. It can prevent writeout of pageout which is less effective than flusher's writeout. Originally, I reused lru_demote of Peter with some change so added his Signed-off-by. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Ben Gamari <bgamari.foss@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: mm_struct: remove 16 bytes of alignment padding on 64 bit buildsRichard Kennedy
Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit builds. On my config this shrinks mm_struct by enough to fit in one fewer cache lines and allows more objects per slab in mm_struct kmem_cache under SLUB. slabinfo before patch :- Sizes (bytes) Slabs -------------------------------- Object : 848 Total : 9 SlabObj: 896 Full : 2 SlabSiz: 16384 Partial: 5 Loss : 48 CpuSlab: 2 Align : 64 Objects: 18 slabinfo after :- Sizes (bytes) Slabs -------------------------------- Object : 832 Total : 7 SlabObj: 832 Full : 2 SlabSiz: 16384 Partial: 3 Loss : 0 CpuSlab: 2 Align : 64 Objects: 19 Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: remove unused TestSetPageLocked() interfaceMichel Lespinasse
TestSetPageLocked() isn't being used anywhere. Also, using it would likely be an error, since the proper interface trylock_page() provides stronger ordering guarantees. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: simplify anon_vma refcountsPeter Zijlstra
This patch changes the anon_vma refcount to be 0 when the object is free. It does this by adding 1 ref to being in use in the anon_vma structure (iow. the anon_vma->head list is not empty). This allows a simpler release scheme without having to check both the refcount and the list as well as avoids taking a ref for each entry on the list. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: move anon_vma ref out from under CONFIG_fooPeter Zijlstra
We need the anon_vma refcount unconditionally to simplify the anon_vma lifetime rules. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: rename drop_anon_vma() to put_anon_vma()Peter Zijlstra
The normal code pattern used in the kernel is: get/put. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: debug-pagealloc: fix kconfig dependency warningAkinobu Mita
Fix kconfig dependency warning to satisfy dependencies: warning: (PAGE_POISONING) selects DEBUG_PAGEALLOC which has unmet direct dependencies (DEBUG_KERNEL && ARCH_SUPPORTS_DEBUG_PAGEALLOC && (!HIBERNATION || !PPC && !SPARC) && !KMEMCHECK) Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: batch-free pcp list if possibleNamhyung Kim
free_pcppages_bulk() frees pages from pcp lists in a round-robin fashion by keeping batch_free counter. But it doesn't need to spin if there is only one non-empty list. This can be checked by batch_free == MIGRATE_PCPTYPES. [akpm@linux-foundation.org: fix comment] Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: change __remove_from_page_cache()Minchan Kim
Now we renamed remove_from_page_cache with delete_from_page_cache. As consistency of __remove_from_swap_cache and remove_from_swap_cache, we change internal page cache handling function name, too. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: goodbye remove_from_page_cache()Minchan Kim
Now delete_from_page_cache() replaces remove_from_page_cache(). So we remove remove_from_page_cache so fs or something out of mainline will notice it when compile time and can fix it. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: truncate: change remove_from_page_cacheMinchan Kim
This patch series changes remove_from_page_cache()'s page ref counting rule. Page cache ref count is decreased in delete_from_page_cache(). So we don't need to decrease the page reference in callers. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: shmem: change remove_from_page_cacheMinchan Kim
This patch series changes remove_from_page_cache()'s page ref counting rule. Page cache ref count is decreased in delete_from_page_cache(). So we don't need to decrease the page reference in callers. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: hugetlbfs: change remove_from_page_cacheMinchan Kim
This patch series changes remove_from_page_cache()'s page ref counting rule. Page cache ref count is decreased in delete_from_page_cache(). So we don't need to decrease the page reference in callers. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: William Irwin <wli@holomorphy.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: introduce delete_from_page_cache()Minchan Kim
Presently we increase the page refcount in add_to_page_cache() but don't decrease it in remove_from_page_cache(). Such asymmetry adds confusion, requiring that callers notice it and a comment explaining why they release a page reference. It's not a good API. A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but gave up because reiser4's drop_page() had to unlock the page between removing it from page cache and doing the page_cache_release(). But now the situation is changed. I think at least things in current mainline don't have any obstacles. The problem is for out-of-mainline filesystems - if they have done such things as reiser4, this patch could be a problem but they will discover this at compile time since we remove remove_from_page_cache(). This patch: This function works as just wrapper remove_from_page_cache(). The difference is that it decreases page references in itself. So caller have to make sure it has a page reference before calling. This patch is ready for removing remove_from_page_cache(). Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23mm: add replace_page_cache_page() functionMiklos Szeredi
This function basically does: remove_from_page_cache(old); page_cache_release(old); add_to_page_cache_locked(new); Except it does this atomically, so there's no possibility for the "add" to fail because of a race. If memory cgroups are enabled, then the memory cgroup charge is also moved from the old page to the new. This function is currently used by fuse to move pages into the page cache on read, instead of copying the page contents. [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>