summaryrefslogtreecommitdiff
path: root/mm/page_alloc.c
AgeCommit message (Collapse)Author
2012-12-12mm: introduce putback_movable_pages()Rafael Aquini
The PATCH "mm: introduce compaction and migration for virtio ballooned pages" hacks around putback_lru_pages() in order to allow ballooned pages to be re-inserted on balloon page list as if a ballooned page was like a LRU page. As ballooned pages are not legitimate LRU pages, this patch introduces putback_movable_pages() to properly cope with cases where the isolated pageset contains ballooned pages and LRU pages, thus fixing the mentioned inelegant hack around putback_lru_pages(). Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12memory-hotplug: allocate zone's pcp before onlining pagesWen Congyang
We use __free_page() to put a page to buddy system when onlining pages. __free_page() will store NR_FREE_PAGES in zone's pcp.vm_stat_diff, so we should allocate zone's pcp before onlining pages, otherwise we will lose some free pages. [mhocko@suse.cz: make zone_pcp_reset independent of MEMORY_HOTREMOVE] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12memory-hotplug: fix NR_FREE_PAGES mismatchWen Congyang
NR_FREE_PAGES will be wrong after offlining pages. We add/dec NR_FREE_PAGES like this now: 1. move all pages in buddy system to MIGRATE_ISOLATE, and dec NR_FREE_PAGES 2. don't add NR_FREE_PAGES when it is freed and the migratetype is MIGRATE_ISOLATE 3. dec NR_FREE_PAGES when offlining isolated pages. 4. add NR_FREE_PAGES when undoing isolate pages. When we come to step 3, all pages are in MIGRATE_ISOLATE list, and NR_FREE_PAGES are right. When we come to step4, all pages are not in buddy system, so we don't change NR_FREE_PAGES in this step, but we change NR_FREE_PAGES in step3. So NR_FREE_PAGES is wrong after offlining pages. So there is no need to change NR_FREE_PAGES in step3. This patch also fixs a problem in step2: if the migratetype is MIGRATE_ISOLATE, we should not add NR_FRR_PAGES when we remove pages from pcppages. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jianguo Wu <wujianguo106@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12memory-hotplug: skip HWPoisoned page when offlining pagesWen Congyang
hwpoisoned may be set when we offline a page by the sysfs interface /sys/devices/system/memory/soft_offline_page or /sys/devices/system/memory/hard_offline_page. If we don't clear this flag when onlining pages, this page can't be freed, and will not in free list. So we can't offline these pages again. So we should skip such page when offlining pages. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILDKirill A. Shutemov
We don't need custom NUMA_BUILD anymore, since we have handy IS_ENABLED(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12mm: show migration types in show_memRabin Vincent
This is useful to diagnose the reason for page allocation failure for cases where there appear to be several free pages. Example, with this alloc_pages(GFP_ATOMIC) failure: swapper/0: page allocation failure: order:0, mode:0x0 ... Mem-info: Normal per-cpu: CPU 0: hi: 90, btch: 15 usd: 48 CPU 1: hi: 90, btch: 15 usd: 21 active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:84 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 free:4026 slab_reclaimable:75 slab_unreclaimable:484 mapped:0 shmem:0 pagetables:0 bounce:0 Normal free:16104kB min:2296kB low:2868kB high:3444kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:336kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:331776kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:300kB slab_unreclaimable:1936kB kernel_stack:328kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 Before the patch, it's hard (for me, at least) to say why all these free chunks weren't considered for allocation: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 3*4096kB = 16128kB After the patch, it's obvious that the reason is that all of these are in the MIGRATE_CMA (C) freelist: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 1*1024kB (C) 1*2048kB (C) 3*4096kB (C) = 16128kB Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: numa: Introduce last_nid to the page frameMel Gorman
This patch introduces a last_nid field to the page struct. This is used to build a two-stage filter in the next patch that is aimed at mitigating a problem whereby pages migrate to the wrong node when referenced by a process that was running off its home node. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Add a tracepoint for migrate_pagesMel Gorman
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-10Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damageLinus Torvalds
This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and d7c3b937bdf45f0b844400b7bf6fd3ed50bac604. This is a revert of a revert of a revert. In addition, it reverts the even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the original commits in linux-next. It turns out that the original patch really was bogus, and that the original revert was the correct thing to do after all. We thought we had fixed the problem, and then reverted the revert, but the problem really is fundamental: waking up kswapd simply isn't the right thing to do, and direct reclaim sometimes simply _is_ the right thing to do. When certain allocations fail, we simply should try some direct reclaim, and if that fails, fail the allocation. That's the right thing to do for THP allocations, which can easily fail, and the GPU allocations want to do that too. So starting kswapd is sometimes simply wrong, and removing the flag that said "don't start kswapd" was a mistake. Let's hope we never revisit this mistake again - and certainly not this many times ;) Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-10Revert "mm: avoid waking kswapd for THP allocations when compaction is ↵Linus Torvalds
deferred or contended" This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0. We are going to reinstate the __GFP_NO_KSWAPD flag that has been removed, the removal reverted, and then removed again. Making this commit a pointless fixup for a problem that was caused by the removal of __GFP_NO_KSWAPD flag. The thing is, we really don't want to wake up kswapd for THP allocations (because they fail quite commonly under any kind of memory pressure, including when there is tons of memory free), and these patches were just trying to fix up the underlying bug: the original removal of __GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD") was simply bogus. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06propagate name change to comments in kernel sourceNadia Yvette Chambers
I've legally changed my name with New York State, the US Social Security Administration, et al. This patch propagates the name change and change in initials and login to comments in the kernel source as well. Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-11-30mm: avoid waking kswapd for THP allocations when compaction is deferred or ↵Mel Gorman
contended With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" reverted, Zdenek Kabelac reported the following Hmm, so it's just took longer to hit the problem and observe kswapd0 spinning on my CPU again - it's not as endless like before - but still it easily eats minutes - it helps to turn off Firefox or TB (memory hungry apps) so kswapd0 stops soon - and restart those apps again. (And I still have like >1GB of cached memory) kswapd0 R running task 0 30 2 0x00000000 Call Trace: preempt_schedule+0x42/0x60 _raw_spin_unlock+0x55/0x60 put_super+0x31/0x40 drop_super+0x22/0x30 prune_super+0x149/0x1b0 shrink_slab+0xba/0x510 The sysrq+m indicates the system has no swap so it'll never reclaim anonymous pages as part of reclaim/compaction. That is one part of the problem but not the root cause as file-backed pages could also be reclaimed. The likely underlying problem is that kswapd is woken up or kept awake for each THP allocation request in the page allocator slow path. If compaction fails for the requesting process then compaction will be deferred for a time and direct reclaim is avoided. However, if there are a storm of THP requests that are simply rejected, it will still be the the case that kswapd is awake for a prolonged period of time as pgdat->kswapd_max_order is updated each time. This is noticed by the main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead it will loopp, shrinking a small number of pages and calling shrink_slab() on each iteration. This patch defers when kswapd gets woken up for THP allocations. For !THP allocations, kswapd is always woken up. For THP allocations, kswapd is woken up iff the process is willing to enter into direct reclaim/compaction. [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30revert "Revert "mm: remove __GFP_NO_KSWAPD""Andrew Morton
It apepars that this patch was innocent, and we hope that "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" will fix the final kswapd-spinning cause. Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30mm: compaction: fix return value of capture_free_page()Mel Gorman
Commit ef6c5be658f6 ("fix incorrect NR_FREE_PAGES accounting (appears like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the return value which was also missed by this reviewer until today. That return value is used by compaction when adding pages to a list of isolated free pages and without this follow-up fix, there is a risk of free list corruption. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-27Revert "mm: remove __GFP_NO_KSWAPD"Mel Gorman
With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" reverted, Zdenek Kabelac reported the following Hmm, so it's just took longer to hit the problem and observe kswapd0 spinning on my CPU again - it's not as endless like before - but still it easily eats minutes - it helps to turn off Firefox or TB (memory hungry apps) so kswapd0 stops soon - and restart those apps again. (And I still have like >1GB of cached memory) kswapd0 R running task 0 30 2 0x00000000 Call Trace: preempt_schedule+0x42/0x60 _raw_spin_unlock+0x55/0x60 put_super+0x31/0x40 drop_super+0x22/0x30 prune_super+0x149/0x1b0 shrink_slab+0xba/0x510 The sysrq+m indicates the system has no swap so it'll never reclaim anonymous pages as part of reclaim/compaction. That is one part of the problem but not the root cause as file-backed pages could also be reclaimed. The likely underlying problem is that kswapd is woken up or kept awake for each THP allocation request in the page allocator slow path. If compaction fails for the requesting process then compaction will be deferred for a time and direct reclaim is avoided. However, if there are a storm of THP requests that are simply rejected, it will still be the the case that kswapd is awake for a prolonged period of time as pgdat->kswapd_max_order is updated each time. This is noticed by the main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead it will loopp, shrinking a small number of pages and calling shrink_slab() on each iteration. The temptation is to supply a patch that checks if kswapd was woken for THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not backed up by proper testing. As 3.7 is very close to release and this is not a bug we should release with, a safer path is to revert "mm: remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the balance_pgdat() logic in general. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-21fix incorrect NR_FREE_PAGES accounting (appears like memory leak)Dave Hansen
There have been some 3.7-rc reports of vm issues, including some kswapd bugs and, more importantly, some memory "leaks": http://www.spinics.net/lists/linux-mm/msg46187.html https://bugzilla.kernel.org/show_bug.cgi?id=50181 Commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") took split_free_page() and reused it for the compaction code. It does something curious with capture_free_page() (previously known as split_free_page()): int capture_free_page(struct page *page, int alloc_order, ... __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order)); - /* Split into individual pages */ - set_page_refcounted(page); - split_page(page, order); + if (alloc_order != order) + expand(zone, page, alloc_order, order, + &zone->free_area[order], migratetype); Note that expand() puts the pages _back_ in the allocator, but it does not bump NR_FREE_PAGES. We "return" 'alloc_order' worth of pages, but we accounted for removing 'order' in the __mod_zone_page_state() call. For the old split_page()-style use (order==alloc_order) the bug will not trigger. But, when called from the compaction code where we occasionally get a larger page out of the buddy allocator than we need, we will run in to this. This patch simply changes the NR_FREE_PAGES manipulation to the correct 'alloc_order' instead of 'order'. I've been able to repeatedly trigger this in my testing environment. The amount "leaked" very closely tracks the imbalance I see in buddy pages vs. NR_FREE_PAGES. I have confirmed that this patch fixes the imbalance Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16revert "mm: fix-up zone present pages"Andrew Morton
Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix a issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages becomes zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16memcg: fix hotplugged memory zone oopsHugh Dickins
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distantDavid Rientjes
Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim distance") caused zone_reclaim_mode to be set for all systems where two nodes are within RECLAIM_DISTANCE of each other. This is the opposite of what we actually want: zone_reclaim_mode should be set if two nodes are sufficiently distant. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Julian Wollrath <jwollrath@web.de> Tested-by: Julian Wollrath <jwollrath@web.de> Cc: Hugh Dickins <hughd@google.com> Cc: Patrik Kullman <patrik.kullman@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25mm/page_alloc.c:alloc_contig_range(): return early for err pathBob Liu
If start_isolate_page_range() failed, unset_migratetype_isolate() has been done inside it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09cma: decrease cc.nr_migratepages after reclaiming pagelistMinchan Kim
reclaim_clean_pages_from_list() reclaims clean pages before migration so cc.nr_migratepages should be updated. Currently, there is no problem but it can be wrong if we try to use the value in future. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09CMA: migrate mlocked pagesMinchan Kim
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: fix zone stat mismatchMinchan Kim
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaim any pages due to too_many_isolated()==true and loops forever. The cause is that when we do memory-hotadd after memory-remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still have values. In addtion, when we offline all pages of the zone, we reset them in zone_pcp_reset without draining so we loss some zone stat item. Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, numa: reclaim from all nodes within reclaim distanceDavid Rientjes
RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove free_page_mlockHugh Dickins
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be checking it, reporting "BUG: Bad page state" if it's ever found set. Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: fix-up zone present pagesJianguo Wu
I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reported-by: Petr Tesarik <ptesarik@suse.cz> Tested-by: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm/page_alloc: refactor out __alloc_contig_migrate_alloc()Minchan Kim
__alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor it out (move + rename as a common name) into page_isolation.c. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: clear PG_migrate_skip based on compaction and reclaim activityMel Gorman
Compaction caches if a pageblock was scanned and no pages were isolated so that the pageblocks can be skipped in the future to reduce scanning. This information is not cleared by the page allocator based on activity due to the impact it would have to the page allocator fast paths. Hence there is a requirement that something clear the cache or pageblocks will be skipped forever. Currently the cache is cleared if there were a number of recent allocation failures and it has not been cleared within the last 5 seconds. Time-based decisions like this are terrible as they have no relationship to VM activity and is basically a big hammer. Unfortunately, accurate heuristics would add cost to some hot paths so this patch implements a rough heuristic. There are two cases where the cache is cleared. 1. If a !kswapd process completes a compaction cycle (migrate and free scanner meet), the zone is marked compact_blockskip_flush. When kswapd goes to sleep, it will clear the cache. This is expected to be the common case where the cache is cleared. It does not really matter if kswapd happens to be asleep or going to sleep when the flag is set as it will be woken on the next allocation request. 2. If there have been multiple failures recently and compaction just finished being deferred then a process will clear the cache and start a full scan. This situation happens if there are multiple high-order allocation requests under heavy memory pressure. The clearing of the PG_migrate_skip bits and other scans is inherently racy but the race is harmless. For allocations that can fail such as THP, they will simply fail. For requests that cannot fail, they will retry the allocation. Tests indicated that scanning rates were roughly similar to when the time-based heuristic was used and the allocation success rates were similar. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: cache if a pageblock was scanned and no pages were isolatedMel Gorman
When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09revert "mm: have order > 0 compaction start off where it left"Mel Gorman
This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start off where it left") and commit de74f1cc ("mm: have order > 0 compaction start near a pageblock with free pages"). These patches were a good idea and tests confirmed that they massively reduced the amount of scanning but the implementation is complex and tricky to understand. A later patch will cache what pageblocks should be skipped and reimplements the concept of compact_cached_free_pfn on top for both migration and free scanners. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm/memblock: cleanup early_node_map[] related commentsWanpeng Li
Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed early_node_map[]. Clean up the comments to comply with that change. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: fix pages missed by race rather than failingMinchan Kim
If race between allocation and isolation in memory-hotplug offline happens, some pages could be in MIGRATE_MOVABLE of free_list although the pageblock's migratetype of the page is MIGRATE_ISOLATE. The race could be detected by get_freepage_migratetype in __test_page_isolated_in_pageblock. If it is detected, now EBUSY gets bubbled all the way up and the hotplug operations fails. But better idea is instead of returning and failing memory-hotremove, move the free page to the correct list at the time it is detected. It could enhance memory-hotremove operation success ratio although the race is really rare. Suggested by Mel Gorman. [akpm@linux-foundation.org: small cleanup] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remain migratetype in freed pageMinchan Kim
The page allocator caches the pageblock information in page->private while it is in the PCP freelists but this is overwritten with the order of the page when freed to the buddy allocator. This patch stores the migratetype of the page in the page->index field so that it is available at all times when the page remain in free_list. This patch adds a new call site in __free_pages_ok so it might be overhead a bit but it's for high order allocation. So I believe damage isn't hurt. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: page_alloc: use get_freepage_migratetype() instead of page_private()Minchan Kim
The page allocator uses set_page_private and page_private for handling migratetype when it frees page. Let's replace them with [set|get] _freepage_migratetype to make it more clear. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09cma: fix watermark checkingBartlomiej Zolnierkiewicz
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok() (from Minchan Kim). * During watermark check decrease available free pages number by free CMA pages number if necessary (unmovable allocations cannot use pages from CMA areas). Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09cma: count free CMA pagesBartlomiej Zolnierkiewicz
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [akpm@linux-foundation.org: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09cma: fix counting of isolated pagesBartlomiej Zolnierkiewicz
Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix it by properly decreasing/increasing NR_FREE_PAGES counter in set_migratetype_isolate()/unset_migratetype_isolate() and removing counter adjustment for isolated pages from free_one_page() and split_free_page(). Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: fix tracing in free_pcppages_bulk()Bartlomiej Zolnierkiewicz
page->private gets re-used in __free_one_page() to store page order (so trace_mm_page_pcpu_drain() may print order instead of migratetype) thus migratetype value must be cached locally. Fixes regression introduced in commit a7016235a61d ("mm: fix migratetype bug which slowed swapping"). This caused incorrect data to be attached to the mm_page_pcpu_drain trace event. [akpm@linux-foundation.org: add comment] Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: cma: discard clean pages during contiguous allocation instead of migrationMinchan Kim
Drop clean cache pages instead of migration during alloc_contig_range() to minimise allocation latency by reducing the amount of migration that is necessary. It's useful for CMA because latency of migration is more important than evicting the background process's working set. In addition, as pages are reclaimed then fewer free pages for migration targets are required so it avoids memory reclaiming to get free pages, which is a contributory factor to increased latency. I measured elapsed time of __alloc_contig_migrate_range() which migrates 10M in 40M movable zone in QEMU machine. Before - 146ms, After - 7ms [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Mel Gorman <mgorman@suse.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rik van Riel <riel@redhat.com> Tested-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: capture a suitable high-order page immediately when it is ↵Mel Gorman
made available While compaction is migrating pages to free up large contiguous blocks for allocation it races with other allocation requests that may steal these blocks or break them up. This patch alters direct compaction to capture a suitable free page as soon as it becomes available to reduce this race. It uses similar logic to split_free_page() to ensure that watermarks are still obeyed. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove __GFP_NO_KSWAPDRik van Riel
When transparent huge pages were introduced, memory compaction and swap storms were an issue, and the kernel had to be careful to not make THP allocations cause pageout or compaction. Now that we have working compaction deferral, kswapd is smart enough to invoke compaction and the quadratic behaviour around isolate_free_pages has been fixed, it should be safe to remove __GFP_NO_KSWAPD. [minchan@kernel.org: Comment fix] [mgorman@suse.de: Avoid direct reclaim for deferred compaction] Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17mm/page_alloc: fix the page address of higher page's buddy calculationLi Haifeng
The heuristic method for buddy has been introduced since commit 43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists"). But the page address of higher page's buddy was wrongly calculated, which will lead page_is_buddy to fail for ever. IOW, the heuristic method would be disabled with the wrong page address of higher page's buddy. Calculating the page address of higher page's buddy should be based higher_page with the offset between index of higher page and index of higher page's buddy. Signed-off-by: Haifeng Li <omycle@gmail.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KyongHo Cho <pullip.cho@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21mm: compaction: Abort async compaction if locks are contended or taking too longMel Gorman
Jim Schutt reported a problem that pointed at compaction contending heavily on locks. The workload is straight-forward and in his own words; The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers. Early in the test everything looks fine procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0 28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0 6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0 22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0 and then it goes to pot procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0 207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0 123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0 123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0 622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0 223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0 Note that system CPU usage is very high blocks being written out has dropped by 42%. He analysed this with perf and found perf record -g -a sleep 10 perf report --sort symbol --call-graph fractal,5 34.63% [k] _raw_spin_lock_irqsave | |--97.30%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | compact_zone_order | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--87.39%-- skb_copy_datagram_iovec | | tcp_recvmsg | | inet_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call | | __recv | | | | | --100.00%-- (nil) | | | --12.61%-- memcpy --2.70%-- [...] There was other data but primarily it is all showing that compaction is contended heavily on the zone->lock and zone->lru_lock. commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled while isolating pages for migration] noted that it was possible for migration to hold the lru_lock for an excessive amount of time. Very broadly speaking this patch expands the concept. This patch introduces compact_checklock_irqsave() to check if a lock is contended or the process needs to be scheduled. If either condition is true then async compaction is aborted and the caller is informed. The page allocator will fail a THP allocation if compaction failed due to contention. This patch also introduces compact_trylock_irqsave() which will acquire the lock only if it is not contended and the process does not need to schedule. Reported-by: Jim Schutt <jaschut@sandia.gov> Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21mm: correct page->pfmemalloc to fix deactivate_slab regressionAlex Shi
Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc setting, but it missed some places the pfmemalloc should be set. So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS cause incorrect deactivate_slab() on our core2 server: 64.73% fio [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |---0.34%-- deactivate_slab | __slab_alloc | kmem_cache_alloc | | That causes our fio sync write performance to have a 40% regression. Move the checking in get_page_from_freelist() which resolves this issue. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-02mm: remove node_start_pfn checking in new WARN_ON for nowLinus Torvalds
Borislav Petkov reports that the new warning added in commit 88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero") triggers for him, and it is the node_start_pfn field that has already been initialized once. The call trace looks like this: x86_64_start_kernel -> x86_64_start_reservations -> start_kernel -> setup_arch -> paging_init -> zone_sizes_init -> free_area_init_nodes -> free_area_init_node and (with the warning replaced by debug output), Borislav sees On node 0 totalpages: 4193848 DMA zone: 64 pages used for memmap DMA zone: 6 pages reserved DMA zone: 3890 pages, LIFO batch:0 DMA32 zone: 16320 pages used for memmap DMA32 zone: 798464 pages, LIFO batch:31 Normal zone: 52736 pages used for memmap Normal zone: 3322368 pages, LIFO batch:31 free_area_init_node: pgdat->node_start_pfn: 4423680 <---- On node 1 totalpages: 4194304 Normal zone: 65536 pages used for memmap Normal zone: 4128768 pages, LIFO batch:31 free_area_init_node: pgdat->node_start_pfn: 8617984 <---- On node 2 totalpages: 4194304 Normal zone: 65536 pages used for memmap Normal zone: 4128768 pages, LIFO batch:31 free_area_init_node: pgdat->node_start_pfn: 12812288 <---- On node 3 totalpages: 4194304 Normal zone: 65536 pages used for memmap Normal zone: 4128768 pages, LIFO batch:31 so remove the bogus warning for now to avoid annoying people. Minchan Kim is looking at it. Reported-by: Borislav Petkov <bp@amd64.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: remove redundant initializationMinchan Kim
pg_data_t is zeroed before reaching free_area_init_core(), so remove the now unnecessary initializations. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: warn if pg_data_t isn't initialized with zeroMinchan Kim
Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero when it is allocated. Arch code and memory hotplug already initiailize pg_data_t. So this warning should never happen. I select fields randomly near the beginning, middle and end of pg_data_t for checking. This patch isn't for performance but for removing initialization code which is necessary to add whenever we adds new field to pg_data_t or zone. Firstly, Andrew suggested clearing out of pg_data_t in MM core part but Tejun doesn't like it because in the future, some archs can initialize some fields in arch code and pass them into general MM part so blindly clearing it out in mm core part would be very annoying. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is ↵Mel Gorman
backed by network storage If swap is backed by network storage such as NBD, there is a risk that a large number of reclaimers can hang the system by consuming all PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune min_free_kbytes in advance which is a bit fragile. This patch throttles direct reclaimers if half the PF_MEMALLOC reserves are in use. If the system is routinely getting throttled the system administrator can increase min_free_kbytes so degradation is smoother but the system will keep running. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: ignore mempolicies when using ALLOC_NO_WATERMARKMel Gorman
The reserve is proportionally distributed over all !highmem zones in the system. So we need to allow an emergency allocation access to all zones. In order to do that we need to break out of any mempolicy boundaries we might have. In my opinion that does not break mempolicies as those are user oriented and not system oriented. That is, system allocations are not guaranteed to be within mempolicy boundaries. For instance IRQs do not even have a mempolicy. So breaking out of mempolicy boundaries for 'rare' emergency allocations, which are always system allocations (as opposed to user) is ok. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>