path: root/mm/memcontrol.c
Age  Commit message  Author
2012-03-22  memcg: fix performance of mem_cgroup_begin_update_page_stat()  KAMEZAWA Hiroyuki
mem_cgroup_begin_update_page_stat() should be very fast because it's called very frequently. Now, it needs to look up page_cgroup and its memcg... this is slow. This patch adds a global variable to check "is any memcg moving or not". With this, the caller doesn't need to visit page_cgroup and memcg. Here is a test result. A test program makes page faults onto a file, MAP_SHARED, makes each page's page_mapcount(page) > 1, frees the range by madvise() and then faults the pages in again. This program causes 26214400 page faults onto a 1G file and shows the cost of mem_cgroup_begin_update_page_stat().

Before this patch:

    [kamezawa@bluextal test]$ time ./mmap 1G
    real 0m21.765s  user 0m5.999s  sys 0m15.434s

    27.46% mmap mmap              [.] reader
    21.15% mmap [kernel.kallsyms] [k] page_fault
     9.17% mmap [kernel.kallsyms] [k] filemap_fault
     2.96% mmap [kernel.kallsyms] [k] __do_fault
     2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat

After this patch:

    [root@bluextal test]# time ./mmap 1G
    real 0m21.373s  user 0m6.113s  sys 0m15.016s

In the usual path, calls to __mem_cgroup_begin_update_page_stat() go away. Note: we may be able to remove this optimization in the future if we can get a pointer to the memcg directly from struct page. [akpm@linux-foundation.org: don't return a void] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
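[Editor's sketch] The fast path described above amounts to checking one global counter before doing any per-page work. Below is a minimal userspace model of that idea using C11 atomics; the names (memcg_moving, begin_update_page_stat, etc.) are illustrative stand-ins, not the actual memcontrol.c interfaces.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Global count of in-flight "move charge" operations (assumption:
     * models the idea of the patch, not the real kernel variable). */
    static atomic_int memcg_moving;

    /* Slow-path stand-in: in the kernel this would look up page_cgroup,
     * take the per-memcg move lock, and so on. */
    static void slow_begin_update(void) { /* lock per-memcg state here */ }

    static bool begin_update_page_stat(void)
    {
        /* Fast path: nobody is moving accounts, skip the lookup/lock. */
        if (atomic_load_explicit(&memcg_moving, memory_order_acquire) == 0)
            return false;            /* caller holds no lock */
        slow_begin_update();
        return true;                 /* caller must call the "end" side */
    }

    static void start_move(void)  { atomic_fetch_add(&memcg_moving, 1); }
    static void finish_move(void) { atomic_fetch_sub(&memcg_moving, 1); }

    int main(void)
    {
        bool locked = begin_update_page_stat();
        printf("took slow path: %d\n", locked);   /* 0: fast path */
        start_move();
        locked = begin_update_page_stat();
        printf("took slow path: %d\n", locked);   /* 1: slow path */
        finish_move();
        return 0;
    }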
2012-03-22  memcg: remove PCG_FILE_MAPPED  KAMEZAWA Hiroyuki
With the new lock scheme for updating memcg's page stat, we don't need the PCG_FILE_MAPPED flag, which duplicated the information of page_mapped(). [hughd@google.com: cosmetic fix] [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: use new logic for page stat accounting  KAMEZAWA Hiroyuki
Now, per-memcg page stats are recorded into per-page_cgroup flags by duplicating the page's status into the flag. The reason is that memcg has a feature to move a page from one group to another and we have a race between "move" and "page stat accounting". Under the current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B does "page stat accounting". When CPU-A goes first:

    CPU-A                                CPU-B
                                         update "struct page" info.
    move_lock_mem_cgroup(memcg)
    see pc->flags
    copy page stat to new group
    overwrite pc->mem_cgroup.
    move_unlock_mem_cgroup(memcg)
                                         move_lock_mem_cgroup(mem)
                                         set pc->flags
                                         update page stat accounting
                                         move_unlock_mem_cgroup(mem)

stat accounting is guarded by move_lock_mem_cgroup() and the "move" logic (CPU-A) doesn't see changes in "struct page" information. But it's costly to have the same information both in 'struct page' and 'struct page_cgroup'. And there is a potential problem.

For example, assume we have PG_dirty accounting in memcg. PG_... is a flag for struct page. PCG_... is a flag for struct page_cgroup. (This is just an example. The same problem can be found in any kind of page stat accounting.)

    CPU-A                                CPU-B
    TestSet PG_dirty
    (delay)                              TestClear PG_dirty
                                         if (TestClear(PCG_dirty))
                                             memcg->nr_dirty--
    if (TestSet(PCG_dirty))
        memcg->nr_dirty++

Here, memcg->nr_dirty = +1, which is wrong. This race was reported by Greg Thelen <gthelen@google.com>. Now, only FILE_MAPPED is supported and, fortunately, it's serialized by the page table lock, so this is not a real bug, _now_.

If this potential problem is caused by having duplicated information in struct page and struct page_cgroup, we may be able to fix it by using the original 'struct page' information. But then we'll have a problem in "move account". Assume we use only PG_dirty:

    CPU-A                                CPU-B
    TestSet PG_dirty
    (delay)                              move_lock_mem_cgroup()
                                         if (PageDirty(page))
                                             new_memcg->nr_dirty++
                                         pc->mem_cgroup = new_memcg;
                                         move_unlock_mem_cgroup()
    move_lock_mem_cgroup()
    memcg = pc->mem_cgroup
    new_memcg->nr_dirty++

Accounting information may be double-counted. This was the original reason to have the PCG_xxx flags, but it seems PCG_xxx has another problem. I think we need a bigger lock:

    move_lock_mem_cgroup(page)
    TestSetPageDirty(page)
    update page stats (without any checks)
    move_unlock_mem_cgroup(page)

This fixes both of the problems and we don't have to duplicate page flags into page_cgroup. Please note: move_lock_mem_cgroup() is held only when there is a possibility of "account move" in the system. So, in most paths, status updates will go without atomic locks.

This patch introduces mem_cgroup_begin_update_page_stat() and mem_cgroup_end_update_page_stat(); both should be called when modifying 'struct page' information if memcg takes care of it, as:

    mem_cgroup_begin_update_page_stat()
    modify page information
    mem_cgroup_update_page_stat()
    => never check any 'struct page' info, just update counters.
    mem_cgroup_end_update_page_stat()

This patch is slow because we need to call begin_update_page_stat()/end_update_page_stat() regardless of whether the accounted state will be changed or not. A following patch adds an easy optimization and reduces the cost.
[akpm@linux-foundation.org: s/lock/locked/] [hughd@google.com: fix deadlock by avoiding stat lock when anon] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
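[Editor's sketch] A compact userspace model of the locking protocol argued for in the commit above: the page-stat updater and the account mover take the same "move lock" around, respectively, {flag update + counter update} and {counter transfer + owner switch}, so the TestSet/TestClear double count cannot happen. The struct and function names below are invented for the illustration; only the locking pattern reflects the commit.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct group { long nr_dirty; };

    struct page_state {
        pthread_mutex_t move_lock;   /* stands in for move_lock_mem_cgroup() */
        struct group *owner;         /* stands in for pc->mem_cgroup */
        bool dirty;                  /* stands in for the PG_dirty bit */
    };

    /* Updater side: flag and counter are modified under one lock. */
    static void set_page_dirty(struct page_state *ps)
    {
        pthread_mutex_lock(&ps->move_lock);
        if (!ps->dirty) {
            ps->dirty = true;
            ps->owner->nr_dirty++;
        }
        pthread_mutex_unlock(&ps->move_lock);
    }

    /* Mover side: transfer the stat and switch the owner under the same lock. */
    static void move_account(struct page_state *ps, struct group *to)
    {
        pthread_mutex_lock(&ps->move_lock);
        if (ps->dirty) {
            ps->owner->nr_dirty--;
            to->nr_dirty++;
        }
        ps->owner = to;
        pthread_mutex_unlock(&ps->move_lock);
    }

    int main(void)
    {
        struct group a = {0}, b = {0};
        struct page_state ps = { PTHREAD_MUTEX_INITIALIZER, &a, false };
        set_page_dirty(&ps);
        move_account(&ps, &b);
        printf("a.nr_dirty=%ld b.nr_dirty=%ld\n", a.nr_dirty, b.nr_dirty); /* 0 1 */
        return 0;
    }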
2012-03-22  memcg: remove PCG_MOVE_LOCK flag from page_cgroup  KAMEZAWA Hiroyuki
PCG_MOVE_LOCK is used as a bit spinlock to avoid a race between overwriting pc->mem_cgroup and per-memcg page statistics accounting. This lock helps to avoid the race, but the race is very rare because moving tasks between cgroups is not a usual job. So it seems using one bit per page is too costly. This patch changes this lock to a per-memcg spinlock and removes PCG_MOVE_LOCK. If a finer-grained lock is required, we'll be able to add some hashing, but I'd like to start from this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: simplify move_account() check  KAMEZAWA Hiroyuki
In memcg, for avoiding take-lock-irq-off at accessing page_cgroup, a logic, flag + rcu_read_lock(), is used. This works as follows:

    CPU-A                                CPU-B
    rcu_read_lock()
                                         set flag
    if (flag is set)
        take heavy lock
    do job.
                                         synchronize_rcu()
    rcu_read_unlock()
                                         take heavy lock.

In a recent discussion, it was argued that using a per-cpu value for this flag just complicates the code because 'set flag' is very rare. This patch changes the 'flag' implementation from percpu to atomic_t. This will be much simpler. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
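[Editor's sketch] The percpu-to-atomic_t simplification trades a sum over per-CPU slots for a single load. A hedged userspace model (the names and the absence of the kernel's RCU pairing are assumptions of the example, not the real code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Old scheme (modelled): one counter slot per CPU; a reader has to
     * sum all slots to learn whether any move is in progress. */
    static long moving_percpu[NR_CPUS];

    static bool is_moving_percpu(void)
    {
        long sum = 0;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            sum += moving_percpu[cpu];
        return sum > 0;
    }

    /* New scheme (modelled): a single atomic counter, one load to check. */
    static atomic_long moving;

    static bool is_moving_atomic(void)
    {
        return atomic_load(&moving) > 0;
    }

    int main(void)
    {
        moving_percpu[3] = 1;
        atomic_store(&moving, 1);
        printf("percpu: %d atomic: %d\n", is_moving_percpu(), is_moving_atomic());
        return 0;
    }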
2012-03-22  memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)  KAMEZAWA Hiroyuki
As described in the log, I guess EXPORT was for preparing dirty accounting. But _now_, we don't need to export this. Remove this for now. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: remove PCG_CACHE page_cgroup flag  KAMEZAWA Hiroyuki
We record 'the page is cache' with the PCG_CACHE bit in page_cgroup. Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't include shmem. Considering callers, at charge/uncharge, the caller should know what the page is and we don't need to record it by using one bit per page. This patch removes the PCG_CACHE bit and makes callers of mem_cgroup_charge_statistics() specify what the page is. About page migration: Mapping of the used page is not touched during migration (see page_remove_rmap) so we can rely on it and push the correct charge type down to __mem_cgroup_uncharge_common from end_migration for the unused page. The force flag was misleading: it was abused for skipping the needless page_mapped() / PageCgroupMigration() check, as we know the unused page is no longer mapped and cleared the migration flag just a few lines up. But doing the checks is no biggie and it's not worth adding another flag just to skip them. [akpm@linux-foundation.org: checkpatch fixes] [hughd@google.com: fix PageAnon uncharging] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
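[Editor's sketch] The interface change here is "have the caller state what it is charging instead of re-reading a per-page flag". A small userspace illustration with invented names (not the kernel's mem_cgroup_charge_statistics()):

    #include <stdio.h>

    enum { STAT_CACHE, STAT_RSS, NR_STATS };

    struct group { long stat[NR_STATS]; };

    /* The caller states what kind of page it is charging; no per-page
     * "cache" flag has to be read back. */
    static void charge_statistics(struct group *g, int anon, long nr_pages)
    {
        if (anon)
            g->stat[STAT_RSS] += nr_pages;
        else
            g->stat[STAT_CACHE] += nr_pages;
    }

    int main(void)
    {
        struct group g = { {0} };
        charge_statistics(&g, 1, 1);    /* charge one anonymous page      */
        charge_statistics(&g, 0, -1);   /* uncharge one page-cache page   */
        printf("rss=%ld cache=%ld\n", g.stat[STAT_RSS], g.stat[STAT_CACHE]);
        return 0;
    }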
2012-03-22  memcg: remove unnecessary thp check in page stat accounting  KAMEZAWA Hiroyuki
Commit e94c8a9cbce1 ("memcg: make mem_cgroup_split_huge_fixup() more efficient") removed move_lock_page_cgroup(). So we do not have to check PageTransHuge in mem_cgroup_update_page_stat() and fall back into the locked accounting, because both move_account() and thp split are done with compound_lock so they cannot race. The race between update vs. move is protected by mem_cgroup_stealed. PageTransHuge pages shouldn't appear in this code path currently because we are tracking only file pages at the moment, but later we are planning to track also other pages (e.g. mlocked ones). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Acked-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: remove redundant returns  Hugh Dickins
Remove redundant returns from ends of functions, and one blank line. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: enum lru_list lru  Hugh Dickins
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: lru_size instead of MEM_CGROUP_ZSTAT  Hugh Dickins
I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to obscure the LRU counts. For easier searching? So call it lru_size rather than bare count (lru_length sounds better, but would be wrong, since each huge page raises lru_size hugely). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  memcg: replace mem and mem_cont stragglers  Hugh Dickins
Replace mem and mem_cont stragglers in memcontrol.c by memcg. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  mm, memcg: pass charge order to oom killer  David Rientjes
The oom killer typically displays the allocation order at the time of oom as a part of its diagnostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22  mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode  Andrea Arcangeli
In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem held in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables from going away from under them; during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example, if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped; if the page fault runs first it will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd became a pmd_trans_huge()).

The simplest way to fix this is to read the pmd into the local stack (regardless of what we read, no need of actual CPU barriers, only a compiler barrier is needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like the one below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd.

The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude that different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
        if (next-addr != HPAGE_PMD_SIZE) {
            VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
            split_huge_page_pmd(vma->vm_mm, pmd);
        } else if (zap_huge_pmd(tlb, vma, pmd, addr))
            continue;
        /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference.

====== start quote =======

mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!

At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry.
        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
     -> 145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range.

[flattened diagram: the 2 MB huge page region of the virtual address space; Thread B's fault address B(fault) lies inside the range A(range) that Thread A is madvising]

- Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                        // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
                      // sneaks in here as shown below.
                      //
                      if (pmd_none_or_clear_bad(pmd))
                      {
                        if (unlikely(pmd_bad(*pmd)))
                          pmd_clear_bad
                          {
                            pmd_ERROR
                              // Log "bad pmd ..." message here.
                            pmd_clear
                              // Clear the page's PMD entry.
                              // Thread B incremented the map count
                              // in page_add_new_anon_rmap(), but
                              // now the page is no longer mapped
                              // by a PMD entry (-> inconsistency).
                          }
                      }

- Thread B is handling a page fault on virtual address "B(fault)" shown in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock.
====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
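[Editor's sketch] The fix described above boils down to "read the pmd once into a local variable, separated from later tests by a compiler barrier, and make every decision on that local copy". Below is a minimal userspace illustration of that snapshot-plus-barrier pattern; the pmd_t stand-in, the bit values and the helper name are invented for the example (the real helper lives in the kernel's pgtable headers).

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t val; } pmd_t;        /* stand-in type */

    #define PMD_PRESENT   0x1ULL
    #define PMD_TRANSHUGE 0x2ULL

    static int pmd_none(pmd_t pmd)       { return pmd.val == 0; }
    static int pmd_trans_huge(pmd_t pmd) { return (pmd.val & PMD_TRANSHUGE) != 0; }

    /* Check a pmd that a concurrent page fault may be changing: take one
     * snapshot, force the compiler not to reload it, then test only the
     * snapshot so the "none" and "trans huge" tests see the same value. */
    static int pmd_none_or_trans_huge(pmd_t *pmdp)
    {
        pmd_t pmdval = *pmdp;                   /* single read onto the stack */
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier, like barrier() */
        return pmd_none(pmdval) || pmd_trans_huge(pmdval);
    }

    int main(void)
    {
        pmd_t pmd = { PMD_PRESENT | PMD_TRANSHUGE };
        printf("skip pte walk: %d\n", pmd_none_or_trans_huge(&pmd));
        return 0;
    }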
2012-03-21  Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  Linus Torvalds
Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups."

* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
  cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
  cgroup: remove cgroup_subsys argument from callbacks
  cgroup: remove extra calls to find_existing_css_set
  cgroup: replace tasklist_lock with rcu_read_lock
  cgroup: simplify double-check locking in cgroup_attach_proc
  cgroup: move struct cgroup_pidlist out from the header file
  cgroup: remove cgroup_attach_task_current_cg()
2012-03-16  memcg: free mem_cgroup by RCU to fix oops  Hugh Dickins
After fixing the GPF in mem_cgroup_lru_del_list(), three times one machine running a similar load (moving and removing memcgs while swapping) has oopsed in mem_cgroup_zone_nr_lru_pages(), when retrieving memcg zone numbers for get_scan_count() for shrink_mem_cgroup_zone(): this is where a struct mem_cgroup is first accessed after being chosen by mem_cgroup_iter(). Just what protects a struct mem_cgroup from being freed, in between mem_cgroup_iter()'s css_get_next() and its css_tryget()? css_tryget() fails once css->refcnt is zero with CSS_REMOVED set in flags, yes: but what if that memory is freed and reused for something else, which sets "refcnt" non-zero? Hmm, and scope for an indefinite freeze if refcnt is left at zero but flags are cleared. It's tempting to move the css_tryget() into css_get_next(), to make it really "get" the css, but I don't think that actually solves anything: the same difficulty in moving from css_id found to stable css remains. But we already have rcu_read_lock() around the two, so it's easily fixed if __mem_cgroup_free() just uses kfree_rcu() to free mem_cgroup. However, a big struct mem_cgroup is allocated with vzalloc() instead of kzalloc(), and we're not allowed to vfree() at interrupt time: there doesn't appear to be a general vfree_rcu() to help with this, so roll our own using schedule_work(). The compiler decently removes vfree_work() and vfree_rcu() when the config doesn't need them. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-09  memcg: revert fix to mapcount check for this release  Hugh Dickins
Respectfully revert commit e6ca7b89dc76 "memcg: fix mapcount check in move charge code for anonymous page" for the 3.3 release, so that it behaves exactly like releases 2.6.35 through 3.2 in this respect. Horiguchi-san's commit is correct in itself, 1 makes much more sense than 2 in that check; but it does not go far enough - swapcount should be considered too - if we really want such a check at all. We appear to have reached agreement now, and expect that 3.4 will remove the mapcount check, but had better not make 3.3 different. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05  memcg: fix mapcount check in move charge code for anonymous page  Naoya Horiguchi
Currently the charge on shared anonymous pages is supposed not to be moved in task migration. To implement this, we need to check that mapcount > 1, instead of > 2. So this patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05  memcg: fix GPF when cgroup removal races with last exit  Hugh Dickins
When moving tasks from old memcg (with move_charge_at_immigrate on new memcg), followed by removal of old memcg, hit General Protection Fault in mem_cgroup_lru_del_list() (called from release_pages called from free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from exit_mmap from mmput from exit_mm from do_exit). Somewhat reproducible, takes a few hours: the old struct mem_cgroup has been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is still trying to update its stats, and take page off lru before freeing. A task, or a charge, or a page on lru: each secures a memcg against removal. In this case, the last task has been moved out of the old memcg, and it is exiting: anonymous pages are uncharged one by one from the memcg, as they are zapped from its pagetables, so the charge gets down to 0; but the pages themselves are queued in an mmu_gather for freeing. Most of those pages will be on lru (and force_empty is careful to lru_add_drain_all, to add pages from pagevec to lru first), but not necessarily all: perhaps some have been isolated for page reclaim, perhaps some isolated for other reasons. So, force_empty may find no task, no charge and no page on lru, and let the removal proceed. There would still be no problem if these pages were immediately freed; but typically (and the put_page_testzero protocol demands it) they have to be added back to lru before they are found freeable, then removed from lru and freed. We don't see the issue when adding, because the mem_cgroup_iter() loops keep their own reference to the memcg being scanned; but when it comes to mem_cgroup_lru_del_list(). I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and PageCgroupUsed flags were used (like a trick with mirrors) to deflect view of pc->mem_cgroup to the stable root_mem_cgroup when neither set. 38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully removed those convolutions, but left this General Protection Fault. But it's surprisingly easy to restore the old behaviour: just check PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec to add), and reset pc to root_mem_cgroup if page is uncharged. A risky change? just going back to how it worked before; testing, and an audit of uses of pc->mem_cgroup, show no problem. And there's a nice bonus: with mem_cgroup_lru_add_list() itself making sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg: clear pc->mem_cgroup if necessary"). Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c is not strictly necessary: the lru_lock there, with RCU before memcg structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe without that; but it seems cleaner to rely on one dependency less. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05  memcg: fix deadlock by inverting lrucare nesting  Hugh Dickins
We have forgotten the rules of lock nesting: the irq-safe ones must be taken inside the non-irq-safe ones, otherwise we are open to deadlock:

    CPU0                                    CPU1
    ----                                    ----
    lock(&(&pc->lock)->rlock);
                                            local_irq_disable();
                                            lock(&(&zone->lru_lock)->rlock);
                                            lock(&(&pc->lock)->rlock);
    <Interrupt>
    lock(&(&zone->lru_lock)->rlock);

To check a different locking issue, I happened to add a spin_lock to memcg's bit_spin_lock in lock_page_cgroup(), and lockdep very quickly complained about __mem_cgroup_commit_charge_lrucare() (on CPU1 above).

So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare to __mem_cgroup_commit_charge() instead, taking zone->lru_lock under lock_page_cgroup() in the lrucare case.

The original was using spin_lock_irqsave, but we'd be in more trouble if it were ever called at interrupt time: unconditional _irq is enough. And ClearPageLRU before del from lru, SetPageLRU before add to lru: no strong reason, but that is the ordering used consistently elsewhere.

Fixes 36b62ad539498d00c2d280a151a ("memcg: simplify corner case handling of LRU"). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24  mm: memcg: Correct unregistering of events attached to the same eventfd  Anton Vorontsov
There is an issue when memcg unregisters events that were attached to the same eventfd:

- On the first call mem_cgroup_usage_unregister_event() removes all events attached to a given eventfd, and if there were no events left, thresholds->primary would become NULL;

- Since there were several events registered, cgroups core will call mem_cgroup_usage_unregister_event() again, but now kernel will oops, as the function doesn't expect that threshold->primary may be NULL.

That's a good question whether mem_cgroup_usage_unregister_event() should actually remove all events in one go, but nowadays it can't do any better as cftype->unregister_event callback doesn't pass any private event-associated cookie. So, let's fix the issue by simply checking for threshold->primary.

FWIW, w/o the patch the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[<ffffffff810be32c>]  [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60  EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
     [<ffffffff8107092b>] cgroup_event_remove+0x2b/0x60
     [<ffffffff8103db94>] process_one_work+0x174/0x450
     [<ffffffff8103e413>] worker_thread+0x123/0x2d0

Cc: stable <stable@vger.kernel.org> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
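[Editor's sketch] The fix amounts to an early NULL check before the code walks the primary threshold array. A hedged sketch of that kind of guard on invented stand-in structures (not the actual memcontrol.c types):

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Invented stand-ins for the threshold bookkeeping in the commit. */
    struct threshold_ary { size_t size; /* entries follow in the kernel */ };
    struct thresholds    { struct threshold_ary *primary; };

    /* Unregister events for an eventfd. If a previous call already removed
     * the last event, primary is NULL and there is nothing left to do. */
    static void usage_unregister_event(struct thresholds *t)
    {
        if (!t->primary)              /* the guard the fix adds */
            return;
        /* ... walk t->primary, drop matching events, maybe free it ... */
        free(t->primary);
        t->primary = NULL;
    }

    int main(void)
    {
        struct thresholds t = { malloc(sizeof(struct threshold_ary)) };
        usage_unregister_event(&t);   /* removes everything */
        usage_unregister_event(&t);   /* second call: no oops, just returns */
        printf("ok\n");
        return 0;
    }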
2012-02-04  mm/memcontrol.c: fix warning with CONFIG_NUMA=n  Andrew Morton
    mm/memcontrol.c: In function 'memcg_check_events':
    mm/memcontrol.c:779: warning: unused variable 'do_numainfo'

Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
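[Editor's sketch] The warning comes from a variable whose only user is compiled out when CONFIG_NUMA=n. One common shape of such a fix (shown generically here; this is not a claim about the exact change applied) is to keep the declaration under the same conditional as its user:

    #include <stdbool.h>
    #include <stdio.h>

    /* Build with -DCONFIG_NUMA to mimic CONFIG_NUMA=y, without it for =n. */
    static bool check_events(unsigned long event)
    {
        bool do_softlimit = (event & 0x1) != 0;
    #if defined(CONFIG_NUMA)
        bool do_numainfo = (event & 0x2) != 0;  /* only exists when it has a user */
    #endif

        if (do_softlimit)
            printf("update soft limit tree\n");
    #if defined(CONFIG_NUMA)
        if (do_numainfo)
            printf("update numainfo\n");
    #endif
        return do_softlimit;
    }

    int main(void)
    {
        return check_events(0x3) ? 0 : 1;
    }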
2012-02-02  cgroup: remove cgroup_subsys argument from callbacks  Li Zefan
The argument is not used at all, and it's not necessary, because a specific callback handler of course knows which subsys it belongs to. Now only ->populate() takes this argument, because the handlers of this callback always call cgroup_add_file()/cgroup_add_files(). So we reduce a few lines of code, though the shrinking of object size is minimal.

16 files changed, 113 insertions(+), 162 deletions(-)

       text    data     bss      dec    hex filename
    5486240  656987 7039960 13183187 c928d3 vmlinux.o.orig
    5486170  656987 7039960 13183117 c9288d vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-01-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  Linus Torvalds
Davem says:

 1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.
 2) tg3 header length computation correction from Eric Dumazet.
 3) More build and reference counting fixes for socket memory cgroup code from Glauber Costa.
 4) module.h snuck back into a core header after all the hard work we did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.
 5) Fix PHY naming regression and add some new PCI IDs in stmmac, from Alessandro Rubini.
 6) Netlink message generation fix in new team driver, should only advertise the entries that changed during events, from Jiri Pirko.
 7) SRIOV VF registration and unregistration fixes, and also add a missing PCI ID, from Roopa Prabhu.
 8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.
 9) ftgmac100/ftmac100 build fix, missing interrupt.h include.
10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun.
11) Off by one fix in netem packet scheduler, from Vijay Subramanian.
12) TCP loss detection fix from Yuchung Cheng.
13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.
14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.
15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC congestion control modules, fix from Neal Cardwell.
16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.
17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.
18) Statistics bug fixes in mlx4 from Eugenia Emantayev.
19) rds locking bug fix during info dumps, from your's truly.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  rds: Make rds_sock_lock BH rather than IRQ safe.
  netprio_cgroup.h: dont include module.h from other includes
  net: flow_dissector.c missing include linux/export.h
  team: send only changed options/ports via netlink
  net/hyperv: fix possible memory leak in do_set_multicast()
  drivers/net: dsa/mv88e6xxx.c files need linux/module.h
  stmmac: added PCI identifiers
  llc: Fix race condition in llc_ui_recvmsg
  stmmac: fix phy naming inconsistency
  dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
  tg3: fix ipv6 header length computation
  skge: add byte queue limit support
  mv643xx_eth: Add Rx Discard and Rx Overrun statistics
  bnx2x: fix compilation error with SOE in fw_dump
  bnx2x: handle CHIP_REVISION during init_one
  bnx2x: allow user to change ring size in ISCSI SD mode
  bnx2x: fix Big-Endianess in ethtool -t
  bnx2x: fixed ethtool statistics for MF modes
  bnx2x: credit-leakage fixup on vlan_mac_del_all
  macvlan: fix a possible use after free
  ...
2012-01-23  mm: memcg: update the correct soft limit tree during migration  Johannes Weiner
end_migration() passes the old page instead of the new page to commit the charge. This page descriptor is not used for committing itself, though, since we also pass the (correct) page_cgroup descriptor. But it's used to find the soft limit tree through the page's zone, so the soft limit tree of the old page's zone is updated instead of that of the new page's, which might get slightly out of date until the next charge reaches the ratelimit point. This glitch has been present since 5564e88 ("memcg: condense page_cgroup-to-page lookup points"). This fixes a bug that I introduced in 2.6.38. It's benign enough (to my knowledge) that we probably don't want this for stable. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-22  net: fix socket memcg build with !CONFIG_NET  Glauber Costa
There is still a build bug with the sock memcg code, that triggers with !CONFIG_NET, that survived my series of randconfig builds. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17  net: move sock_update_memcg outside of CONFIG_INET  Glauber Costa
Although only used currently for tcp sockets, this function is now used in common sock code (for sock_clone()) Commit 475f1b52645a29936b9df1d8fcd45f7e56bd4a9f moved the declaration of sock_update_clone() to inside sock.c, but this only fixes the problem when CONFIG_CGROUP_MEM_RES_CTLR_KMEM is also not defined. This patch here is verified to fix both problems, although reverting the previous one is not necessary. Signed-off-by: Glauber Costa <glommer@parallels.com> CC: David S. Miller <davem@davemloft.net> CC: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-13  memcg: fix mem_cgroup_print_bad_page  Hugh Dickins
If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page() shows a "Bad page state" message, removes page from circulation, adds a taint and continues. This is at a very low level, often when a spinlock is held (sometimes when page table lock is held, for example). We want to recover from this badness, not make it worse: we must not kmalloc memory here, we must not do a cgroup path lookup via dubious pointers. No doubt that code was useful to debug a particular case at one time, and may be again, but take it out of the mainline kernel. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: fix split_huge_page_refcounts()  Hugh Dickins
This patch started off as a cleanup: __split_huge_page_refcounts() has to cope with two scenarios, when the hugepage being split is already on LRU, and when it is not; but why does it have to split that accounting across three different sites? Consolidate it in lru_add_page_tail(), handling evictable and unevictable alike, and use standard add_page_to_lru_list() when accounting is needed (when the head is not yet on LRU). But a recent regression in -next, I guess the removal of PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix: under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number, messing up reclaim calculations and causing a freeze at rmdir of cgroup. Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that count - this has not been the only such incident. Document that lru_add_page_tail() is for Transparent HugePages by #ifdef around it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: cleanup for_each_node_state()  Bob Liu
We already have for_each_node(node) defined in nodemask.h, better to use it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: simplify LRU handling by new rule  KAMEZAWA Hiroyuki
Now, at LRU handling, memory cgroup needs to do complicated work to see a valid pc->mem_cgroup, which may be overwritten. This patch is for relaxing the protocol. This patch guarantees that when pc->mem_cgroup is overwritten, the page must not be on LRU. By this, the LRU routine can believe pc->mem_cgroup and doesn't need to check bits on pc->flags. This new rule may add a small overhead to swapin, but in most cases lru handling gets faster. After this patch, the PCG_ACCT_LRU bit is obsolete and removed. [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree] [akpm@linux-foundation.org: clean up code comment] [hughd@google.com: fix NULL mem_cgroup_try_charge] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: clear pc->mem_cgroup if necessary.  KAMEZAWA Hiroyuki
This is a preparation before removing the flag PCG_ACCT_LRU in page_cgroup and reducing atomic ops/complexity in memcg LRU handling. In some cases, pages are added to lru before being charged to memcg and are not classified to a memory cgroup at lru addition. Now, the lru where the page should be added is determined by a bit in page_cgroup->flags and by pc->mem_cgroup. I'd like to remove the check of the flag. To handle the case where pc->mem_cgroup may contain stale pointers (pages added to LRU before classification), this patch resets pc->mem_cgroup to root_mem_cgroup before lru additions. [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build] [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build] [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal] [hughd@google.com: stop oops in mem_cgroup_reset_owner()] [hughd@google.com: fix page migration to reset_owner] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: simplify corner case handling of LRU.  KAMEZAWA Hiroyuki
This patch simplifies LRU handling of the racy case (memcg+SwapCache). At charging, SwapCache tends to be on LRU already. So, before overwriting pc->mem_cgroup, the page must be removed from LRU and added back to LRU later. This patch does:

    spin_lock(zone->lru_lock);
    if (PageLRU(page))
        remove from LRU
    overwrite pc->mem_cgroup
    if (PageLRU(page))
        add to new LRU.
    spin_unlock(zone->lru_lock);

And it guarantees that pages are not on LRU while pc->mem_cgroup is modified. This patch also unifies lru handling of replace_page_cache() and swapin. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
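[Editor's sketch] A userspace model of the rule above: an element may only change its owning list while it is off every list and the list lock is held. All names are invented for the illustration; only the "unlink, retarget, relink under one lock" shape reflects the commit.

    #include <pthread.h>
    #include <stdio.h>

    struct lru { long nr; };               /* per-group LRU size counter */

    struct page_stub {
        struct lru *lru;                   /* stands in for pc->mem_cgroup */
        int on_lru;                        /* stands in for PageLRU() */
    };

    /* Retarget a page to a new group: take it off the old LRU, switch the
     * owner, put it on the new LRU, all under the shared lru lock. */
    static void recharge(struct page_stub *p, struct lru *newlru,
                         pthread_mutex_t *zone_lru_lock)
    {
        pthread_mutex_lock(zone_lru_lock);
        int was_on_lru = p->on_lru;
        if (was_on_lru) {                  /* remove from the old LRU */
            p->lru->nr--;
            p->on_lru = 0;
        }
        p->lru = newlru;                   /* overwrite the owner pointer */
        if (was_on_lru) {                  /* add to the new LRU */
            p->lru->nr++;
            p->on_lru = 1;
        }
        pthread_mutex_unlock(zone_lru_lock);
    }

    int main(void)
    {
        pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
        struct lru old = {1}, new = {0};
        struct page_stub p = { &old, 1 };
        recharge(&p, &new, &zone_lock);
        printf("old=%ld new=%ld\n", old.nr, new.nr);  /* old=0 new=1 */
        return 0;
    }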
2012-01-13  memcg: simplify page cache charging  KAMEZAWA Hiroyuki
This patch is a clean up. No functional/logical changes. Because of commit ef6a3c6311 ("mm: add replace_page_cache_page() function") , FUSE uses replace_page_cache() instead of add_to_page_cache(). Then, mem_cgroup_cache_charge() is not called against FUSE's pages from splice. So now, mem_cgroup_cache_charge() gets pages that are not on the LRU with the exception of PageSwapCache pages. For checking, WARN_ON_ONCE(PageLRU(page)) is added. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  oom, memcg: fix exclusion of memcg threads after they have detached their mm  David Rientjes
The oom killer relies on logic that identifies threads that have already been oom killed when scanning the tasklist and, if found, deferring until such threads have exited. This is done by checking for any candidate threads that have the TIF_MEMDIE bit set. For memcg ooms, candidate threads are first found by calling task_in_mem_cgroup() since the oom killer should not defer if there's an oom killed thread in another memcg. Unfortunately, task_in_mem_cgroup() excludes threads if they have detached their mm in the process of exiting so TIF_MEMDIE is never detected for such conditions. This is different for global, mempolicy, and cpuset oom conditions where a detached mm is only excluded after checking for TIF_MEMDIE and deferring, if necessary, in select_bad_process(). The fix is to return true if a task has a detached mm but is still in the memcg or its hierarchy that is currently oom. This will allow the oom killer to appropriately defer rather than kill unnecessarily or, in the worst case, panic the machine if nothing else is available to kill. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  memcg: free entries in soft_limit_tree if allocation fails  Michal Hocko
If we are not able to allocate tree nodes for all NUMA nodes then we should release those that were allocated. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  page_cgroup: add helper function to get swap_cgroup  Bob Liu
There are multiple places which need to get the swap_cgroup address, so add a helper function: static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent, struct swap_cgroup_ctrl **ctrl); to simplify the code. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: remove unneeded checks from uncharge_page()  Johannes Weiner
mem_cgroup_uncharge_page() is only called on either freshly allocated pages without page->mapping or on rmapped PageAnon() pages. There is no need to check for a page->mapping that is not an anon_vma. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: remove unneeded checks from newpage_charge()  Johannes Weiner
All callsites pass in freshly allocated pages and a valid mm. As a result, all checks pertaining to the page's mapcount, page->mapping or the fallback to init_mm are unneeded. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: lookup_page_cgroup (almost) never returns NULL  Johannes Weiner
Pages have their corresponding page_cgroup descriptors set up before they are used in userspace, and thus managed by a memory cgroup. The only time where lookup_page_cgroup() can return NULL is in the CONFIG_DEBUG_VM-only page sanity checking code that executes while feeding pages into the page allocator for the first time. Remove the NULL checks against lookup_page_cgroup() results from all callsites where we know that corresponding page_cgroup descriptors must be allocated, and add a comment to the callsite that actually does have to check the return value. [hughd@google.com: stop oops in mem_cgroup_update_page_stat()] Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: clean up fault accounting  Johannes Weiner
The fault accounting functions have a single, memcg-internal user, so they don't need to be global. In fact, their one-line bodies can be directly folded into the caller. And since faults happen one at a time, use this_cpu_inc() directly instead of this_cpu_add(foo, 1). Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: unify remaining mem_cont, mem, etc. variable names to memcg  Johannes Weiner
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: shorten preempt-disabled section around event checks  Johannes Weiner
Only the ratelimit checks themselves have to run with preemption disabled, the resulting actions - checking for usage thresholds, updating the soft limit tree - can and should run with preemption enabled. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reported-by: Yong Zhang <yong.zhang0@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Reported-by: Luis Henriques <henrix@camandro.org> Tested-by: Luis Henriques <henrix@camandro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
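[Editor's sketch] The pattern in the commit above is to make only the decisions that genuinely need the preempt-disabled (per-cpu) context inside the critical section, and act on them afterwards. A userspace model, with a mutex standing in for preempt_disable()/preempt_enable() and invented counters:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t percpu_ctx = PTHREAD_MUTEX_INITIALIZER; /* stands in
                                            for preempt_disable()/enable() */
    static long events_since_threshold = 129;    /* pretend per-cpu counter */

    static void update_tree(void)      { printf("soft limit tree updated\n"); }
    static void check_thresholds(void) { printf("thresholds checked\n");      }

    static void memcg_check_events(void)
    {
        bool do_softlimit, do_thresh;

        /* Only the ratelimit test touches "per-cpu" state, so only it runs
         * inside the protected section. */
        pthread_mutex_lock(&percpu_ctx);
        do_thresh    = events_since_threshold > 64;
        do_softlimit = events_since_threshold > 128;
        if (do_softlimit)
            events_since_threshold = 0;
        pthread_mutex_unlock(&percpu_ctx);

        /* The heavier work runs outside, "with preemption enabled". */
        if (do_thresh)
            check_thresholds();
        if (do_softlimit)
            update_tree();
    }

    int main(void) { memcg_check_events(); return 0; }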
2012-01-13  memcg: make mem_cgroup_split_huge_fixup() more efficient  KAMEZAWA Hiroyuki
In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle page_cgroup modifications. It takes move_lock_page_cgroup(), modifies page_cgroup and LRU accounting, and is called HPAGE_PMD_SIZE - 1 times. But thinking again:

- compound_lock() is held at move_account... then it's not necessary to take move_lock_page_cgroup().
- LRU is locked and all tail pages will go into the same LRU as the head is now on.
- page_cgroup is contiguous in the huge page range.

This patch fixes mem_cgroup_split_huge_fixup() to be called once per hugepage and reduces the cost of splitting. [akpm@linux-foundation.org: fix typo, per Michal] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
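[Editor's sketch] The shape of this change is "one call that loops over the tail pages internally, under locks the caller already holds" instead of one locked fixup call per tail page. A trivial illustration with invented names (not the kernel functions):

    #include <stdio.h>

    #define HPAGE_NR 512            /* tail pages per 2 MB huge page on x86 */

    struct page_stub { int counted; };

    /* New shape (modelled): one call per huge page, looping inside while
     * the caller already holds the compound/LRU locks. */
    static void split_huge_fixup(struct page_stub *head, int nr)
    {
        for (int i = 1; i < nr; i++)     /* tails only; head stays as-is */
            head[i].counted = 1;
    }

    int main(void)
    {
        static struct page_stub hp[HPAGE_NR];
        split_huge_fixup(hp, HPAGE_NR);  /* one call instead of 511 */
        printf("tail[1] counted=%d\n", hp[1].counted);
        return 0;
    }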
2012-01-13  mm: make per-memcg LRU lists exclusive  Johannes Weiner
Now that all code that operated on global per-zone LRU lists is converted to operate on per-memory cgroup LRU lists instead, there is no reason to keep the double-LRU scheme around any longer. The pc->lru member is removed and page->lru is linked directly to the per-memory cgroup LRU lists, which removes two pointers from a descriptor that exists for every page frame in the system. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: collect LRU list heads into struct lruvec  Johannes Weiner
Having a unified structure with a LRU list set for both global zones and per-memcg zones allows to keep that code simple which deals with LRU lists and does not care about the container itself. Once the per-memcg LRU lists directly link struct pages, the isolation function and all other list manipulations are shared between the memcg case and the global LRU case. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13  mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty  Johannes Weiner
root_mem_cgroup, lacking a configurable limit, was never subject to limit reclaim, so the pages charged to it could be kept off its LRU lists. They would be found on the global per-zone LRU lists upon physical memory pressure and it made sense to avoid uselessly linking them to both lists. The global per-zone LRU lists are about to go away on memcg-enabled kernels, with all pages being exclusively linked to their respective per-memcg LRU lists. As a result, pages of the root_mem_cgroup must also be linked to its LRU lists again. This is purely about the LRU list, root_mem_cgroup is still not charged. The overhead is temporary until the double-LRU scheme is going away completely. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13mm: move memcg hierarchy reclaim to generic reclaim codeJohannes Weiner
Memory cgroup limit reclaim and traditional global pressure reclaim will soon share the same code to reclaim from a hierarchical tree of memory cgroups. In preparation for this, move the two right next to each other in shrink_zone().

The do-everything mem_cgroup_hierarchical_reclaim() is split into a soft limit reclaim function, which still does hierarchy walking on its own, and a limit (shrinking) reclaim function, which relies on the generic reclaim code to walk the hierarchy.

Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
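A rough sketch of the shape this gives shrink_zone(); the per-memcg helper name, the scan_control field, and the way the reclaim target is passed in are assumptions for illustration, not quoted from the patch:

  /* Hedged sketch: limit reclaim relies on the generic code walking the
   * hierarchy below the reclaim target; soft limit reclaim keeps its own
   * walk for the time being. */
  static void shrink_zone(int priority, struct zone *zone,
                          struct scan_control *sc)
  {
          struct mem_cgroup *root = sc->target_mem_cgroup;  /* NULL-rooted for global reclaim */
          struct mem_cgroup_reclaim_cookie reclaim = {
                  .zone = zone,
                  .priority = priority,
          };
          struct mem_cgroup *memcg;

          /* walk the hierarchy below the reclaim root ... */
          for (memcg = mem_cgroup_iter(root, NULL, &reclaim);
               memcg;
               memcg = mem_cgroup_iter(root, memcg, &reclaim))
                  /* ... and reclaim each memcg's share of this zone */
                  shrink_mem_cgroup_zone(priority, zone, memcg, sc);
  }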
2012-01-13mm: memcg: per-priority per-zone hierarchy scan generationsJohannes Weiner
Memory cgroup limit reclaim currently picks one memory cgroup out of the target hierarchy, remembers it as the last scanned child, and reclaims all zones in it with decreasing priority levels.

The new hierarchy reclaim code will pick memory cgroups from the same hierarchy concurrently from different zones and priority levels, so it becomes necessary that hierarchy roots not only remember the last scanned child, but do so for each zone and priority level.

Until now, we reclaimed memcgs like this:

  mem = mem_cgroup_iter(root)
  for each priority level:
    for each zone in zonelist:
      reclaim(mem, zone)

But subsequent patches will move the memcg iteration inside the loop over the zones:

  for each priority level:
    for each zone in zonelist:
      mem = mem_cgroup_iter(root)
      reclaim(mem, zone)

To keep the original scan order - memcg -> priority -> zone - the last scanned memcg has to be remembered per zone and per priority level.

Furthermore, global reclaim will be switched to the hierarchy walk as well. Unlike limit reclaim, which can just recheck the limit after some reclaim progress, its target is to scan all memcgs for the desired zone pages, proportional to memcg size, so reliably detecting a full hierarchy round-trip becomes crucial.

Currently, the code relies on one reclaimer encountering the same memcg twice, but that is error-prone with concurrent reclaimers. Instead, use a generation counter that is increased every time the child with the highest ID has been visited, so that reclaimers can stop when the generation changes.

Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
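A minimal sketch of the per-zone, per-priority iterator state described above; the structure and field names approximate the series but are assumed here for illustration:

  /* Hedged sketch: reclaim position is remembered per zone and per priority
   * level, so concurrent reclaimers at different zones or priorities do not
   * disturb each other's walk. */
  struct mem_cgroup_reclaim_iter {
          int position;                   /* last scanned child (css id) */
          unsigned int generation;        /* bumped after a full round-trip */
  };

  struct mem_cgroup_per_zone {
          /* ... */
          /* one iterator slot for every reclaim priority level */
          struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
  };

  /* A reclaimer notes the generation when it starts and stops iterating once
   * it observes that the generation has changed, which reliably marks a
   * completed round-trip even with concurrent reclaimers. */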
2012-01-13mm: memcg: consolidate hierarchy iteration primitivesJohannes Weiner
The memcg naturalization series:

Memory control groups are currently bolted onto the side of traditional memory management in places where better integration would be preferable. To reclaim memory, for example, memory control groups maintain their own LRU list and reclaim strategy aside from the global per-zone LRU list reclaim. But an extra list head for each existing page frame is expensive and maintaining it requires additional code.

This patchset disables the global per-zone LRU lists on memory cgroup configurations and converts all their users to operate on the per-memory cgroup lists instead. As LRU pages are then exclusively on one list, this saves two list pointers for each page frame in the system:

page_cgroup array size with 4G physical memory

  vanilla: allocated 31457280 bytes of page_cgroup
  patched: allocated 15728640 bytes of page_cgroup

At the same time, system performance for various workloads is unaffected:

100G sparse file cat, 4G physical memory, 10 runs, to test for code bloat in the traditional LRU handling and kswapd & direct reclaim paths, without/with the memory controller configured in

  vanilla: 71.603(0.207) seconds
  patched: 71.640(0.156) seconds

  vanilla: 79.558(0.288) seconds
  patched: 77.233(0.147) seconds

100G sparse file cat in 1G memory cgroup, 10 runs, to test for code bloat in the traditional memory cgroup LRU handling and reclaim path

  vanilla: 96.844(0.281) seconds
  patched: 94.454(0.311) seconds

4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M swap on SSD, 10 runs, to test for regressions in kswapd & direct reclaim using per-memcg LRU lists with multiple memcgs and multiple allocators within each memcg

  vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
  patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]

16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M swap on SSD, 10 runs, to test for regressions in hierarchical memcg setups

  vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
  patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

This patch:

There are currently two different implementations of iterating over a memory cgroup hierarchy tree. Consolidate them into one worker function and base the convenience looping-macros on top of it.

Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
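A minimal sketch of the consolidation this describes: one worker walks the hierarchy and the convenience macros are thin wrappers on top of it. A simplified two-argument signature is used here for illustration; the worker grows an additional per-zone reclaim-state parameter later in the series.

  /* Hedged sketch, not the exact kernel signature. */
  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
                                     struct mem_cgroup *prev);

  /* Walk the whole hierarchy below @root ... */
  #define for_each_mem_cgroup_tree(iter, root)            \
          for (iter = mem_cgroup_iter(root, NULL);        \
               iter != NULL;                              \
               iter = mem_cgroup_iter(root, iter))

  /* ... or every memory cgroup in the system when the root is NULL. */
  #define for_each_mem_cgroup(iter)                       \
          for (iter = mem_cgroup_iter(NULL, NULL);        \
               iter != NULL;                              \
               iter = mem_cgroup_iter(NULL, iter))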