summaryrefslogtreecommitdiff
path: root/mm/ksm.c
AgeCommit message (Collapse)Author
2016-05-12ksm: fix conflict between mmput and scan_get_next_rmap_itemZhou Chengming
A concurrency issue about KSM in the function scan_get_next_rmap_item. task A (ksmd): |task B (the mm's task): | mm = slot->mm; | down_read(&mm->mmap_sem); | | ... | | spin_lock(&ksm_mmlist_lock); | | ksm_scan.mm_slot go to the next slot; | | spin_unlock(&ksm_mmlist_lock); | |mmput() -> | ksm_exit(): | |spin_lock(&ksm_mmlist_lock); |if (mm_slot && ksm_scan.mm_slot != mm_slot) { | if (!mm_slot->rmap_list) { | easy_to_free = 1; | ... | |if (easy_to_free) { | mmdrop(mm); | ... | |So this mm_struct may be freed in the mmput(). | up_read(&mm->mmap_sem); | As we can see above, the ksmd thread may access a mm_struct that already been freed to the kmem_cache. Suppose a fork will get this mm_struct from the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will cause mmap_sem.count to become -1. As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has the same SMP race condition, so fix it too. My prev fix in function scan_get_next_rmap_item will introduce a different SMP race condition, so just invert the up_read/spin_unlock order as Andrea Arcangeli said. Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Geliang Tang <geliangtang@163.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18mm/core: Do not enforce PKEY permissions on remote mm accessDave Hansen
We try to enforce protection keys in software the same way that we do in hardware. (See long example below). But, we only want to do this when accessing our *own* process's memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then tried to PTRACE_POKE a target process which just happened to have some mprotect_pkey(pkey=6) memory, we do *not* want to deny the debugger access to that memory. PKRU is fundamentally a thread-local structure and we do not want to enforce it on access to _another_ thread's data. This gets especially tricky when we have workqueues or other delayed-work mechanisms that might run in a random process's context. We can check that we only enforce pkeys when operating on our *own* mm, but delayed work gets performed when a random user context is active. We might end up with a situation where a delayed-work gup fails when running randomly under its "own" task but succeeds when running under another process. We want to avoid that. To avoid that, we use the new GUP flag: FOLL_REMOTE and add a fault flag: FAULT_FLAG_REMOTE. They indicate that we are walking an mm which is not guranteed to be the same as current->mm and should not be subject to protection key enforcement. Thanks to Jerome Glisse for pointing out this scenario. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Dominik Vogt <vogt@linux.vnet.ibm.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geliang Tang <geliangtang@163.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Low <jason.low2@hp.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: iommu@lists.linux-foundation.org Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16mm/gup: Switch all callers of get_user_pages() to not pass tsk/mmDave Hansen
We will soon modify the vanilla get_user_pages() so it can no longer be used on mm/tasks other than 'current/current->mm', which is by far the most common way it is called. For now, we allow the old-style calls, but warn when they are used. (implemented in previous patch) This patch switches all callers of: get_user_pages() get_user_pages_unlocked() get_user_pages_locked() to stop passing tsk/mm so they will no longer see the warnings. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-16mm/ksm.c: mark stable page dirtyMinchan Kim
The MADV_FREE patchset changes page reclaim to simply free a clean anonymous page with no dirty ptes, instead of swapping it out; but KSM uses clean write-protected ptes to reference the stable ksm page. So be sure to mark that page dirty, so it's never mistakenly discarded. [hughd@google.com: adjusted comments] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16ksm: prepare to new THP semanticsKirill A. Shutemov
We don't need special code to stabilize THP. If you've got reference to any subpage of THP it will not be split under you. New split_huge_page() also accepts tail pages: no need in special code to get reference to head page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16rmap: add argument to charge compound pageKirill A. Shutemov
We're going to allow mapping of individual 4k pages of THP compound page. It means we cannot rely on PageTransHuge() check to decide if map/unmap small page or THP. The patch adds new argument to rmap functions to indicate whether we want to operate on whole compound page or only the small page. [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16page-flags: define PG_locked behavior on compound pagesKirill A. Shutemov
lock_page() must operate on the whole compound page. It doesn't make much sense to lock part of compound page. Change code to use head page's PG_locked, if tail page is passed. This patch also gets rid of custom helper functions -- __set_page_locked() and __clear_page_locked(). They are replaced with helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these helper would trigger VM_BUG_ON(). SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never appear there. VM_BUG_ON() is added to make sure that this assumption is correct. [akpm@linux-foundation.org: fix fs/cifs/file.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm/ksm.c: use list_for_each_entry_safeGeliang Tang
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ksm: unstable_tree_search_insert error checking cleanupAndrea Arcangeli
get_mergeable_page() can only return NULL (also in case of errors) or the pinned mergeable page. It can't return an error different than NULL. This optimizes away the unnecessary error check. Add a return after the "out:" label in the callee to make it more readable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ksm: use find_mergeable_vma in try_to_merge_with_ksm_pageAndrea Arcangeli
Doing the VM_MERGEABLE check after the page == kpage check won't provide any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is the only superfluous bit in using find_mergeable_vma because the !PageAnon check of try_to_merge_one_page() implicitly checks for that, but it still looks cleaner to share the same find_mergeable_vma(). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ksm: use the helper method to do the hlist_empty checkAndrea Arcangeli
This just uses the helper function to cleanup the assumption on the hlist_node internals. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ksm: don't fail stable tree lookups if walking over stale stable_nodesAndrea Arcangeli
The stable_nodes can become stale at any time if the underlying pages gets freed. The stable_node gets collected and removed from the stable rbtree if that is detected during the rbtree lookups. Don't fail the lookup if running into stale stable_nodes, just restart the lookup after collecting the stale stable_nodes. Otherwise the CPU spent in the preparation stage is wasted and the lookup must be repeated at the next loop potentially failing a second time in a second stale stable_node. If we don't prune aggressively we delay the merging of the unstable node candidates and at the same time we delay the freeing of the stale stable_nodes. Keeping stale stable_nodes around wastes memory and it can't provide any benefit. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ksm: add cond_resched() to the rmap_walksAndrea Arcangeli
While at it add it to the file and anon walks too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm: remove rest of ACCESS_ONCE() usagesJason Low
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and use the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10mm: remove rest usage of VM_NONLINEAR and pte_file()Kirill A. Shutemov
One bit in ->vm_flags is unused now! Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-29vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mmu_notifier: call mmu_notifier_invalidate_range() from VMMJoerg Roedel
Add calls to the new mmu_notifier_invalidate_range() function to all places in the VMM that need it. Signed-off-by: Joerg Roedel <jroedel@suse.de> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Jay Cornwall <Jay.Cornwall@amd.com> Cc: Oded Gabbay <Oded.Gabbay@amd.com> Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Oded Gabbay <oded.gabbay@amd.com>
2014-10-10mm: ksm use pr_err instead of printkPaul McQuade
WARNING: Prefer: pr_err(... to printk(KERN_ERR ... [akpm@linux-foundation.org: remove KERN_ERR] Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-16sched: Remove proliferation of wait_on_bit() action functionsNeilBrown
The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-23mm: let mm_find_pmd fix buggy race with THP faultHugh Dickins
Trinity has reported: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1)) CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398 lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151) remove_migration_pte (mm/migrate.c:137) rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699) remove_migration_ptes (mm/migrate.c:224) migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126) migrate_misplaced_page (mm/migrate.c:1733) __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925) handle_mm_fault (mm/memory.c:3948) __get_user_pages (mm/memory.c:1851) __mlock_vma_pages_range (mm/mlock.c:255) __mm_populate (mm/mlock.c:711) SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791) I believe this comes about because, whereas collapsing and splitting THP functions take anon_vma lock in write mode (which excludes concurrent rmap walks), faulting THP functions (write protection and misplaced NUMA) do not - and mostly they do not need to. But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for an instant (indeed, for a long instant, given the inter-CPU TLB flush in there), leaves *pmd neither present not trans_huge. Which can confuse a concurrent rmap walk, as when removing migration ptes, seen in the dumped trace. Although that rmap walk has a 4k page to insert, anon_vmas containing THPs are in no way segregated from 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with that instant when a trans_huge pmd is temporarily absent. I don't think we need strengthen the locking at the THP end: it's easily handled with an ACCESS_ONCE() before testing both conditions. And since mm_find_pmd() had only one caller who wanted a THP rather than a pmd, let's slightly repurpose it to fail when it hits a THP or non-present pmd, and open code split_huge_page_address() again. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Jones <davej@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04mm: close PageTail raceDavid Rientjes
Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-24mm: audit/fix non-modular users of module_init in core codePaul Gortmaker
Code that is obj-y (always built-in) or dependent on a bool Kconfig (built-in or absent) can never be modular. So using module_init as an alias for __initcall can be somewhat misleading. Fix these up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. The audit targets the following module_init users for change: mm/ksm.c bool KSM mm/mmap.c bool MMU mm/huge_memory.c bool TRANSPARENT_HUGEPAGE mm/mmu_notifier.c bool MMU_NOTIFIER Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of subsys_initcall (which makes sense for these files) will thus change this registration from level 6-device to level 4-subsys (i.e. slightly earlier). However no observable impact of that difference has been observed during testing. One might think that core_initcall (l2) or postcore_initcall (l3) would be more appropriate for anything in mm/ but if we look at some actual init functions themselves, we see things like: mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes mm/ksm.c --> ksm_init --> sysfs_create_group and hence the choice of subsys_initcall (l4) seems reasonable, and at the same time minimizes the risk of changing the priority too drastically all at once. We can adjust further in the future. Also, several instances of missing ";" at EOL are fixed. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-24mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22mm/rmap: use rmap_walk() in page_referenced()Joonsoo Kim
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in page_referenced(). In this patch, I change following things. 1. remove some variants of rmap traversing functions. cf> page_referenced_ksm, page_referenced_anon, page_referenced_file 2. introduce new struct page_referenced_arg and pass it to page_referenced_one(), main function of rmap_walk, in order to count reference, to store vm_flags and to check finish condition. 3. mechanical change to use rmap_walk() in page_referenced(). [liwanp@linux.vnet.ibm.com: fix BUG at rmap_walk] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22mm/rmap: use rmap_walk() in try_to_munlock()Joonsoo Kim
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in try_to_munlock(). In this patch, I change following things. 1. remove some variants of rmap traversing functions. cf> try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file 2. mechanical change to use rmap_walk() in try_to_munlock(). 3. copy and paste comments. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22mm/rmap: use rmap_walk() in try_to_unmap()Joonsoo Kim
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in try_to_unmap(). In this patch, I change following things. 1. enable rmap_walk() if !CONFIG_MIGRATION. 2. mechanical change to use rmap_walk() in try_to_unmap(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22mm/rmap: extend rmap_walk_xxx() to cope with different casesJoonsoo Kim
There are a lot of common parts in traversing functions, but there are also a little of uncommon parts in it. By assigning proper function pointer on each rmap_walker_control, we can handle these difference correctly. Following are differences we should handle. 1. difference of lock function in anon mapping case 2. nonlinear handling in file mapping case 3. prechecked condition: checking memcg in page_referenced(), checking VM_SHARE in page_mkclean() checking temporary vma in try_to_unmap() 4. exit condition: checking page_mapped() in try_to_unmap() So, in this patch, I introduce 4 function pointers to handle above differences. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22mm/rmap: make rmap_walk to get the rmap_walk_control argumentJoonsoo Kim
In each rmap traverse case, there is some difference so that we need function pointers and arguments to them in order to handle these For this purpose, struct rmap_walk_control is introduced in this patch, and will be extended in following patch. Introducing and extending are separate, because it clarify changes. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13ksm: remove redundant __GFP_ZERO from kcallocJoe Perches
kcalloc returns zeroed memory. There's no need to use this flag. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: replace strict_strtoul() with kstrtoul()Jingoo Han
The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08ksm: fix m68k build: only NUMA needs pfn_to_nidHugh Dickins
A CONFIG_DISCONTIGMEM=y m68k config gave mm/ksm.c: In function `get_kpfn_nid': mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid' linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM (see arch/mips/include/asm/mmzone.h for example). Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze, and m68k got away without it so far, so fix the build in mm/ksm.c. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: allocate roots when neededHugh Dickins
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically allocated, particularly when very few users will ever actually tune merge_across_nodes 0 to use more than 1+1 of those trees. Not a big deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity. Start off with 1+1 statically allocated, then if merge_across_nodes is ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to free up the extra if it's tuned back, that would be a waste of effort. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm,ksm: FOLL_MIGRATION do migration_entry_waitHugh Dickins
In "ksm: remove old stable nodes more thoroughly" I said that I'd never seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing, but it soon appeared once I tried fuller tests on the whole series. It turned out to be due to the KSM page migration itself: unmerge_and_ remove_all_rmap_items() failed to locate and replace all the KSM pages, because of that hiatus in page migration when old pte has been replaced by migration entry, but not yet by new pte. follow_page() finds no page at that instant, but a KSM page reappears shortly after, without a fault. Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait() for KSM's break_cow(). I'd have preferred to avoid another flag, and do it every time, in case someone else makes the same easy mistake; but did not find another transgressor (the common get_user_pages() is of course safe), and cannot be sure that every follow_page() caller is prepared to sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can already sleep there, since anon_vma locking was changed to mutex, but maybe that's somehow excluded. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: shrink 32-bit rmap_item back to 32 bytesHugh Dickins
Think of struct rmap_item as an extension of struct page (restricted to MADV_MERGEABLE areas): there may be a lot of them, we need to keep them small, especially on 32-bit architectures of limited lowmem. Siting "int nid" after "unsigned int checksum" works nicely on 64-bit, making no change to its 64-byte struct rmap_item; but bloats the 32-bit struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which rounds up to 40 bytes once allocated from slab. We'd better avoid that. Hey, I only just remembered that the anon_vma pointer in struct rmap_item has no purpose until the rmap_item is hung from a stable tree node (which has its own nid field); and rmap_item's nid field no purpose than to say which tree root to tell rb_erase() when unlinking from an unstable tree. Double them up in a union. There's just one place where we set anon_vma early (when we already hold mmap_sem): now we must remove tree_rmap_item from its unstable tree there, before overwriting nid. No need to spatter BUG()s around: we'd be seeing oopses if this were wrong. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: treat unstable nid like in stable treeHugh Dickins
An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. Now, it might be that a wrong NUMA node in an unstable tree is more likely to correlate with instablility (different content, with rbnode now misplaced) than page migration; but even so, we are accustomed to instablility in the unstable tree. Without strong evidence for which strategy is generally better, I'd rather be consistent with what's done in the stable tree: accept a page from the wrong NUMA node for comparison, but not as a leaf for merging. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: add some commentsHugh Dickins
Added slightly more detail to the Documentation of merge_across_nodes, a few comments in areas indicated by review, and renamed get_ksm_page()'s argument from "locked" to "lock_it". No functional change. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: stop hotremove lockdep warningHugh Dickins
Complaints are rare, but lockdep still does not understand the way ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a problem because notifier callbacks are made under down_read of blocking_notifier_head->rwsem (so first the mutex is taken while holding the rwsem, then later the rwsem is taken while still holding the mutex); but is not in fact a problem because mem_hotplug_mutex is held throughout the dance. There was an attempt to fix this with mutex_lock_nested(); but if that happened to fool lockdep two years ago, apparently it does so no longer. I had hoped to eradicate this issue in extending KSM page migration not to need the ksm_thread_mutex. But then realized that although the page migration itself is safe, we do still need to lock out ksmd and other users of get_ksm_page() while offlining memory - at some point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish, and get_ksm_page()'s accesses to them become a violation. So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() checks, to achieve the same lockout without being caught by lockdep. This is less elegant for KSM, but it's more important to keep lockdep useful to other users - and I apologize for how long it took to fix. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: make !merge_across_nodes migration safeHugh Dickins
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? ksm_migrate_page() can do no more than it's already doing, updating stable_node->kpfn: the stable tree itself cannot be manipulated without holding ksm_thread_mutex. So accept that a stable tree may temporarily indicate a page belonging to the wrong NUMA node, leave updating until the next pass of ksmd, just be careful not to merge other pages on to a misplaced page. Note nid of holding tree in stable_node, and recognize that it will not always match nid of kpfn. A misplaced KSM page is discovered, either when ksm_do_scan() next comes around to one of its rmap_items (we now have to go to cmp_and_merge_page even on pages in a stable tree), or when stable_tree_search() arrives at a matching node for another page, and this node page is found misplaced. In each case, move the misplaced stable_node to a list of migrate_nodes (and use the address of migrate_nodes as magic by which to identify them): we don't need them in a tree. If stable_tree_search() finds no match for a page, but it's currently exiled to this list, then slot its stable_node right there into the tree, bringing all of its mappings with it; otherwise they get migrated one by one to the original page of the colliding node. stable_tree_search() is now modelled more like stable_tree_insert(), in order to handle these insertions of migrated nodes. remove_node_from_stable_tree(), remove_all_stable_nodes() and ksm_check_stable_tree() have to handle the migrate_nodes list as well as the stable tree itself. Less obviously, we do need to prune the list of stale entries from time to time (scan_get_next_rmap_item() does it once each full scan): whereas stale nodes in the stable tree get naturally pruned as searches try to brush past them, these migrate_nodes may get forgotten and accumulate. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: make KSM page migration possibleHugh Dickins
KSM page migration is already supported in the case of memory hotremove, which takes the ksm_thread_mutex across all its migrations to keep life simple. But the new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? So far there's no provision for that, and this patch does not attempt to deal with it either. But how will I test a solution, when I don't know how to hotremove memory? The best answer is to enable KSM page migration in all cases now, and test more common cases. With THP and compaction added since KSM came in, page migration is now mainstream, and it's a shame that a KSM page can frustrate freeing a page block. Without worrying about merge_across_nodes 0 for now, this patch gets KSM page migration working reliably for default merge_across_nodes 1 (but leave the patch enabling it until near the end of the series). It's much simpler than I'd originally imagined, and does not require an additional tier of locking: page migration relies on the page lock, KSM page reclaim relies on the page lock, the page lock is enough for KSM page migration too. Almost all the care has to be in get_ksm_page(): that's the function which worries about when a stable node is stale and should be freed, now it also has to worry about the KSM page being migrated. The only new overhead is an additional put/get/lock/unlock_page when stable_tree_search() arrives at a matching node: to make sure migration respects the raised page count, and so does not migrate the page while we're busy with it here. That's probably avoidable, either by changing internal interfaces from using kpage to stable_node, or by moving the ksm_migrate_page() callsite into a page_freeze_refs() section (even if not swapcache); but this works well, I've no urge to pull it apart now. (Descents of the stable tree may pass through nodes whose KSM pages are under migration: being unlocked, the raised page count does not prevent that, nor need it: it's safe to memcmp against either old or new page.) You might worry about mremap, and whether page migration's rmap_walk to remove migration entries will find all the KSM locations where it inserted earlier: that should already be handled, by the satisfyingly heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: remove old stable nodes more thoroughlyHugh Dickins
Switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree. It's not something that people will often want to do, but it would be lame to demand a reboot when they're trying to determine which merge_across_nodes setting is best. How can this happen? We only permit switching merge_across_nodes when pages_shared is 0, and usually set run 2 to force that beforehand, which ought to unmerge everything: yet oopses still occur when you then run 1. Three causes: 1. The old stable tree (built according to the inverse merge_across_nodes) has not been fully torn down. A stable node lingers until get_ksm_page() notices that the page it references no longer references it: but the page is not necessarily freed as soon as expected, particularly when swapcache. Fix this with a pass through the old stable tree, applying get_ksm_page() to each of the remaining nodes (most found stale and removed immediately), with forced removal of any left over. Unless the page is still mapped: I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE and EBUSY than BUG. 2. __ksm_enter() has a nice little optimization, to insert the new mm just behind ksmd's cursor, so there's a full pass for it to stabilize (or be removed) before ksmd addresses it. Nice when ksmd is running, but not so nice when we're trying to unmerge all mms: we were missing those mms forked and inserted behind the unmerge cursor. Easily fixed by inserting at the end when KSM_RUN_UNMERGE. 3. It is possible for a KSM page to be faulted back from swapcache into an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. A long outstanding, unrelated bugfix sneaks in with that third fix: ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O error when read in from swap) to a page which it then marks Uptodate. Fix this case by not copying, letting do_swap_page() discover the error. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: get_ksm_page lockedHugh Dickins
In some places where get_ksm_page() is used, we need the page to be locked. When KSM migration is fully enabled, we shall want that to make sure that the page just acquired cannot be migrated beneath us (raised page count is only effective when there is serialization to make sure migration notices). Whereas when navigating through the stable tree, we certainly do not want to lock each node (raised page count is enough to guarantee the memcmps, even if page is migrated to another node). Since we're about to add another use case, add the locked argument to get_ksm_page() now. Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I really got the wrong end of the stick on that! There's a configuration in which page_cache_get_speculative() can do something cheaper than get_page_unless_zero(), relying on its caller's rcu_read_lock() to have disabled preemption for it. There's no need for rcu_read_lock() around get_page_unless_zero() (and mapping checks) here. Cut out that silliness before making this any harder to understand. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: reorganize ksm_check_stable_treeHugh Dickins
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient (restarting whenever it finds a stale node to remove), but rearrange so that at least it does not needlessly restart from nid 0 each time. And add a couple of comments: here is why we keep pfn instead of page. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: trivial tidyupsHugh Dickins
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as "int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few comments noticed in ongoing work. Leave stable_tree_insert()'s rb_linkage until after the node has been set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock make either way safe, but we're going to copy and I prefer this precedent. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: allow trees per NUMA nodePetr Holasek
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues we had with that, fully enabling KSM page migration on the way. (A different kind of KSM/NUMA issue which I've certainly not begun to address here: when KSM pages are unmerged, there's usually no sense in preferring to allocate the new pages local to the caller's node.) This patch: Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes which control merging pages across different numa nodes. When it is set to zero only pages from the same node are merged, otherwise pages from all nodes can be merged together (default behavior). Typical use-case could be a lot of KVM guests on NUMA machine and cpus from more distant nodes would have significant increase of access latency to the merged ksm page. Sysfs knob was choosen for higher variability when some users still prefers higher amount of saved physical memory regardless of access latency. Every numa node has its own stable & unstable trees because of faster searching and inserting. Changing of merge_across_nodes value is possible only when there are not any ksm shared pages in system. I've tested this patch on numa machines with 2, 4 and 8 nodes and measured speed of memory access inside of KVM guests with memory pinned to one of nodes with this benchmark: http://pholasek.fedorapeople.org/alloc_pg.c Population standard deviations of access times in percentage of average were following: merge_across_nodes=1 2 nodes 1.4% 4 nodes 1.6% 8 nodes 1.7% merge_across_nodes=0 2 nodes 1% 4 nodes 0.32% 8 nodes 0.018% RFC: https://lkml.org/lkml/2011/11/30/91 v1: https://lkml.org/lkml/2012/1/23/46 v2: https://lkml.org/lkml/2012/6/29/105 v3: https://lkml.org/lkml/2012/9/14/550 v4: https://lkml.org/lkml/2012/9/23/137 v5: https://lkml.org/lkml/2012/12/10/540 v6: https://lkml.org/lkml/2012/12/23/154 v7: https://lkml.org/lkml/2012/12/27/225 Hugh notes that this patch brings two problems, whose solution needs further support in mm/ksm.c, which follows in subsequent patches: 1) switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree; 2) memory hotremove may migrate KSM pages, but there is no provision here for !merge_across_nodes to migrate nodes to the proper tree. Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm/ksm.c: use new hashtable implementationSasha Levin
Switch ksm to use the new hashtable implementation. This reduces the amount of generic unrelated code in the ksm module. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm: reduce rmap overhead for ex-KSM page copies created on swap faultsJohannes Weiner
When ex-KSM pages are faulted from swap cache, the fault handler is not capable of re-establishing anon_vma-spanning KSM pages. In this case, a copy of the page is created instead, just like during a COW break. These freshly made copies are known to be exclusive to the faulting VMA and there is no reason to go look for this page in parent and sibling processes during rmap operations. Use page_add_new_anon_rmap() for these copies. This also puts them on the proper LRU lists and marks them SwapBacked, so we can get rid of doing this ad-hoc in the KSM copy code. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20ksm: make rmap walks more scalableHugh Dickins
The rmap walks in ksm.c are like those in rmap.c: they can safely be done with anon_vma_lock_read(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-16Merge tag 'balancenuma-v11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma Pull Automatic NUMA Balancing bare-bones from Mel Gorman: "There are three implementations for NUMA balancing, this tree (balancenuma), numacore which has been developed in tip/master and autonuma which is in aa.git. In almost all respects balancenuma is the dumbest of the three because its main impact is on the VM side with no attempt to be smart about scheduling. In the interest of getting the ball rolling, it would be desirable to see this much merged for 3.8 with the view to building scheduler smarts on top and adapting the VM where required for 3.9. The most recent set of comparisons available from different people are mel: https://lkml.org/lkml/2012/12/9/108 mingo: https://lkml.org/lkml/2012/12/7/331 tglx: https://lkml.org/lkml/2012/12/10/437 srikar: https://lkml.org/lkml/2012/12/10/397 The results are a mixed bag. In my own tests, balancenuma does reasonably well. It's dumb as rocks and does not regress against mainline. On the other hand, Ingo's tests shows that balancenuma is incapable of converging for this workloads driven by perf which is bad but is potentially explained by the lack of scheduler smarts. Thomas' results show balancenuma improves on mainline but falls far short of numacore or autonuma. Srikar's results indicate we all suffer on a large machine with imbalanced node sizes. My own testing showed that recent numacore results have improved dramatically, particularly in the last week but not universally. We've butted heads heavily on system CPU usage and high levels of migration even when it shows that overall performance is better. There are also cases where it regresses. Of interest is that for specjbb in some configurations it will regress for lower numbers of warehouses and show gains for higher numbers which is not reported by the tool by default and sometimes missed in treports. Recently I reported for numacore that the JVM was crashing with NullPointerExceptions but currently it's unclear what the source of this problem is. Initially I thought it was in how numacore batch handles PTEs but I'm no longer think this is the case. It's possible numacore is just able to trigger it due to higher rates of migration. These reports were quite late in the cycle so I/we would like to start with this tree as it contains much of the code we can agree on and has not changed significantly over the last 2-3 weeks." * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits) mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable mm/rmap: Convert the struct anon_vma::mutex to an rwsem mm: migrate: Account a transhuge page properly when rate limiting mm: numa: Account for failed allocations and isolations as migration failures mm: numa: Add THP migration for the NUMA working set scanning fault case build fix mm: numa: Add THP migration for the NUMA working set scanning fault case. mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG mm: sched: numa: Control enabling and disabling of NUMA balancing mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships mm: numa: migrate: Set last_nid on newly allocated page mm: numa: split_huge_page: Transfer last_nid on tail page mm: numa: Introduce last_nid to the page frame sched: numa: Slowly increase the scanning period as NUMA faults are handled mm: numa: Rate limit setting of pte_numa if node is saturated mm: numa: Rate limit the amount of memory that is migrated between nodes mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting mm: numa: Migrate pages handled during a pmd_numa hinting fault mm: numa: Migrate on reference policy ...
2012-12-12mm, oom: fix race when specifying a thread as the oom originDavid Rientjes
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>