path: root/mm
Age  Commit message  Author
2015-01-29  vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS  Linus Torvalds
The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-29  vm: add VM_FAULT_SIGSEGV handling support  Linus Torvalds
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. 
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
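The idea of tying a new fault return value to existing signal delivery code can be pictured with a small user-space model. This is only a sketch under stated assumptions: the flag values and the helper name `fault_to_signal()` are hypothetical and are not taken from the kernel headers or from any architecture's fault handler.

```c
#include <signal.h>

/* Illustrative flag values: assumptions for this sketch, not the kernel's definitions. */
#define SKETCH_FAULT_SIGBUS  0x02u
#define SKETCH_FAULT_SIGSEGV 0x40u

/*
 * Hypothetical helper: translate a fault result into the signal that an
 * architecture fault handler would deliver. Returns 0 if no signal is due.
 */
static int fault_to_signal(unsigned int fault)
{
	if (fault & SKETCH_FAULT_SIGSEGV)
		return SIGSEGV; /* e.g. failed stack expansion at the guard page */
	if (fault & SKETCH_FAULT_SIGBUS)
		return SIGBUS;  /* the only "signal" result the core VM could report before */
	return 0;
}
```

In this model, the bug described above corresponds to the guard page case being reported with the SIGBUS flag because no SIGSEGV flag existed.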
2015-01-26  mm/vmscan: fix highidx argument type  Michael S. Tsirkin
for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is passed gfp_t: mm/vmscan.c:2658:9: expected int enum zone_type [signed] highest_zoneidx mm/vmscan.c:2658:9: got restricted gfp_t [usertype] gfp_mask mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types) mm/vmscan.c:2658:9: expected int enum zone_type [signed] highest_zoneidx mm/vmscan.c:2658:9: got restricted gfp_t [usertype] gfp_mask convert argument to the correct type. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-26  memcg: remove extra newlines from memcg oom kill log  Greg Thelen
Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra newlines to memcg oom kill log messages. This makes dmesg hard to read and parse. The issue affects 3.15+. Example:
  Task in /t          <<< extra #1
   killed as a result of limit of /t
                      <<< extra #2
  memory: usage 102400kB, limit 102400kB, failcnt 274712
Remove the extra newlines from memcg oom kill messages, so the messages look like:
  Task in /t killed as a result of limit of /t
  memory: usage 102400kB, limit 102400kB, failcnt 240649
Fixes: e61734c55c24 ("cgroup: remove cgroup->name") Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-26  mm: page_alloc: embed OOM killing naturally into allocation slowpath  Johannes Weiner
The OOM killing invocation does a lot of duplicative checks against the task's allocation context. Rework it to take advantage of the existing checks in the allocator slowpath. The OOM killer is invoked when the allocator is unable to reclaim any pages but the allocation has to keep looping. Instead of having a check for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM invocation to the true branch of should_alloc_retry(). The __GFP_FS check from oom_gfp_allowed() can then be moved into the OOM avoidance branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test. __alloc_pages_may_oom() can then signal to the caller whether the OOM killer was invoked, instead of requiring it to duplicate the order and high_zoneidx checks to guess this when deciding whether to continue. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-13  mm: mmu_gather: use tlb->end != 0 only for TLB invalidation  Will Deacon
When batching up address ranges for TLB invalidation, we check tlb->end != 0 to indicate that some pages have actually been unmapped. As of commit f045bbb9fa1b ("mmu_gather: fix over-eager tlb_flush_mmu_free() calling"), we use the same check for freeing these pages in order to avoid a performance regression where we call free_pages_and_swap_cache even when no pages are actually queued up. Unfortunately, the range could have been reset (tlb->end = 0) by tlb_end_vma, which has been shown to cause memory leaks on arm64. Furthermore, investigation into these leaks revealed that the fullmm case on task exit no longer invalidates the TLB, by virtue of tlb->end == 0 (in 3.18, need_flush would have been set). This patch resolves the problem by reverting commit f045bbb9fa1b, using instead tlb->local.nr as the predicate for page freeing in tlb_flush_mmu_free and ensuring that tlb->end is initialised to a non-zero value in the fullmm case. Tested-by: Mark Langsdorf <mlangsdo@redhat.com> Tested-by: Dave Hansen <dave@sr71.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
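The difference between the two predicates can be illustrated with a toy model. The struct below is a simplified stand-in, not the kernel's `struct mmu_gather`, and all names ending in `_sketch` or starting with `should_free_` are hypothetical; `nr` plays the role of `tlb->local.nr` from the message.

```c
#include <stdbool.h>

/* Simplified stand-in for the gather state; only the fields the message discusses. */
struct tlb_sketch {
	unsigned long start, end; /* pending invalidation range; end == 0 means "empty" */
	unsigned int nr;          /* pages queued for freeing */
};

/* Models the range reset that tlb_end_vma() can trigger after a per-VMA flush. */
static void reset_range_sketch(struct tlb_sketch *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
}

/* Predicate in the reverted commit: free pages only if the range is non-empty. */
static bool should_free_by_range(const struct tlb_sketch *tlb)
{
	return tlb->end != 0;
}

/* Predicate after the fix: free pages whenever some are queued. */
static bool should_free_by_count(const struct tlb_sketch *tlb)
{
	return tlb->nr != 0;
}
```

With two pages queued and the range already reset, the range-based predicate says there is nothing to free, which is the leak described above, while the count-based predicate still frees them.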
2015-01-11  mm: fix corner case in anon_vma endless growing prevention  Konstantin Khlebnikov
Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"). anon_vma_clone() is usually called for a copy of the source vma in the destination argument. If the source vma has an anon_vma, it should already be in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's called from anon_vma_fork(). In this case anon_vma_clone() finds an anon_vma for reuse. vma_adjust() calls it differently and this breaks the anon_vma reusing logic: anon_vma_clone() links the vma to the old anon_vma and updates the degree counters, but vma_adjust() overrides vma->anon_vma right after that. As a result the final unlink_anon_vmas() decrements degree for the wrong anon_vma. This patch assigns ->anon_vma before calling anon_vma_clone(). Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com> Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com> Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu> Cc: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-11  mm: Don't count the stack guard page towards RLIMIT_STACK  Linus Torvalds
Commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") made sure that we return the error properly for stack growth conditions. It also theorized that counting the guard page towards the stack limit might break something, but also said "Let's see if anybody notices". Somebody did notice. Apparently android-x86 sets the stack limit very close to the limit indeed, and including the guard page in the rlimit check causes the android 'zygote' process problems. So this adds the (fairly trivial) code to make the stack rlimit check be against the actual real stack size, rather than the size of the vma that includes the guard page. Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Cc: Jay Foad <jay.foad@gmail.com> Cc: stable@kernel.org # to match back-porting of fee7e49d4514 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
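The arithmetic behind this change can be sketched in user space. The page size, the helper name `stack_within_limit()`, and the `count_guard` switch are assumptions made for illustration; the kernel's own check lives in its stack accounting code and is not reproduced here.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/*
 * Hypothetical helper: decide whether a stack vma of vma_size bytes
 * (which includes one guard page) fits under the stack rlimit.
 * count_guard = true models the behaviour after fee7e49d4514,
 * count_guard = false models the behaviour after this patch.
 */
static bool stack_within_limit(unsigned long vma_size, unsigned long rlim,
			       bool count_guard)
{
	unsigned long actual = vma_size;

	if (!count_guard && actual >= SKETCH_PAGE_SIZE)
		actual -= SKETCH_PAGE_SIZE; /* ignore the guard page */
	return actual <= rlim;
}
```

For a limit of eight pages and a vma of nine pages (eight pages of real stack plus the guard page), counting the guard page rejects the growth, while excluding it accepts it.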
2015-01-08  mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed  Vlastimil Babka
Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors:
1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die.
3. kswapd checks for throttled processes in prepare_kswapd_sleep():
     if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
             wake_up(&pgdat->pfmemalloc_wait);
             return false; // kswapd will not go to sleep
     }
   However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting.
4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system).
5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep.
So, the source of the problem is that we prevent kswapd from going to sleep as long as there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to make sure that no process is left sleeping on pfmemalloc_wait when kswapd itself goes to sleep. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue actually empties. To prevent processes from being left sleeping, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep.
This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Note that if any process puts itself in the queue after this waitqueue_active() check, or after the wake up itself, it means that the process will also wake up kswapd - and since we are under prepare_to_wait(), the wake up won't be missed. Also we update the comment prepare_kswapd_sleep() to hopefully more clearly describe the races it is preventing. Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
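The livelock and its fix can be imitated with a tiny state model. Everything here is hypothetical scaffolding (struct and function names included): it models only the fact that a killed waiter which cannot run never leaves the queue, and that the old check refused to sleep for as long as the queue was non-empty. It does not model scheduling or the balance checks that the real prepare_kswapd_sleep() goes on to perform.

```c
#include <stdbool.h>
#include <stddef.h>

struct waiter_sketch {
	bool queued;  /* still linked on the wait queue */
	bool can_run; /* false: killed task stuck behind kswapd, cannot dequeue itself */
};

static bool queue_active_sketch(const struct waiter_sketch *w, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].queued)
			return true;
	return false;
}

/* Model of a wake-up: only waiters that actually get to run leave the queue. */
static void wake_waiters_sketch(struct waiter_sketch *w, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].queued && w[i].can_run)
			w[i].queued = false;
}

/* Old logic: wake, then refuse to sleep because the queue was non-empty. */
static bool prepare_sleep_old(struct waiter_sketch *w, size_t n)
{
	if (queue_active_sketch(w, n)) {
		wake_waiters_sketch(w, n);
		return false;
	}
	return true;
}

/* New logic: wake everyone, then carry on (the balance checks are elided here). */
static bool prepare_sleep_new(struct waiter_sketch *w, size_t n)
{
	if (queue_active_sketch(w, n))
		wake_waiters_sketch(w, n);
	return true;
}
```

With a single waiter that is queued but cannot run, the old function returns false on every call, which is the busy loop from the report, while the new one lets kswapd proceed.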
2015-01-08  memcg: fix destination cgroup leak on task charges migration  Vladimir Davydov
We are supposed to take one css reference per each memory page and per each swap entry accounted to a memory cgroup. However, during task charges migration we take a reference to the destination cgroup twice per each swap entry: first in mem_cgroup_do_precharge()->try_charge() and then in mem_cgroup_move_swap_account(), permanently leaking the destination cgroup. The hunk taking the second reference seems to be a leftover from the pre-00501b531c472 ("mm: memcontrol: rewrite charge API") era. Remove it to fix the leak. Fixes: e8ea14cc6ead (mm: memcontrol: take a css reference for each charged page) Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
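A minimal reference-counting model shows why the second reference never goes away. The struct and function names below are hypothetical and only mirror the roles named in the message (precharge, swap account move, later uncharge); they are not the memcg API.

```c
/* Toy stand-in for a css reference counter. */
struct css_sketch {
	long refcnt;
};

/*
 * Hypothetical walk through one swap entry's lifetime during charge
 * migration. extra_get != 0 models the leftover hunk that took a second
 * reference when moving the swap account. Returns the references still
 * held after the entry has been uncharged.
 */
static long refs_left_after_migration(int extra_get)
{
	struct css_sketch to = { 0 };

	to.refcnt++;         /* precharge path takes the one expected reference */
	if (extra_get)
		to.refcnt++; /* leftover second reference in the move step */
	to.refcnt--;         /* uncharging the swap entry drops exactly one */
	return to.refcnt;
}
```

A non-zero result means the destination cgroup can never be released, which is the leak the patch removes.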
2015-01-08  mm: memcontrol: switch soft limit default back to infinity  Johannes Weiner
Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") accidentally switched the soft limit default from infinity to zero, which turns all memcgs with even a single page into soft limit excessors and engages soft limit reclaim on all of them during global memory pressure. This makes global reclaim generally more aggressive, but also inverts the meaning of existing soft limit configurations where unset soft limits are usually more generous than set ones. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
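Why a default of zero turns every memcg into an "excessor" follows from the excess calculation, sketched below. The helper name and the `SKETCH_LIMIT_INFINITY` constant are assumptions for illustration, not the kernel's identifiers.

```c
#include <limits.h>

#define SKETCH_LIMIT_INFINITY ULONG_MAX /* assumed stand-in for "no soft limit" */

/* Pages by which usage exceeds the soft limit; 0 if it is within the limit. */
static unsigned long soft_limit_excess_sketch(unsigned long usage_pages,
					      unsigned long soft_limit_pages)
{
	if (usage_pages > soft_limit_pages)
		return usage_pages - soft_limit_pages;
	return 0;
}
```

With a soft limit of zero, a group holding a single page already has an excess of one page and becomes a soft limit reclaim target; with an infinite default the excess stays zero.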
2015-01-08  mm/debug_pagealloc: remove obsolete Kconfig options  Joonsoo Kim
These are obsolete since commit e30825f1869a ("mm/debug-pagealloc: prepare boottime configurable") was merged. So remove them. [pebolle@tiscali.nl: find obsolete Kconfig options] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08  mm: protect set_page_dirty() from ongoing truncation  Johannes Weiner
Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page:
  __set_page_dirty_nobuffers()       __delete_from_page_cache()
    if (TestSetPageDirty(page))
                                       page->mapping = NULL
                                       if (PageDirty())
                                         dec_zone_page_state(page, NR_FILE_DIRTY);
                                         dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      if (page->mapping)
        account_page_dirtied(page)
          __inc_zone_page_state(page, NR_FILE_DIRTY);
          __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08  mm: prevent endless growth of anon_vma hierarchy  Konstantin Khlebnikov
A constantly forking task causes unlimited growth of the anon_vma chain. Each next child allocates a new level of anon_vmas and links vmas to all previous levels because pages might be inherited from any level. This patch adds a heuristic which decides to reuse an existing anon_vma instead of forking a new one. It adds a counter anon_vma->degree which counts linked vmas and directly descending anon_vmas, and reuses an anon_vma if the counter is lower than two. As a result each anon_vma has either a vma or at least two descending anon_vmas. In such trees half of the nodes are leaves with live vmas, thus the count of anon_vmas is no more than two times bigger than the count of vmas. This heuristic reuses anon_vmas as rarely as possible because each reuse adds false aliasing among vmas, and the rmap walker has to scan more ptes when it searches for where a page might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: Michal Hocko <mhocko@suse.cz> Tested-by: Jerome Marchand <jmarchan@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-06  mm: propagate error from stack expansion even for guard page  Linus Torvalds
Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn means that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-29  mm: get rid of radix tree gfp mask for pagecache_get_page  Michal Hocko
Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-22  Revert "mm/memory.c: share the i_mmap_rwsem"  Kirill A. Shutemov
This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946. There are several bug reports [1][2] which point to this commit as a potential cause [3]. Let's revert it until we figure out what's going on. [1] https://lkml.org/lkml/2014/11/14/342 [2] https://lkml.org/lkml/2014/12/22/213 [3] https://lkml.org/lkml/2014/12/9/741 Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-21  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux  Linus Torvalds
Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger: "kernel: Provide READ_ONCE and ASSIGN_ONCE As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com ACCESS_ONCE might fail with specific compilers for non-scalar accesses. Here is a set of patches to tackle that problem. The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data structure is larger than the machine word size, memcpy is used and a warning is emitted. The next patches fix up several in-tree users of ACCESS_ONCE on non-scalar types. This does not yet contain a patch that forces ACCESS_ONCE to work only on scalar types. This is targeted for the next merge window as Linux next already contains new offenders regarding ACCESS_ONCE vs. non-scalar types"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
  s390/kvm: REPLACE barrier fixup with READ_ONCE
  arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
  arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
  mips/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
  mm: replace ACCESS_ONCE with READ_ONCE or barriers
  kernel: Provide READ_ONCE and ASSIGN_ONCE
2014-12-20  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  Linus Torvalds
Pull vfs pile #3 from Al Viro: "Assorted fixes and patches from the last cycle"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [regression] chunk lost from bd9b51
  vfs: make mounts and mountstats honor root dir like mountinfo does
  vfs: cleanup show_mountinfo
  init: fix read-write root mount
  unfuck binfmt_misc.c (broken by commit e6084d4)
  vm_area_operations: kill ->migrate()
  new helper: iter_is_iovec()
  move_extent_per_page(): get rid of unused w_flags
  lustre: get rid of playing with ->fs
  btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
2014-12-19  mm/zsmalloc: adjust order of functions  Ganesh Mahendran
Currently the functions in zsmalloc.c are not arranged in a readable and reasonable sequence. As more and more functions are added, we may run into the inconvenience below. For example, with the current functions:
  void zs_init() { }
  static void get_maxobj_per_zspage() { }
Then I want to add a func_1() which is called from zs_init(), and this newly added function func_1() will use get_maxobj_per_zspage(), which is defined below zs_init():
  void func_1() { get_maxobj_per_zspage() }
  void zs_init() { func_1() }
  static void get_maxobj_per_zspage() { }
This will cause a compile error. So we must add a declaration:
  static void get_maxobj_per_zspage();
before func_1() if we do not put get_maxobj_per_zspage() before func_1(). In addition, putting the module_[init|exit] functions at the bottom of the file conforms to our habit. So, this patch adjusts the function sequence as:
  /* helper functions */
  ...
  obj_location_to_handle()
  ...
  /* Some exported functions */
  ...
  zs_map_object()
  zs_unmap_object()
  zs_malloc()
  zs_free()
  zs_init()
  zs_exit()
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
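The ordering problem from this message can be reproduced in a few lines of standalone C. This is a sketch, not zsmalloc code: the function bodies, the `int` return types, the `_sketch` suffix, and the placeholder value 42 are assumptions added so that the example has an observable result.

```c
/*
 * Forward declaration: required because func_1() calls the helper
 * before the helper's definition appears in the file.
 */
static int get_maxobj_per_zspage(void);

/* Newly added function that uses a helper defined further down. */
static int func_1(void)
{
	return get_maxobj_per_zspage();
}

/* Stand-in for the init function that calls the new helper chain. */
int zs_init_sketch(void)
{
	return func_1();
}

/* Definition placed after its first use, as in the "before" layout. */
static int get_maxobj_per_zspage(void)
{
	return 42; /* placeholder value for this sketch */
}
```

Deleting the forward declaration makes a C99 or later compiler reject the call in func_1(); moving the helper above its callers, as the patch does for the whole file, makes the declaration unnecessary.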
2014-12-19  mm/memory.c:do_shared_fault(): add comment  Andrew Morton
Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce do_shared_fault() and drop do_fault()"). Cc: Andi Kleen <ak@linux.intel.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-19  mm: cma: split cma-reserved in dmesg log  Pintu Kumar
When the system boots up, in the dmesg logs we can see the memory statistics along with total reserved as below. Memory: 458840k/458840k available, 65448k reserved, 0K highmem When CMA is enabled, still the total reserved memory remains the same. However, the CMA memory is not considered as reserved. But, when we see /proc/meminfo, the CMA memory is part of free memory. This creates confusion. This patch corrects the problem by properly subtracting the CMA reserved memory from the total reserved memory in dmesg logs. Below is the dmesg snapshot from an arm based device with 512MB RAM and 12MB single CMA region. Before this change: Memory: 458840k/458840k available, 65448k reserved, 0K highmem After this change: Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
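The before and after figures in this message are related by a single subtraction, sketched here. The helper name is hypothetical; the numbers come from the dmesg lines quoted above (12 MB of CMA is 12288 kB).

```c
/*
 * Hypothetical helper: split the total reserved figure into the part that
 * is not CMA. Returns 0 rather than wrapping if the inputs are inconsistent.
 */
static unsigned long non_cma_reserved_kb(unsigned long total_reserved_kb,
					 unsigned long cma_reserved_kb)
{
	if (cma_reserved_kb > total_reserved_kb)
		return 0;
	return total_reserved_kb - cma_reserved_kb;
}
```

Applied to the example device, 65448 kB of total reserved memory minus the 12288 kB CMA region leaves the 53160 kB shown as "reserved" after the change.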
2014-12-19  mm/mempolicy.c: remove unnecessary is_valid_nodemask()  Zhihui Zhang
When nodes is true, nsc->mask2 has already been filtered by nsc->mask1, which has already factored in node_states[N_MEMORY]. Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18  mm: replace ACCESS_ONCE with READ_ONCE or barriers  Christian Borntraeger
ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145). Let's change the code to access the page table elements with READ_ONCE, which does implicit scalar accesses, for the gup code. mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t as an array of longs. This code requires just that the pmd_present and pmd_trans_huge checks are done on the same value, so a barrier is sufficient. A similar case is in handle_pte_fault. On ppc44x the word size is 32 bit, but a pte is 64 bit. A barrier is ok as well. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: linux-mm@kvack.org Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
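The "implicit scalar accesses" idea can be sketched in user space as a size switch over volatile scalar loads, with a plain copy as the fallback for larger objects. This is an illustration under assumptions, not the kernel's READ_ONCE: the names carry a `_sketch` suffix, the macro relies on the GCC/Clang extensions `__typeof__` and statement expressions, and the fallback gives no single-access guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Load `size` bytes from p into res, using one volatile scalar load when possible. */
static inline void read_once_size_sketch(const volatile void *p, void *res,
					 size_t size)
{
	switch (size) {
	case 1: { uint8_t v  = *(const volatile uint8_t *)p;  memcpy(res, &v, 1); break; }
	case 2: { uint16_t v = *(const volatile uint16_t *)p; memcpy(res, &v, 2); break; }
	case 4: { uint32_t v = *(const volatile uint32_t *)p; memcpy(res, &v, 4); break; }
	case 8: { uint64_t v = *(const volatile uint64_t *)p; memcpy(res, &v, 8); break; }
	default:
		/* Larger than a scalar: plain copy, may be split into several accesses. */
		memcpy(res, (const void *)p, size);
	}
}

/* GCC/Clang only: evaluates to a copy of x obtained through the helper above. */
#define READ_ONCE_SKETCH(x) \
	({ __typeof__(x) val_; \
	   read_once_size_sketch(&(x), &val_, sizeof(val_)); \
	   val_; })
```

For types such as a pmd_t defined as an array of longs, a helper like this would land in the fallback branch, which is why the message settles for a barrier in those places.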
2014-12-17  mmu_gather: fix over-eager tlb_flush_mmu_free() calling  Linus Torvalds
Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal range calculations into generic code") caused a performance problem: "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch applied, but does not show up at all on the commit before" and the reason is that Will moved the test for whether we need to flush from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that tlb_flush_mmu_free() basically lost that check. Move it back into tlb_flush_mmu() where it belongs, so that it covers both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free(). Reported-and-tested-by: Dave Hansen <dave@sr71.net> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17  vm_area_operations: kill ->migrate()  Al Viro
The only instance this method has ever grown was one in kernfs, one that calls ->migrate() of another vm_ops if it exists. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17  new helper: iter_is_iovec()  Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-16  Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux  Linus Torvalds
Pull nfsd updates from Bruce Fields: "A comparatively quieter cycle for nfsd this time, but still with two larger changes: - RPC server scalability improvements from Jeff Layton (using RCU instead of a spinlock to find idle threads). - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna Schumaker, enabling fallocate on new clients" * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd4: fix xdr4 count of server in fs_location4 nfsd4: fix xdr4 inclusion of escaped char sunrpc/cache: convert to use string_escape_str() sunrpc: only call test_bit once in svc_xprt_received fs: nfsd: Fix signedness bug in compare_blob sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt sunrpc: convert to lockless lookup of queued server threads sunrpc: fix potential races in pool_stats collection sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it sunrpc: require svc_create callers to pass in meaningful shutdown routine sunrpc: have svc_wake_up only deal with pool 0 sunrpc: convert sp_task_pending flag to use atomic bitops sunrpc: move rq_cachetype field to better optimize space sunrpc: move rq_splice_ok flag into rq_flags sunrpc: move rq_dropme flag into rq_flags sunrpc: move rq_usedeferral flag to rq_flags sunrpc: move rq_local field to rq_flags sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it nfsd: minor off by one checks in __write_versions() sunrpc: release svc_pool_map reference when serv allocation fails ...
2014-12-15  Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux  Linus Torvalds
Pull drm updates from Dave Airlie: "Highlights: - AMD KFD driver merge This is the AMD HSA interface for exposing a lowlevel interface for GPGPU use. They have an open source userspace built on top of this interface, and the code looks as good as it was going to get out of tree. - Initial atomic modesetting work The need for an atomic modesetting interface to allow userspace to try and send a complete set of modesetting state to the driver has arisen, and been suffering from neglect this past year. No more, the start of the common code and changes for msm driver to use it are in this tree. Ongoing work to get the userspace ioctl finished and the code clean will probably wait until next kernel. - DisplayID 1.3 and tiled monitor exposed to userspace. Tiled monitor property is now exposed for userspace to make use of. - Rockchip drm driver merged. - imx gpu driver moved out of staging Other stuff: - core: panel - MIPI DSI + new panels. expose suggested x/y properties for virtual GPUs - i915: Initial Skylake (SKL) support gen3/4 reset work start of dri1/ums removal infoframe tracking fixes for lots of things. - nouveau: tegra k1 voltage support GM204 modesetting support GT21x memory reclocking work - radeon: CI dpm fixes GPUVM improvements Initial DPM fan control - rcar-du: HDMI support added removed some support for old boards slave encoder driver for Analog Devices adv7511 - exynos: Exynos4415 SoC support - msm: a4xx gpu support atomic helper conversion - tegra: iommu support universal plane support ganged-mode DSI support - sti: HDMI i2c improvements - vmwgfx: some late fixes. 
- qxl: use suggested x/y properties" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits) drm: sti: fix module compilation issue drm/i915: save/restore GMBUS freq across suspend/resume on gen4 drm: sti: correctly cleanup CRTC and planes drm: sti: add HQVDP plane drm: sti: add cursor plane drm: sti: enable auxiliary CRTC drm: sti: fix delay in VTG programming drm: sti: prepare sti_tvout to support auxiliary crtc drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off} drm: sti: fix hdmi avi infoframe drm: sti: remove event lock while disabling vblank drm: sti: simplify gdp code drm: sti: clear all mixer control drm: sti: remove gpio for HDMI hot plug detection drm: sti: allow to change hdmi ddc i2c adapter drm/doc: Document drm_add_modes_noedid() usage drm/i915: Remove '& 0xffff' from the mask given to WA_REG() drm/i915: Invert the mask and val arguments in wa_add() and WA_REG() drm: Zero out DRM object memory upon cleanup drm/i915/bdw: Fix the write setting up the WIZ hashing mode ...
2014-12-14Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds
Pull aio updates from Benjamin LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: Skip timer for io_getevents if timeout=0 aio: Make it possible to remap aio ring
2014-12-13aio: Make it possible to remap aio ringPavel Emelyanov
There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13mm/cma: make kmemleak ignore CMA regionsThierry Reding
kmemleak will add allocations as objects to a pool. The memory allocated for each object in this pool is periodically searched for pointers to other allocated objects. This only works for memory that is mapped into the kernel's virtual address space, which happens not to be the case for most CMA regions. Furthermore, CMA regions are typically used to store data transferred to or from a device and therefore don't contain pointers to other objects. Without this, the kernel crashes on the first execution of the scan_gray_list() because it tries to access highmem. Perhaps a more appropriate fix would be to reject any object that can't map to a kernel virtual address? [akpm@linux-foundation.org: add comment] [akpm@linux-foundation.org: fix comment, per Catalin] [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()] Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13slub: fix cpuset check in get_any_partialVladimir Davydov
If we fail to allocate from the current node's stock, we look for free objects on other nodes before calling the page allocator (see get_any_partial). While checking other nodes we respect cpuset constraints by calling cpuset_zone_allowed. We enforce hardwall check. As a result, we will fallback to the page allocator even if there are some pages cached on other nodes, but the current cpuset doesn't have them set. However, the page allocator uses softwall check for kernel allocations, so it may allocate from one of the other nodes in this case. Therefore we should use softwall cpuset check in get_any_partial to conform with the cpuset check in the page allocator. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13slab: fix cpuset check in fallback_allocVladimir Davydov
fallback_alloc is called on kmalloc if the preferred node doesn't have free or partial slabs and there's no pages on the node's free list (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries to locate a free or partial slab on other allowed nodes' lists. While iterating over the preferred node's zonelist it skips those zones which hardwall cpuset check returns false for. That means that for a task bound to a specific node using cpusets fallback_alloc will always ignore free slabs on other nodes and go directly to the reclaimer, which, however, may allocate from other nodes if cpuset.mem_hardwall is unset (default). As a result, we may get lists of free slabs grow without bounds on other nodes, which is bad, because inactive slabs are only evicted by cache_reap at a very slow rate and cannot be dropped forcefully. To reproduce the issue, run a process that will walk over a directory tree with lots of files inside a cpuset bound to a node that constantly experiences memory pressure. Look at num_slabs vs active_slabs growth as reported by /proc/slabinfo. To avoid this we should use softwall cpuset check in fallback_alloc. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zbud: init user ops only when it is neededHeesub Shin
When zbud is initialized through the zpool wrapper, pool->ops which points to user-defined operations is always set regardless of whether it is specified from the upper layer. This causes zbud_reclaim_page() to iterate its loop for evicting pool pages out without any gain. This patch sets the user-defined ops only when it is needed, so that zbud_reclaim_page() can bail out the reclamation loop earlier if there is no user-defined operations specified. Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Sunae Seo <sunae.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zswap: delete unnecessary check before calling free_percpu()Markus Elfring
free_percpu() tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zswap: add __init to some functions in zswapMahendran Ganesh
zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create is only called by __init init_zswap() Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zsmalloc: allocate exactly size of struct zs_poolGanesh Mahendran
In zs_create_pool(), we allocate more memory than sizeof(struct zs_pool): ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); This patch allocates exactly the size needed. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
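The difference between the page-rounded size and the exact size can be illustrated with a small userspace sketch. Everything named `demo_*` or `DEMO_*` here is hypothetical: `struct demo_pool` is only a stand-in for struct zs_pool (whose real layout is different), and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SIZE depends on the architecture. */
#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct zs_pool; the real structure differs. */
struct demo_pool {
	void *size_class[255];
	unsigned long flags;
};

/* Round x up to the next multiple of y, the same idea as the kernel's roundup(). */
static size_t demo_roundup(size_t x, size_t y)
{
	assert(y > 0);
	return ((x + y - 1) / y) * y;
}

/* Bytes requested when the struct size is rounded up to whole pages. */
static size_t rounded_alloc_size(void)
{
	return demo_roundup(sizeof(struct demo_pool), DEMO_PAGE_SIZE);
}

/* Bytes requested when only the struct itself is allocated. */
static size_t exact_alloc_size(void)
{
	return sizeof(struct demo_pool);
}
```

With this stand-in, the rounded request is always a whole number of pages, while the exact request is just the structure size; the gap between the two is memory the pool never uses.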
2014-12-13mm/zsmalloc: avoid duplicate assignment of prev_classGanesh Mahendran
In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times, yet prev_class only refers to the previous size_class, so most of these assignments are unnecessary. This patch assigns *prev_class* when a new size_class structure is allocated and uses prev_class to check whether the first class has been allocated. [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES] Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZEMahendran Ganesh
I sent a patch [1] for an unnecessary check in zsmalloc. And Minchan Kim found that zsmalloc does not even support allocating an obj with the size of ZS_MAX_ALLOC_SIZE in some situations. For example: In a system with 64KB PAGE_SIZE and 32 bits of physical addr. Then: ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by: MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) ZS_MAX_ALLOC_SIZE is 64KB (in current code, is PAGE_SIZE) ZS_SIZE_CLASS_DELTA is 256 bytes So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1 = 256 In zs_create_pool(), the max size obj which can be allocated will be: ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312 We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT allocate objs with size ZS_MAX_ALLOC_SIZE (65536), which we promise upper users we can do. [1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html This patch fixes this issue by dynamically calculating zs_size_classes when the module is loaded, and allocates a buffer with size ZS_MAX_ALLOC_SIZE. Then the max obj (size is ZS_MAX_ALLOC_SIZE) can be stored in it. [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability] Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
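The arithmetic in this message can be checked with a short sketch that uses the quoted 64KB-page values. The `DEMO_*` names are hypothetical, and both the "add one class when the division leaves a remainder" step and the clamp to the maximum size are assumptions about how the count could be rounded up, not a copy of the patch.

```c
#include <assert.h>

/* Values quoted in the commit message for a 64KB PAGE_SIZE system. */
#define DEMO_MIN_ALLOC   32u
#define DEMO_MAX_ALLOC   65536u
#define DEMO_CLASS_DELTA 256u

/* Class count from the static formula: truncating division plus one. */
static unsigned int classes_truncated(void)
{
	return (DEMO_MAX_ALLOC - DEMO_MIN_ALLOC) / DEMO_CLASS_DELTA + 1;
}

/* Assumed alternative: one extra class when the division leaves a remainder. */
static unsigned int classes_rounded_up(void)
{
	unsigned int nr = classes_truncated();

	if ((DEMO_MAX_ALLOC - DEMO_MIN_ALLOC) % DEMO_CLASS_DELTA)
		nr += 1;
	return nr;
}

/* Object size served by class index i, clamped to the maximum (assumption). */
static unsigned int class_size(unsigned int i)
{
	unsigned int size;

	assert(i < classes_rounded_up());
	size = DEMO_MIN_ALLOC + i * DEMO_CLASS_DELTA;
	return size > DEMO_MAX_ALLOC ? DEMO_MAX_ALLOC : size;
}
```

With 256 classes the largest index is 255, which serves 65312-byte objects and falls short of 65536. With one extra class, index 256 can serve the full 65536 bytes.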
2014-12-13zsmalloc: correct fragile [kmap|kunmap]_atomic useMinchan Kim
kunmap_atomic should be given the virtual address returned by kmap_atomic. However, some pieces of code in zsmalloc pass kunmap_atomic a modified address rather than the one returned by kmap_atomic. This happens to work because zsmalloc only modifies the address within the PAGE_SIZE boundary, which the current kmap_atomic implementation tolerates. But it is still fragile if kmap_atomic ever changes, so let's correct it. I hit a subtle bug when I implemented a new zsmalloc feature (compaction) due to mishandling a link (the link crossed a page boundary). Although it was totally my mistake, it took a while to find the cause because an unpredictable kmapped address was unmapped, causing an almost random crash. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13zsmalloc: fix zs_init cpu notifier error handlingSergey Senozhatsky
Mahendran Ganesh reported that zpool-enabled zsmalloc should not call zpool_unregister_driver() from zs_init() if cpu notifier registration has failed, because error handling is performed before we register the driver via zpool_register_driver() call. Factor out cpu notifier registration and unregistration code and fix zs_init() error handling. link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html [akpm@linux-foundation.org: squash bogus gcc warning] [akpm@linux-foundation.org: use __init and __exit] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13zsmalloc: merge size_class to reduce fragmentationJoonsoo Kim
zsmalloc has many size_classes to reduce fragmentation, and they come in 16-byte steps (16, 32, 48, etc.) if PAGE_SIZE is 4096. zsmalloc also has the constraint that each zspage has at most 4 pages. In this situation we can see an interesting aspect. Consider the size_classes for 1488, 1472, ..., 1376. To prevent external fragmentation, they use 4 pages per zspage, so each of them can contain at most 11 objects:
16384 (4096 * 4) = 1488 * 11 + remains
16384 (4096 * 4) = 1472 * 11 + remains
16384 (4096 * 4) = ...
16384 (4096 * 4) = 1376 * 11 + remains
This means that they have the same characteristics and that distinguishing between them isn't needed. If we use one size_class for them, we can reduce fragmentation and save some memory. Since both the 1488 and 1472 sized classes can only fit 11 objects into 4 pages, and an object that's 1472 bytes can fit into an object that's 1488 bytes, merging these classes to always use objects that are 1488 bytes will reduce the total number of size classes. And reducing the total number of size classes reduces overall fragmentation, because a wider range of compressed pages can fit into a single size class, leaving fewer unused objects in each size class.
For this purpose, this patch implements size_class merging. If a size_class has the same pages_per_zspage and the same number of objects per zspage as the previous size_class, we don't create a new size_class. Instead, we use the previous size_class with the same characteristics. This way, the example sizes above (1488, 1472, ..., 1376) use just one size_class, so we get much better memory utilization.
Below is the result of my simple test.
TEST ENV: EXT4 on zram, mount with discard option
WORKLOAD: untar kernel source code, remove directory in descending order in size (drivers arch fs sound include net Documentation firmware kernel tools)
Each line represents orig_data_size, compr_data_size, mem_used_total, fragmentation overhead (mem_used - compr_data_size) and overhead ratio (overhead to compr_data_size), respectively, after the untar and remove operations are executed.
* untar-nomerge.out
orig_size   compr_size  used_size   overhead  overhead_ratio
525.88MB    199.16MB    210.23MB    11.08MB    5.56%
288.32MB     97.43MB    105.63MB     8.20MB    8.41%
177.32MB     61.12MB     69.40MB     8.28MB   13.55%
146.47MB     47.32MB     56.10MB     8.78MB   18.55%
124.16MB     38.85MB     48.41MB     9.55MB   24.58%
103.93MB     31.68MB     40.93MB     9.25MB   29.21%
 84.34MB     22.86MB     32.72MB     9.86MB   43.13%
 66.87MB     14.83MB     23.83MB     9.00MB   60.70%
 60.67MB     11.11MB     18.60MB     7.49MB   67.48%
 55.86MB      8.83MB     16.61MB     7.77MB   88.03%
 53.32MB      8.01MB     15.32MB     7.31MB   91.24%
* untar-merge.out
orig_size   compr_size  used_size   overhead  overhead_ratio
526.23MB    199.18MB    209.81MB    10.64MB    5.34%
288.68MB     97.45MB    104.08MB     6.63MB    6.80%
177.68MB     61.14MB     66.93MB     5.79MB    9.47%
146.83MB     47.34MB     52.79MB     5.45MB   11.51%
124.52MB     38.87MB     44.30MB     5.43MB   13.96%
104.29MB     31.70MB     36.83MB     5.13MB   16.19%
 84.70MB     22.88MB     27.92MB     5.04MB   22.04%
 67.11MB     14.83MB     19.26MB     4.43MB   29.86%
 60.82MB     11.10MB     14.90MB     3.79MB   34.17%
 55.90MB      8.82MB     12.61MB     3.79MB   42.97%
 53.32MB      8.01MB     11.73MB     3.73MB   46.53%
As the results above show, the merged version has better utilization (overhead ratio, 5th column) and uses less memory (mem_used_total, 3rd column). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Luigi Semenzato <semenzato@google.com> Cc: <juno.choi@lge.com> Cc: "seungho1.park" <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
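The merging criterion described in this message (same pages per zspage, same number of objects per zspage) can be sketched in plain C. This is a simplified illustration rather than the zsmalloc implementation: the function names are hypothetical, PAGE_SIZE is assumed to be 4096, and the number of pages per zspage is passed in directly instead of being chosen per class.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as in the example from the commit message. */
#define DEMO_PAGE_SIZE 4096u

/* Number of whole objects of the given size that fit into one zspage. */
static unsigned int objs_per_zspage(unsigned int size, unsigned int pages)
{
	assert(size > 0);
	return pages * DEMO_PAGE_SIZE / size;
}

/* Two classes can share a size_class when both characteristics match. */
static int can_merge(unsigned int size_a, unsigned int pages_a,
		     unsigned int size_b, unsigned int pages_b)
{
	return pages_a == pages_b &&
	       objs_per_zspage(size_a, pages_a) == objs_per_zspage(size_b, pages_b);
}

/*
 * Count the distinct classes left after merging neighbours in
 * [min, max] (step delta), assuming every class uses `pages` pages.
 */
static unsigned int count_merged_classes(unsigned int min, unsigned int max,
					 unsigned int delta, unsigned int pages)
{
	unsigned int count = 0, prev_objs = 0, size;

	for (size = min; size <= max; size += delta) {
		unsigned int objs = objs_per_zspage(size, pages);

		if (objs != prev_objs)
			count++;	/* characteristics changed: a new class is needed */
		prev_objs = objs;
	}
	return count;
}
```

Under these assumptions the eight sizes from 1376 to 1488 all hold 11 objects in a 4-page zspage and collapse into one class, while 1360 (12 objects) and 1504 (10 objects) stay separate.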
2014-12-13mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate()Rickard Strandqvist
Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON() to the beginning of memcg_stat_show(). This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13memcg: fix possible use-after-free in memcg_kmem_get_cache()Vladimir Davydov
Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed:
CPU0                                    CPU1
----                                    ----
[ current=@t
  @mc->memcg_params->nr_pages=0 ]
kmem_cache_alloc(@c):
  call memcg_kmem_get_cache(@c);
  proceed to allocation from @mc:
    alloc a page for @mc:
      ...
                                        move @t from @memcg
                                        destroy @memcg:
                                          mem_cgroup_css_offline(@memcg):
                                            memcg_unregister_all_caches(@memcg):
                                              kmem_cache_destroy(@mc)
    add page to @mc
We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/memcontrol.c: fix defined but not used compiler warningMichele Curti
test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so move it into the compiler if statement [akpm@linux-foundation.org: clean up layout] Signed-off-by: Michele Curti <michele.curti@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pagesMel Gorman
A random seek IO benchmark appeared to regress because of a change to readahead but the real problem was the benchmark. To ensure the IO request accessed disk, it used fadvise(FADV_DONTNEED) on a block boundary (512K) but the hint is ignored by the kernel. This is correct but not necessarily obvious behaviour. As much as I dislike comment patches, the explanation for this behaviour predates current git history. Clarify why it behaves like this in case someone "fixes" fadvise or readahead for the wrong reasons. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
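The partial-page behaviour can be pictured as rounding the requested byte range inward to page boundaries: only pages lying entirely inside the range are candidates for discarding. The sketch below is an illustration under stated assumptions, not the kernel's code; `whole_pages()`, `struct page_range` and the 4 KiB page size are all hypothetical.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. */
#define DEMO_PAGE_SIZE 4096ULL

/* First page index and number of pages fully covered by a byte range. */
struct page_range {
	unsigned long long first;
	unsigned long long count;
};

/* Pages that lie entirely inside [offset, offset + len). */
static struct page_range whole_pages(unsigned long long offset,
				     unsigned long long len)
{
	struct page_range r = { 0, 0 };
	unsigned long long start, end;

	assert(offset + len >= offset);	/* no wrap-around */

	/* Round the start up: a partially covered first page is kept. */
	start = (offset + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
	/* Round the end down: a partially covered last page is kept. */
	end = (offset + len) / DEMO_PAGE_SIZE;

	r.first = start;
	if (end > start)
		r.count = end - start;
	return r;
}
```

A caller that wants a range to be dropped from the page cache would therefore align both the offset and the length to the page size before issuing the hint; a 4096-byte request starting at offset 512 covers no whole page at all.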
2014-12-13mm/vmalloc.c: fix memory ordering bugDmitry Vyukov
Read memory barriers must follow the read operations. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
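The ordering rule can be illustrated in userspace with C11 atomics: the read-side fence is placed after the load of the flag and before the read of the data it guards. This is only an analogy for the principle, with hypothetical names (`ready`, `payload`, `publish`, `consume`), and it does not reproduce the code in mm/vmalloc.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int ready;	/* flag published by the writer */
static int payload;		/* data guarded by the flag */

/* Writer: store the data, then publish the flag behind a release fence. */
static void publish(int value)
{
	payload = value;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out once the data has been published. */
static int consume(int *out)
{
	assert(out != NULL);

	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return 0;
	/*
	 * The fence follows the flag read. Placed before that load, it
	 * would not order the flag read against the payload read below.
	 */
	atomic_thread_fence(memory_order_acquire);
	*out = payload;
	return 1;
}
```

The acquire fence here plays the role that a read barrier plays in the kernel: it only orders the reads that precede it against the reads that follow it, so it has to sit between the two.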
2014-12-13oom: kill the insufficient and no longer needed PT_TRACE_EXIT checkOleg Nesterov
After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was "frozen" by ptrace, but it doesn't really work. If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13oom: don't assume that a coredumping thread will exit soonOleg Nesterov
oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>